Distributed Simulations with Scythe

Scythe is a lightweight framework for running embarrassingly parallel experiments at scale via the Hatchet distributed task queue. It handles artifact management, S3 storage, and result collection so you can focus on the simulation logic.

By combining idfkit for EnergyPlus model manipulation and simulation with Scythe for distributed orchestration, you can run large parametric studies across hundreds or thousands of building variants without writing your own queuing or storage infrastructure.

Note

Scythe is an independent project in early development. See the Scythe documentation for the latest API details and setup instructions.

Prerequisites

Install both packages:

pip install idfkit scythe-engine

Workers need EnergyPlus installed. The NREL Docker images are a convenient base for containerized deployments.

You also need a running Hatchet instance (self-hosted or cloud) and an S3-compatible bucket for artifacts. See hatchet-sst for a self-hosting guide.

How It Works

Scythe follows a scatter-gather pattern:

  1. Define input and output schemas as Pydantic models
  2. Register an experiment function that maps one input to one output
  3. Allocate a batch of input specs; Scythe uploads artifacts and enqueues tasks
  4. Workers pull tasks from the Hatchet queue and execute them
  5. Gather results from S3 as organized Parquet files

idfkit fits into step 2: the registered experiment function uses idfkit to load an IDF, apply parameter changes, run the EnergyPlus simulation, and extract results.

Step 1: Define Input/Output Schemas

Schemas inherit from Scythe's ExperimentInputSpec and ExperimentOutputSpec. Use FileReference fields for files that Scythe should manage (upload to / download from S3 automatically).

from typing import Literal

from pydantic import Field
from scythe.base import ExperimentInputSpec, ExperimentOutputSpec
from scythe.utils.filesys import FileReference


class BuildingSimInput(ExperimentInputSpec):
    """Input specification for a parametric building energy study."""

    r_value: float = Field(description="Wall insulation R-value [m2K/W]", ge=0, le=15)
    lpd: float = Field(description="Lighting power density [W/m2]", ge=0, le=20)
    setpoint: float = Field(description="Cooling setpoint [deg C]", ge=18, le=30)
    economizer: Literal["NoEconomizer", "DifferentialDryBulb", "DifferentialEnthalpy"] = Field(
        description="Economizer type"
    )
    idf_file: FileReference = Field(description="Base IDF model file")
    weather_file: FileReference = Field(description="EPW weather file")
    design_day_file: FileReference = Field(description="DDY design day file")


class BuildingSimOutput(ExperimentOutputSpec):
    """Output specification with scalar results and time-series data."""

    heating_kwh_m2: float = Field(description="Annual heating [kWh/m2]", ge=0)
    cooling_kwh_m2: float = Field(description="Annual cooling [kWh/m2]", ge=0)
    lighting_kwh_m2: float = Field(description="Annual lighting [kWh/m2]", ge=0)
    fans_kwh_m2: float = Field(description="Annual fan energy [kWh/m2]", ge=0)
    total_eui: float = Field(description="Total site EUI [kWh/m2]", ge=0)
    timeseries: FileReference = Field(description="Hourly results CSV")

Key points:

  • Scalar fields (r_value, lpd, etc.) are collected into a Parquet table for analysis.
  • FileReference fields accept local paths, HTTP URLs, or S3 URIs. Scythe uploads local files to S3 at allocation time and resolves them back to local paths on the worker.
  • Pydantic Field constraints (ge, le) provide automatic validation, as the sketch below shows.
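
A spec that violates a constraint fails at construction time, before anything is uploaded or enqueued. A minimal sketch, assuming the BuildingSimInput schema above (the S3 URIs are placeholders):

from pydantic import ValidationError

try:
    BuildingSimInput(
        r_value=25.0,  # violates the le=15 constraint
        lpd=8.0,
        setpoint=24.0,
        economizer="NoEconomizer",
        idf_file="s3://my-bucket/models/office_base.idf",
        weather_file="s3://my-bucket/weather/USA_MA_Boston-Logan.epw",
        design_day_file="s3://my-bucket/weather/USA_MA_Boston-Logan.ddy",
    )
except ValidationError as exc:
    print(exc)  # reports that r_value must be less than or equal to 15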

Step 2: Register the Experiment Function

The experiment function receives a single BuildingSimInput and a temporary working directory, and returns a BuildingSimOutput. This is where idfkit does the heavy lifting.

from pathlib import Path

from scythe.registry import ExperimentRegistry  # import path may differ; check the Scythe docs


@ExperimentRegistry.Register()
def simulate_building(input_spec: BuildingSimInput, tempdir: Path) -> BuildingSimOutput:
    """Run a single parametric EnergyPlus simulation using idfkit."""
    from idfkit import load_idf
    from idfkit.simulation import simulate
    from idfkit.weather import DesignDayManager

    # Load the base IDF model (FileReference resolves to a local path)
    model = load_idf(input_spec.idf_file)

    # Apply parametric overrides
    for material in model["Material"]:
        if "wall_insulation" in material.Name.lower():
            material.Thermal_Resistance = input_spec.r_value

    for lights in model["Lights"]:
        lights.Watts_per_Zone_Floor_Area = input_spec.lpd

    for thermostat in model["ThermostatSetpoint:DualSetpoint"]:
        # Adjust cooling setpoint schedule
        pass  # modify as needed for your model

    # Inject ASHRAE design days from the DDY file
    ddm = DesignDayManager(input_spec.design_day_file)
    ddm.apply_to_model(model)

    # Run the simulation
    result = simulate(
        model,
        weather=input_spec.weather_file,
        output_dir=tempdir / "run",
        annual=True,
    )

    # Extract end-use totals from the SQL output
    sql = result.sql
    rows = sql.get_tabular_data(
        report_name="AnnualBuildingUtilityPerformanceSummary",
        table_name="End Uses",
    )

    # Build a lookup: (row_name, column_name) -> value
    end_use = {(r.row_name, r.column_name): r.value for r in rows}

    # Get conditioned floor area from the building summary
    area_rows = sql.get_tabular_data(
        report_name="AnnualBuildingUtilityPerformanceSummary",
        table_name="Building Area",
    )
    floor_area = float(next(r.value for r in area_rows if r.row_name == "Net Conditioned Building Area"))

    # Write hourly time-series to CSV for the FileReference output
    csv_path = tempdir / "timeseries.csv"
    if result.csv is not None:
        with open(csv_path, "w") as f:
            f.write("timestamp," + ",".join(c.header for c in result.csv.columns) + "\n")
            for i, ts in enumerate(result.csv.timestamps):
                vals = ",".join(str(c.values[i]) for c in result.csv.columns)
                f.write(f"{ts},{vals}\n")

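    # NOTE: reading these end uses from the "Electricity" column assumes an
    # all-electric model; use the appropriate fuel column(s) otherwise.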
    return BuildingSimOutput(
        heating_kwh_m2=float(end_use.get(("Heating", "Electricity"), 0)) / floor_area,
        cooling_kwh_m2=float(end_use.get(("Cooling", "Electricity"), 0)) / floor_area,
        lighting_kwh_m2=float(end_use.get(("Interior Lighting", "Electricity"), 0)) / floor_area,
        fans_kwh_m2=float(end_use.get(("Fans", "Electricity"), 0)) / floor_area,
        total_eui=sum(float(v) for (row, col), v in end_use.items() if row == "Total End Uses" and col == "Electricity")
        / floor_area,
        timeseries=csv_path,
        dataframes={},
    )

Inside the function you have full access to the idfkit API:

  • load_idf() parses the model file that Scythe downloaded to a local path
  • Object manipulation applies parametric changes (R-values, lighting, setpoints)
  • DesignDayManager loads a DDY file by path and injects design days into the model
  • simulate() runs EnergyPlus and returns structured results
  • result.sql.get_tabular_data() queries the SQLite output for tabular end-use data
  • result.csv provides access to time-series column data

Any file you write to tempdir and return as a FileReference gets uploaded to S3 automatically.

Step 3: Prepare Weather Data

Use idfkit's weather module to find stations and download EPW/DDY files before allocating experiments:

from idfkit.weather import StationIndex, WeatherDownloader

# Find the closest weather station
index = StationIndex.load()
results = index.nearest(latitude=42.36, longitude=-71.06, limit=3)

# Download EPW and DDY files
downloader = WeatherDownloader()
for result in results:
    files = downloader.download(result.station)
    print(f"{result.station.display_name} ({result.distance_km:.1f} km): {files.epw}, {files.ddy}")

# Use these local paths (or upload to S3) as FileReference inputs to Scythe

You can pass local file paths, HTTP URLs, or S3 URIs as FileReference values in your input specs. Scythe handles the upload and distribution to workers.
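
If you prefer to stage the files in S3 up front so every spec references a shared URI (as in Step 4), a plain boto3 upload is enough. A sketch, assuming the bucket and key names used in the next step:

import boto3

# Download the closest station's files and push them to the bucket
files = downloader.download(results[0].station)

s3 = boto3.client("s3")
s3.upload_file(str(files.epw), "my-bucket", "weather/USA_MA_Boston-Logan.epw")
s3.upload_file(str(files.ddy), "my-bucket", "weather/USA_MA_Boston-Logan.ddy")

# Reference them in specs as s3://my-bucket/weather/USA_MA_Boston-Logan.epw / .ddy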

Step 4: Allocate the Experiment

Create a parameter grid, build input specs, and let Scythe enqueue everything:

import itertools

import boto3
import pandas as pd
from scythe.base import BaseExperiment
from scythe.utils.recursion import RecursionMap

# Assume these are defined in your experiments module
from experiments.building_energy import BuildingSimInput, simulate_building

# Define parameter grid
r_values = [2.0, 3.5, 5.0, 7.0]
lpds = [5.0, 8.0, 12.0]
setpoints = [22.0, 24.0, 26.0]
economizers = ["NoEconomizer", "DifferentialDryBulb"]

# Build a DataFrame of all combinations
combos = list(itertools.product(r_values, lpds, setpoints, economizers))
df = pd.DataFrame(combos, columns=["r_value", "lpd", "setpoint", "economizer"])

# Add file references (same base model + weather for all runs)
df["idf_file"] = "s3://my-bucket/models/office_base.idf"
df["weather_file"] = "s3://my-bucket/weather/USA_MA_Boston-Logan.epw"
df["design_day_file"] = "s3://my-bucket/weather/USA_MA_Boston-Logan.ddy"

# Validate and allocate
specs = [BuildingSimInput.model_validate(row.to_dict()) for _, row in df.iterrows()]

experiment = BaseExperiment(experiment=simulate_building)
experiment.allocate(
    specs,
    version="bumpminor",
    s3_client=boto3.client("s3"),
    recursion_map=RecursionMap(factor=2, max_depth=3),
)

This creates 4 x 3 x 3 x 2 = 72 simulation tasks. Scythe uploads the IDF, EPW, and DDY files to S3 (deduplicating shared files), serializes the specs, and pushes work items to the Hatchet queue.

The RecursionMap controls how Scythe fans out work across workers. A factor=2, max_depth=3 configuration splits the batch into progressively smaller chunks for efficient scheduling.
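
As a rough illustration only (the exact splitting is handled by Scythe), assuming each recursion level divides the batch by factor:

batch, factor, max_depth = 72, 2, 3
for depth in range(max_depth + 1):
    chunks = factor**depth
    print(f"depth {depth}: {chunks} chunk(s) of ~{batch // chunks} tasks")
# depth 0: 1 chunk(s) of ~72 tasks
# depth 1: 2 chunk(s) of ~36 tasks
# depth 2: 4 chunk(s) of ~18 tasks
# depth 3: 8 chunk(s) of ~9 tasks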

Step 5: Run Workers

Each worker imports the registered experiments and starts the Scythe worker loop:

from scythe.worker import ScytheWorkerConfig

# Import your experiments so they are registered
from experiments.building_energy import simulate_building  # noqa: F401  # pyright: ignore[reportUnusedImport]

if __name__ == "__main__":
    worker_config = ScytheWorkerConfig()
    worker_config.start()

Workers pull tasks from the Hatchet queue, download input artifacts from S3, call the experiment function, and upload outputs back to S3.

Docker Setup

For containerized workers with EnergyPlus pre-installed:

FROM nrel/energyplus:24.2.0

RUN pip install idfkit scythe-engine

COPY experiments/ /app/experiments/
COPY main.py /app/main.py

WORKDIR /app
CMD ["python", "main.py"]

Scale horizontally by running multiple container instances against the same Hatchet queue.

Step 6: Gather Results

Once all tasks complete, Scythe organizes outputs in S3:

s3://my-bucket/experiments/<name>/<version>/
    manifest.yml              # Experiment metadata
    specs.pq                  # All input specs as Parquet
    scalars.pq                # All scalar outputs (heating, cooling, etc.)
    result_file_refs.pq       # S3 paths to output files (timeseries CSVs)
    experiment_io_spec.yml    # JSON Schema of input/output definitions

Load the scalar results directly with pandas:

import pandas as pd

scalars = pd.read_parquet("s3://my-bucket/experiments/.../scalars.pq")
print(scalars.describe())

The scalars.pq file contains a MultiIndex linking each output row back to its input parameters, making it straightforward to pivot, group, and plot results.
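
For example, to compare mean total EUI across cooling setpoints and economizer types (column names follow the schemas defined earlier; reset_index() pulls the input parameters out of the index):

summary = (
    scalars.reset_index()
    .groupby(["setpoint", "economizer"])["total_eui"]
    .mean()
    .unstack("economizer")
)
print(summary)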

idfkit Features Useful in Scythe Experiments

idfkit Feature                   Use Case in Scythe
load_idf() / load_epjson()       Load base model from FileReference
Object field manipulation        Apply parametric changes per spec
DesignDayManager                 Inject design days from DDY files
simulate()                       Run EnergyPlus inside the experiment function
result.sql.get_tabular_data()    Query tabular end-use summaries
result.csv                       Extract time-series for FileReference output
rotate_building()                Orientation studies
model.copy()                     Create variants from a shared base
Weather station search           Prepare EPW/DDY files before allocation

Tips

  • Keep experiment functions focused. Do model setup, simulation, and result extraction in the registered function. Avoid heavy post-processing; save raw outputs and analyze after gathering.
  • Use FileReference for large outputs. Scalar fields go into the Parquet summary; file references point to full time-series or report files in S3.
  • Deduplicate shared inputs. When all specs share the same IDF or EPW file, Scythe uploads it once. Use S3 URIs to avoid redundant uploads.
  • Pin EnergyPlus versions. Use idfkit.find_energyplus(version=...) or set ENERGYPLUS_DIR in your Docker image to ensure reproducible results.
  • Test locally first. Call your experiment function directly with a single spec and a temporary directory before allocating a full batch, as in the sketch below.
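
A minimal local smoke test, assuming the schema and registered function above are importable and the file paths point at local copies:

import tempfile
from pathlib import Path

from experiments.building_energy import BuildingSimInput, simulate_building

spec = BuildingSimInput(
    r_value=3.5,
    lpd=8.0,
    setpoint=24.0,
    economizer="NoEconomizer",
    idf_file="models/office_base.idf",
    weather_file="weather/USA_MA_Boston-Logan.epw",
    design_day_file="weather/USA_MA_Boston-Logan.ddy",
)

with tempfile.TemporaryDirectory() as tmp:
    output = simulate_building(spec, Path(tmp))
    print(f"Total EUI: {output.total_eui:.1f} kWh/m2")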

See Also