Distributed Simulations with Scythe¶
Scythe is a lightweight framework for running embarrassingly parallel experiments at scale via the Hatchet distributed task queue. It handles artifact management, S3 storage, and result collection so you can focus on the simulation logic.
By combining idfkit for EnergyPlus model manipulation and simulation with Scythe for distributed orchestration, you can run large parametric studies across hundreds or thousands of building variants without writing your own queuing or storage infrastructure.
Note
Scythe is an independent project in early development. See the Scythe documentation for the latest API details and setup instructions.
Prerequisites¶
Install both packages:
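```bash
pip install idfkit scythe-engine
```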
Workers need EnergyPlus installed. The NREL Docker images are a convenient base for containerized deployments.
You also need a running Hatchet instance (self-hosted or cloud) and an S3-compatible bucket for artifacts. See hatchet-sst for a self-hosting guide.
How It Works¶
Scythe follows a scatter-gather pattern:
1. Define input and output schemas as Pydantic models
2. Register an experiment function that maps one input to one output
3. Allocate a batch of input specs; Scythe uploads artifacts and enqueues tasks
4. Start workers that pull tasks from the Hatchet queue and execute them
5. Gather results from S3 as organized Parquet files
idfkit fits into step 2: the registered experiment function uses idfkit to load an IDF, apply parameter changes, run the EnergyPlus simulation, and extract results.
Step 1: Define Input/Output Schemas¶
Schemas inherit from Scythe's ExperimentInputSpec and ExperimentOutputSpec.
Use FileReference fields for files that Scythe should manage (upload to / download
from S3 automatically).
from typing import Literal
from pydantic import Field
from scythe.base import ExperimentInputSpec, ExperimentOutputSpec
from scythe.utils.filesys import FileReference
class BuildingSimInput(ExperimentInputSpec):
"""Input specification for a parametric building energy study."""
r_value: float = Field(description="Wall insulation R-value [m2K/W]", ge=0, le=15)
lpd: float = Field(description="Lighting power density [W/m2]", ge=0, le=20)
setpoint: float = Field(description="Cooling setpoint [deg C]", ge=18, le=30)
economizer: Literal["NoEconomizer", "DifferentialDryBulb", "DifferentialEnthalpy"] = Field(
description="Economizer type"
)
idf_file: FileReference = Field(description="Base IDF model file")
weather_file: FileReference = Field(description="EPW weather file")
design_day_file: FileReference = Field(description="DDY design day file")
class BuildingSimOutput(ExperimentOutputSpec):
"""Output specification with scalar results and time-series data."""
heating_kwh_m2: float = Field(description="Annual heating [kWh/m2]", ge=0)
cooling_kwh_m2: float = Field(description="Annual cooling [kWh/m2]", ge=0)
lighting_kwh_m2: float = Field(description="Annual lighting [kWh/m2]", ge=0)
fans_kwh_m2: float = Field(description="Annual fan energy [kWh/m2]", ge=0)
total_eui: float = Field(description="Total site EUI [kWh/m2]", ge=0)
timeseries: FileReference = Field(description="Hourly results CSV")
Key points:
- Scalar fields (r_value, lpd, etc.) are collected into a Parquet table for analysis.
- FileReference fields accept local paths, HTTP URLs, or S3 URIs. Scythe uploads local files to S3 at allocation time and resolves them back to local paths on the worker.
- Pydantic Field constraints (ge, le) provide automatic validation.
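For example, a spec with an out-of-range value fails as soon as it is validated. A minimal sketch (file paths are placeholders, and it assumes FileReference fields accept plain path strings, per the note above):

```python
from pydantic import ValidationError

try:
    BuildingSimInput.model_validate({
        "r_value": 20.0,  # violates the le=15 constraint
        "lpd": 8.0,
        "setpoint": 24.0,
        "economizer": "DifferentialDryBulb",
        "idf_file": "models/office_base.idf",
        "weather_file": "weather/boston.epw",
        "design_day_file": "weather/boston.ddy",
    })
except ValidationError as exc:
    print(exc)  # reports that r_value must be less than or equal to 15
```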
Step 2: Register the Experiment Function¶
The experiment function receives a single BuildingSimInput and a temporary
working directory, and returns a BuildingSimOutput. This is where idfkit
does the heavy lifting.
from pathlib import Path

# ExperimentRegistry is provided by Scythe; see the Scythe docs for the exact import path
@ExperimentRegistry.Register()
def simulate_building(input_spec: BuildingSimInput, tempdir: Path) -> BuildingSimOutput:
"""Run a single parametric EnergyPlus simulation using idfkit."""
from idfkit import load_idf
from idfkit.simulation import simulate
from idfkit.weather import DesignDayManager
# Load the base IDF model (FileReference resolves to a local path)
model = load_idf(input_spec.idf_file)
# Apply parametric overrides
for material in model["Material"]:
if "wall_insulation" in material.Name.lower():
material.Thermal_Resistance = input_spec.r_value
for lights in model["Lights"]:
lights.Watts_per_Zone_Floor_Area = input_spec.lpd
for thermostat in model["ThermostatSetpoint:DualSetpoint"]:
# Adjust cooling setpoint schedule
pass # modify as needed for your model
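    # Apply the economizer selection from the spec; this assumes the model's air
    # systems use Controller:OutdoorAir objects (adjust for your HVAC configuration)
    for controller in model["Controller:OutdoorAir"]:
        controller.Economizer_Control_Type = input_spec.economizer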
# Inject ASHRAE design days from the DDY file
ddm = DesignDayManager(input_spec.design_day_file)
ddm.apply_to_model(model)
# Run the simulation
result = simulate(
model,
weather=input_spec.weather_file,
output_dir=tempdir / "run",
annual=True,
)
# Extract end-use totals from the SQL output
sql = result.sql
rows = sql.get_tabular_data(
report_name="AnnualBuildingUtilityPerformanceSummary",
table_name="End Uses",
)
# Build a lookup: (row_name, column_name) -> value
end_use = {(r.row_name, r.column_name): r.value for r in rows}
# Get conditioned floor area from the building summary
area_rows = sql.get_tabular_data(
report_name="AnnualBuildingUtilityPerformanceSummary",
table_name="Building Area",
)
floor_area = float(next(r.value for r in area_rows if r.row_name == "Net Conditioned Building Area"))
# Write hourly time-series to CSV for the FileReference output
csv_path = tempdir / "timeseries.csv"
if result.csv is not None:
with open(csv_path, "w") as f:
f.write("timestamp," + ",".join(c.header for c in result.csv.columns) + "\n")
for i, ts in enumerate(result.csv.timestamps):
vals = ",".join(str(c.values[i]) for c in result.csv.columns)
f.write(f"{ts},{vals}\n")
return BuildingSimOutput(
heating_kwh_m2=float(end_use.get(("Heating", "Electricity"), 0)) / floor_area,
cooling_kwh_m2=float(end_use.get(("Cooling", "Electricity"), 0)) / floor_area,
lighting_kwh_m2=float(end_use.get(("Interior Lighting", "Electricity"), 0)) / floor_area,
fans_kwh_m2=float(end_use.get(("Fans", "Electricity"), 0)) / floor_area,
total_eui=sum(float(v) for (row, col), v in end_use.items() if row == "Total End Uses" and col == "Electricity")
/ floor_area,
timeseries=csv_path,
dataframes={},
)
Inside the function you have full access to the idfkit API:
- load_idf() parses the model file that Scythe downloaded to a local path
- Object manipulation applies parametric changes (R-values, lighting, setpoints)
- DesignDayManager loads a DDY file by path and injects design days into the model
- simulate() runs EnergyPlus and returns structured results
- result.sql.get_tabular_data() queries the SQLite output for tabular end-use data
- result.csv provides access to time-series column data
Any file you write to tempdir and return as a FileReference gets uploaded
to S3 automatically.
Step 3: Prepare Weather Data¶
Use idfkit's weather module to find stations and download EPW/DDY files before allocating experiments:
from idfkit.weather import StationIndex, WeatherDownloader
# Find the closest weather station
index = StationIndex.load()
results = index.nearest(latitude=42.36, longitude=-71.06, limit=3)
# Download EPW and DDY files
downloader = WeatherDownloader()
for result in results:
files = downloader.download(result.station)
print(f"{result.station.display_name} ({result.distance_km:.1f} km): {files.epw}, {files.ddy}")
# Use these local paths (or upload to S3) as FileReference inputs to Scythe
You can pass local file paths, HTTP URLs, or S3 URIs as FileReference values
in your input specs. Scythe handles the upload and distribution to workers.
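If you want every spec to reference a single uploaded copy, you can push the downloaded files to S3 yourself and use the resulting URIs. A sketch with boto3 (the bucket name and keys are placeholders; files comes from the download loop above):

```python
import boto3

s3 = boto3.client("s3")

# Upload the EPW/DDY once; every input spec can then reference the same objects
s3.upload_file(str(files.epw), "my-bucket", "weather/USA_MA_Boston-Logan.epw")
s3.upload_file(str(files.ddy), "my-bucket", "weather/USA_MA_Boston-Logan.ddy")

epw_uri = "s3://my-bucket/weather/USA_MA_Boston-Logan.epw"
ddy_uri = "s3://my-bucket/weather/USA_MA_Boston-Logan.ddy"
```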
Step 4: Allocate the Experiment¶
Create a parameter grid, build input specs, and let Scythe enqueue everything:
import itertools
import boto3
import pandas as pd
from scythe.base import BaseExperiment
from scythe.utils.recursion import RecursionMap
# Assume these are defined in your experiments module
from experiments.building_energy import BuildingSimInput, simulate_building
# Define parameter grid
r_values = [2.0, 3.5, 5.0, 7.0]
lpds = [5.0, 8.0, 12.0]
setpoints = [22.0, 24.0, 26.0]
economizers = ["NoEconomizer", "DifferentialDryBulb"]
# Build a DataFrame of all combinations
combos = list(itertools.product(r_values, lpds, setpoints, economizers))
df = pd.DataFrame(combos, columns=["r_value", "lpd", "setpoint", "economizer"])
# Add file references (same base model + weather for all runs)
df["idf_file"] = "s3://my-bucket/models/office_base.idf"
df["weather_file"] = "s3://my-bucket/weather/USA_MA_Boston-Logan.epw"
df["design_day_file"] = "s3://my-bucket/weather/USA_MA_Boston-Logan.ddy"
# Validate and allocate
specs = [BuildingSimInput.model_validate(row.to_dict()) for _, row in df.iterrows()]
experiment = BaseExperiment(experiment=simulate_building)
experiment.allocate(
specs,
version="bumpminor",
s3_client=boto3.client("s3"),
recursion_map=RecursionMap(factor=2, max_depth=3),
)
This creates 4 x 3 x 3 x 2 = 72 simulation tasks. Scythe uploads the IDF,
EPW, and DDY files to S3 (deduplicating shared files), serializes the specs,
and pushes work items to the Hatchet queue.
The RecursionMap controls how Scythe fans out work across workers. A
factor=2, max_depth=3 configuration splits the batch into progressively
smaller chunks for efficient scheduling.
Step 5: Run Workers¶
Each worker imports the registered experiments and starts the Scythe worker loop:
from scythe.worker import ScytheWorkerConfig
# Import your experiments so they are registered
from experiments.building_energy import simulate_building # noqa: F401 # pyright: ignore[reportUnusedImport]
if __name__ == "__main__":
worker_config = ScytheWorkerConfig()
worker_config.start()
Workers pull tasks from the Hatchet queue, download input artifacts from S3, call the experiment function, and upload outputs back to S3.
Docker Setup¶
For containerized workers with EnergyPlus pre-installed:
FROM nrel/energyplus:24.2.0
RUN pip install idfkit scythe-engine
COPY experiments/ /app/experiments/
COPY main.py /app/main.py
WORKDIR /app
CMD ["python", "main.py"]
Scale horizontally by running multiple container instances against the same Hatchet queue.
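For example, building the image once and starting several workers by hand (a sketch; the image name and env file are placeholders, and the env file is assumed to carry your Hatchet and S3 credentials):

```bash
docker build -t building-sim-worker .
# Start two workers against the same queue; repeat (or use an orchestrator) to scale further
docker run -d --env-file worker.env building-sim-worker
docker run -d --env-file worker.env building-sim-worker
```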
Step 6: Gather Results¶
Once all tasks complete, Scythe organizes outputs in S3:
s3://my-bucket/experiments/<name>/<version>/
manifest.yml # Experiment metadata
specs.pq # All input specs as Parquet
scalars.pq # All scalar outputs (heating, cooling, etc.)
result_file_refs.pq # S3 paths to output files (timeseries CSVs)
experiment_io_spec.yml # JSON Schema of input/output definitions
Load the scalar results directly with pandas:
import pandas as pd
scalars = pd.read_parquet("s3://my-bucket/experiments/.../scalars.pq")
print(scalars.describe())
The scalars.pq file contains a MultiIndex linking each output row back to
its input parameters, making it straightforward to pivot, group, and plot
results.
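For example, assuming the index levels carry the input parameter names from the spec, you can reset the index and pivot on any pair of parameters:

```python
import pandas as pd

scalars = pd.read_parquet("s3://my-bucket/experiments/.../scalars.pq")

# Bring the input parameters out of the index, then compare EUI across the grid
df = scalars.reset_index()
eui_table = df.pivot_table(index="r_value", columns="lpd", values="total_eui", aggfunc="mean")
print(eui_table)
```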
idfkit Features Useful in Scythe Experiments¶
| idfkit Feature | Use Case in Scythe |
|---|---|
| load_idf() / load_epjson() | Load base model from FileReference |
| Object field manipulation | Apply parametric changes per spec |
| DesignDayManager | Inject design days from DDY files |
| simulate() | Run EnergyPlus inside the experiment function |
| result.sql.get_tabular_data() | Query tabular end-use summaries |
| result.csv | Extract time-series for FileReference output |
| rotate_building() | Orientation studies |
| model.copy() | Create variants from a shared base |
| Weather station search | Prepare EPW/DDY files before allocation |
Tips¶
- Keep experiment functions focused. Do model setup, simulation, and result extraction in the registered function. Avoid heavy post-processing; save raw outputs and analyze after gathering.
- Use FileReference for large outputs. Scalar fields go into the Parquet summary; file references point to full time-series or report files in S3.
- Deduplicate shared inputs. When all specs share the same IDF or EPW file, Scythe uploads it once. Use S3 URIs to avoid redundant uploads.
- Pin EnergyPlus versions. Use idfkit.find_energyplus(version=...) or set ENERGYPLUS_DIR in your Docker image to ensure reproducible results.
- Test locally first. Call your experiment function directly with a single spec and a temporary directory before allocating a full batch (see the sketch below).
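A minimal local smoke test might look like this (the paths are placeholders; it assumes FileReference fields accept plain local paths and that the registered function can be called like an ordinary Python function):

```python
import tempfile
from pathlib import Path

from experiments.building_energy import BuildingSimInput, simulate_building

spec = BuildingSimInput(
    r_value=3.5,
    lpd=8.0,
    setpoint=24.0,
    economizer="DifferentialDryBulb",
    idf_file="models/office_base.idf",
    weather_file="weather/USA_MA_Boston-Logan.epw",
    design_day_file="weather/USA_MA_Boston-Logan.ddy",
)

with tempfile.TemporaryDirectory() as tmp:
    output = simulate_building(spec, Path(tmp))
    print(output.total_eui)
```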
See Also¶
- Scythe documentation
- Scythe example repository
- Batch Processing -- idfkit's built-in thread-pool batch runner
- Cloud Simulations (S3) -- Using idfkit with S3 directly
- Running Simulations -- Single simulation guide