Evaluation Workflow: Comparing Models with the CLI¶
This guide walks through the complete workflow for evaluating models, visualizing results, and comparing metrics using the CHAP CLI.
Overview¶
The workflow consists of three main steps:
- evaluate2: Run a backtest and export results to NetCDF format
- plot-backtest: Generate visualizations from evaluation results
- export-metrics: Compare metrics across multiple evaluations in CSV format
Prerequisites¶
- CHAP Core installed (see Setup guide)
- A dataset CSV file with disease case data
- A GeoJSON file with region polygons (optional; auto-discovered if it has the same base name as the CSV)
Verify Installation¶
Before starting, verify that the CLI tools are installed correctly:
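A quick way to confirm this is to print the CLI usage text and check that the subcommands used in this guide (evaluate2, plot-backtest, export-metrics) are listed (assuming the chap entrypoint is on your PATH):

chap --help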
Example Dataset¶
CHAP includes a small example dataset for testing and learning:
- `example_data/laos_subset.csv` - Monthly dengue data for 3 provinces (2010-2012)
- `example_data/laos_subset.geojson` - Matching polygon boundaries
This dataset contains 108 rows with rainfall, temperature, disease cases, and population data for Bokeo, Vientiane, and Savannakhet provinces.
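To inspect the column layout before evaluating, peek at the first few rows (the exact column names are whatever your local copy contains):

head -n 5 example_data/laos_subset.csv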
Step 1: Create an Evaluation¶
Use evaluate2 to run a backtest on a model and export results to NetCDF format.
Standard Models (GitHub URL or Local Directory)¶
For models hosted on GitHub or cloned locally:
chap evaluate2 \
--model-name https://github.com/dhis2-chap/minimalist_example_r \
--dataset-csv ./data/vietnam_data.csv \
--output-file ./results/model_a_eval.nc \
--backtest-params.n-periods 3 \
--backtest-params.n-splits 7
Or using a local directory:
chap evaluate2 \
--model-name /path/to/minimalist_example_r \
--dataset-csv ./data/vietnam_data.csv \
--output-file ./results/model_a_eval.nc \
--backtest-params.n-periods 3 \
--backtest-params.n-splits 7
Chapkit Models¶
Chapkit models are REST API-based models that follow the chapkit specification. See Running models with chapkit for more details.
From a running chapkit service (URL):
chap evaluate2 \
--model-name http://localhost:8000 \
--dataset-csv ./data/vietnam_data.csv \
--output-file ./results/chapkit_eval.nc \
--run-config.is-chapkit-model \
--backtest-params.n-periods 3 \
--backtest-params.n-splits 7
From a local chapkit model directory (auto-starts the service):
When you provide a directory path with --run-config.is-chapkit-model, CHAP automatically:
- Starts a FastAPI dev server from the model directory using `uv run fastapi dev`
- Waits for the service to become healthy
- Runs the evaluation
- Stops the service when complete
chap evaluate2 \
--model-name /path/to/your/chapkit/model \
--dataset-csv ./data/vietnam_data.csv \
--output-file ./results/chapkit_eval.nc \
--run-config.is-chapkit-model \
--backtest-params.n-periods 3 \
--backtest-params.n-splits 7
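For debugging, you can reproduce the auto-start manually: launch the dev server yourself (CHAP runs essentially this command for you) and then pass the service URL as --model-name, as in the URL example above:

cd /path/to/your/chapkit/model
uv run fastapi dev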
Parameters¶
| Parameter | Description | Default |
|---|---|---|
| `--model-name` | Model path, GitHub URL, or chapkit service URL | Required |
| `--dataset-csv` | Path to CSV with disease data | Required |
| `--output-file` | Path for output NetCDF file | Required |
| `--backtest-params.n-periods` | Forecast horizon (periods ahead) | 3 |
| `--backtest-params.n-splits` | Number of train/test splits | 7 |
| `--backtest-params.stride` | Step size between splits | 1 |
| `--model-configuration-yaml` | Optional YAML with model config | None |
| `--run-config.is-chapkit-model` | Flag to indicate a chapkit model | false |
| `--run-config.ignore-environment` | Skip environment setup | false |
| `--run-config.debug` | Enable debug logging | false |
| `--run-config.run-directory-type` | Directory handling: `latest`, `timestamp`, or `use_existing` | `timestamp` |
| `--historical-context-years` | Years of historical data for plot context | 6 |
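For example, a run that doubles the stride between splits and supplies a model configuration file might look like this (the YAML path and its contents are model-specific placeholders):

chap evaluate2 \
--model-name https://github.com/dhis2-chap/minimalist_example_r \
--dataset-csv ./data/vietnam_data.csv \
--output-file ./results/model_a_eval.nc \
--backtest-params.n-periods 3 \
--backtest-params.n-splits 7 \
--backtest-params.stride 2 \
--model-configuration-yaml ./model_config.yaml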
GeoJSON Auto-Discovery¶
If your dataset is vietnam_data.csv, CHAP will automatically look for vietnam_data.geojson in the same directory.
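With a layout like the following, the polygons are picked up without any extra flag (illustrative paths):

data/
├── vietnam_data.csv
└── vietnam_data.geojson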
Step 2: Visualize the Evaluation¶
Use plot-backtest to generate visualizations from the evaluation results:
chap plot-backtest \
--input-file ./results/model_a_eval.nc \
--output-file ./results/model_a_plot.html \
--plot-type backtest_plot_1
Available Plot Types¶
| Plot Type | Description |
|---|---|
| `backtest_plot_1` | Standard backtest visualization with forecasts vs. observations |
| `evaluation_plot` | Evaluation summary plot |
| `ratio_of_samples_above_truth` | Shows forecast bias across locations |
Output Formats¶
The output format is determined by file extension:
- `.html` - Interactive HTML (recommended)
- `.png` - Static PNG image
- `.svg` - Vector SVG image
- `.pdf` - PDF document
- `.json` - Vega JSON specification
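Both options compose: for example, to render the forecast-bias plot as a static PNG instead of interactive HTML:

chap plot-backtest \
--input-file ./results/model_a_eval.nc \
--output-file ./results/model_a_bias.png \
--plot-type ratio_of_samples_above_truth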
Step 3: Create Another Evaluation¶
Run the same process with a different model for comparison:
chap evaluate2 \
--model-name https://github.com/dhis2-chap/chap_auto_ewars_weekly \
--dataset-csv ./data/vietnam_data.csv \
--output-file ./results/model_b_eval.nc \
--backtest-params.n-periods 3 \
--backtest-params.n-splits 7
Step 4: Export and Compare Metrics¶
Use export-metrics to compute metrics from multiple evaluations and export to CSV:
chap export-metrics \
--input-files ./results/model_a_eval.nc ./results/model_b_eval.nc \
--output-file ./results/comparison.csv
Output Format¶
The CSV contains one row per evaluation with metadata and metric columns:
filename,model_name,model_version,rmse_aggregate,mae_aggregate,crps,ratio_within_10th_90th,ratio_within_25th_75th,test_sample_count
model_a_eval.nc,minimalist_example_r,1.0.0,45.2,32.1,0.045,0.85,0.65,168
model_b_eval.nc,chap_auto_ewars_weekly,2.0.0,38.7,28.4,0.038,0.88,0.70,168
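To read the CSV as an aligned table in the terminal, standard tooling is enough (column ships with util-linux on Linux and is preinstalled on macOS):

column -s, -t ./results/comparison.csv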
Available Metrics¶
| Metric ID | Description |
|---|---|
| `rmse_aggregate` | Root Mean Squared Error (across all data) |
| `mae_aggregate` | Mean Absolute Error (across all data) |
| `crps` | Continuous Ranked Probability Score |
| `ratio_within_10th_90th` | Coverage ratio for the 10th-90th percentile interval |
| `ratio_within_25th_75th` | Coverage ratio for the 25th-75th percentile interval |
| `test_sample_count` | Number of test samples |
Selecting Specific Metrics¶
To export only specific metrics:
chap export-metrics \
--input-files ./results/model_a_eval.nc ./results/model_b_eval.nc \
--output-file ./results/comparison.csv \
--metric-ids rmse_aggregate mae_aggregate crps
Complete Example: Standard Models¶
Here's a complete workflow comparing two standard models using the included example dataset:
# Step 1: Evaluate first model (auto-regressive)
chap evaluate2 \
--model-name https://github.com/dhis2-chap/chap_auto_ewars \
--dataset-csv ./example_data/laos_subset.csv \
--output-file ./eval_ewars.nc \
--backtest-params.n-splits 3
# Step 2: Plot first model results
chap plot-backtest \
--input-file ./eval_ewars.nc \
--output-file ./plot_ewars.html
# Step 3: Evaluate second model (minimalist R model)
chap evaluate2 \
--model-name https://github.com/dhis2-chap/minimalist_example_r \
--dataset-csv ./example_data/laos_subset.csv \
--output-file ./eval_minimalist.nc \
--backtest-params.n-splits 3
# Step 4: Plot second model results
chap plot-backtest \
--input-file ./eval_minimalist.nc \
--output-file ./plot_minimalist.html
# Step 5: Compare metrics
chap export-metrics \
--input-files ./eval_ewars.nc ./eval_minimalist.nc \
--output-file ./model_comparison.csv
# View the comparison
cat ./model_comparison.csv
The GeoJSON file example_data/laos_subset.geojson is automatically discovered since it has the same base name as the CSV.
Complete Example: Chapkit Models¶
Here's a workflow using chapkit models, including both a running service and a local directory:
Option A: Using a running chapkit service¶
First, start your chapkit model service (e.g., using Docker):
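The exact command depends on how your model is packaged; as a placeholder sketch (the image name here is hypothetical), any invocation that exposes the service on the port you will pass to --model-name works:

docker run --rm -p 8000:8000 your-org/your-chapkit-model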
Then run the evaluation:
# Evaluate the chapkit model
chap evaluate2 \
--model-name http://localhost:8000 \
--dataset-csv ./example_data/laos_subset.csv \
--output-file ./eval_chapkit.nc \
--run-config.is-chapkit-model \
--backtest-params.n-splits 3
# Plot results
chap plot-backtest \
--input-file ./eval_chapkit.nc \
--output-file ./plot_chapkit.html
Option B: Using a local chapkit model directory (auto-start)¶
If you have a chapkit model in a local directory, CHAP can automatically start and stop the service:
# Clone or create your chapkit model
git clone https://github.com/your-org/your-chapkit-model /path/to/chapkit-model
# Evaluate with auto-start (CHAP starts the service automatically)
chap evaluate2 \
--model-name /path/to/chapkit-model \
--dataset-csv ./example_data/laos_subset.csv \
--output-file ./eval_local_chapkit.nc \
--run-config.is-chapkit-model \
--backtest-params.n-splits 3
# Plot results
chap plot-backtest \
--input-file ./eval_local_chapkit.nc \
--output-file ./plot_local_chapkit.html
Comparing chapkit and standard models¶
You can compare chapkit models with standard models using export-metrics:
# Evaluate a standard model
chap evaluate2 \
--model-name https://github.com/dhis2-chap/minimalist_example_r \
--dataset-csv ./example_data/laos_subset.csv \
--output-file ./eval_standard.nc \
--backtest-params.n-splits 3
# Evaluate a chapkit model
chap evaluate2 \
--model-name /path/to/chapkit-model \
--dataset-csv ./example_data/laos_subset.csv \
--output-file ./eval_chapkit.nc \
--run-config.is-chapkit-model \
--backtest-params.n-splits 3
# Compare both
chap export-metrics \
--input-files ./eval_standard.nc ./eval_chapkit.nc \
--output-file ./comparison.csv
Tips¶
- Consistent parameters: Use the same `n-periods` and `n-splits` when comparing models
- Same dataset: Always use identical datasets for a fair comparison
- Multiple runs: Consider running evaluations with different random seeds for robustness
- Metric interpretation: Lower RMSE/MAE/CRPS is better; higher coverage ratios indicate better-calibrated uncertainty
- Chapkit auto-start: When using local chapkit directories, ensure `uv` is installed and the model directory has a valid FastAPI app structure with a `/health` endpoint (see the check below)
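To confirm a running service is healthy by hand before evaluating (assuming the default port used in the examples above):

curl http://localhost:8000/health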