Evaluation Pipeline¶
This document explains how model evaluation (backtesting) works internally in CHAP, with a focus on the expanding window cross-validation strategy used to split time series data.
Overview¶
The evaluation pipeline answers the question: "How well does a model predict disease cases on data it has not seen?"
It does this by:
- Splitting a historical dataset into training and test portions
- Training the model on the training data
- Generating predictions for each test window
- Comparing predictions against observed values (ground truth)
Because disease surveillance data is a time series, we cannot use random train/test splits. Instead, CHAP uses expanding window cross-validation, where the training data always precedes the test data chronologically.
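As a minimal illustration (generic Python, not CHAP's API), a chronological split simply cuts the series at a point in time instead of sampling randomly:
# Minimal illustration of a chronological split (generic Python, not CHAP's API).
series = list(range(20))                 # 20 time periods, oldest to newest
split_point = 15                         # training data ends here
train, test = series[:split_point], series[split_point:]
assert max(train) < min(test)            # training always precedes testing in time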
Pipeline Architecture¶
The evaluation flow from entry point to results:
Evaluation.create() # evaluation.py
|
+--> backtest() # prediction_evaluator.py
| |
| +--> train_test_generator() # dataset_splitting.py
| | Returns (train_set, splits_iterator)
| |
| +--> estimator.train(train_set)
| | Returns predictor
| |
| +--> for each split:
| predictor.predict(historic, future)
| Merge predictions with ground truth
| Yield DataSet[SamplesWithTruth]
|
+--> Evaluation.from_samples_with_truth()
Wraps results in an Evaluation object
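In simplified Python, the same flow looks roughly like this. The call order and tuple shape follow the diagram above, but the exact train_test_generator signature is assumed, and merge_with_truth() is a hypothetical placeholder for the merging step, not a real chap_core function:
# Simplified sketch of the flow above. The call order and tuple shape follow
# the diagram; the train_test_generator signature is assumed, and
# merge_with_truth() is a hypothetical placeholder, not a chap_core function.
def run_backtest(estimator, dataset, prediction_length, n_test_sets, stride):
    # One split pass: an initial training set plus an iterator of test windows.
    train_set, splits = train_test_generator(
        dataset, prediction_length, n_test_sets=n_test_sets, stride=stride
    )
    # The model is trained exactly once, on the initial training set.
    predictor = estimator.train(train_set)
    # For each window: predict from masked future covariates, then attach the
    # observed values so predictions can be scored later.
    for historic, masked_future, future_truth in splits:
        samples = predictor.predict(historic, masked_future)
        yield merge_with_truth(samples, future_truth)  # DataSet[SamplesWithTruth]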
Expanding Window Cross-Validation¶
The Problem¶
Standard k-fold cross-validation randomly assigns data points to folds. This is invalid for time series because:
- Models would train on future data and predict the past
- Temporal autocorrelation would leak information between folds
The Strategy¶
CHAP uses an expanding window approach where:
- The model is trained once on an initial training set
- Multiple test windows are created by sliding forward through the data
- For each test window, the model receives all historical data up to that point
The key parameters are:
- prediction_length: how many periods each test window covers
- n_test_sets: how many test windows to create
- stride: how many periods to advance between windows
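A hypothetical call, with keyword arguments named after these parameters (the actual signature in dataset_splitting.py may differ):
# Hypothetical call; keyword names follow the parameters described above,
# and the actual signature in dataset_splitting.py may differ.
train_set, splits = train_test_generator(
    dataset,
    prediction_length=3,  # each test window covers 3 periods
    n_test_sets=3,        # create 3 test windows
    stride=1,             # advance 1 period between consecutive windows
)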
How Split Indices Are Calculated¶
The train_test_generator function computes split indices from the end of the dataset, working backwards.
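The implementation is not reproduced here, but judging from the worked example below, the offset is effectively the following (an inferred reconstruction, not the actual source):
# Inferred from the worked example below: a negative index counted from the
# end of the dataset (reconstruction, not the actual chap_core source).
split_idx = -(prediction_length + (n_test_sets - 1) * stride + 1)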
This ensures the last test window ends at the final period of the dataset.
Concrete Example¶
Consider a dataset with 20 monthly periods (indices 0-19), prediction_length=3, n_test_sets=3, stride=1:
split_idx = -(3 + (3 - 1) * 1 + 1) = -6 -> index 14
Split 0: historic = [0..14], future = [15, 16, 17]
Split 1: historic = [0..15], future = [16, 17, 18]
Split 2: historic = [0..16], future = [17, 18, 19]
Train set = [0..14] (same as split 0 historic data)
Visually, with T = train, H = extra historic context, F = future/test:
Period: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Train: T T T T T T T T T T T T T T T
Split 0: T T T T T T T T T T T T T T T F F F
Split 1: T T T T T T T T T T T T T T T H F F F
Split 2: T T T T T T T T T T T T T T T H H F F F
Note how the historic data expands with each split while the future window slides forward.
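The indices above can be reproduced with a small, self-contained sketch; it mirrors the expanding-window arithmetic in plain Python and is not the chap_core implementation:
# Plain-Python sketch that reproduces the indices in the example above.
# It mirrors the expanding-window arithmetic, not the chap_core implementation.
n_periods = 20
prediction_length, n_test_sets, stride = 3, 3, 1

split_idx = -(prediction_length + (n_test_sets - 1) * stride + 1)  # -6
last_train_idx = n_periods + split_idx                             # index 14

print(f"Train set = [0..{last_train_idx}]")
for i in range(n_test_sets):
    historic_end = last_train_idx + i * stride        # historic context expands
    future_start = historic_end + 1                   # future window slides forward
    future = list(range(future_start, future_start + prediction_length))
    print(f"Split {i}: historic = [0..{historic_end}], future = {future}")
Running this prints exactly the train set and three splits listed above.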
What the Model Sees¶
For each test split, the predictor receives:
- historic_data: full dataset (all features including disease_cases) up to the split point
- future_data (masked): future covariates (e.g. climate data) without disease_cases -- this is what the model uses to make predictions
- future_data (truth): full future data including disease_cases -- used after prediction to evaluate accuracy
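Schematically, one split might carry data like the following; the covariate name and all values are illustrative placeholders, and the real inputs are CHAP DataSet objects keyed by location and time period:
# Illustrative sketch of one test split (split 0 from the example above).
# The covariate name and all values are placeholders; the real inputs are
# CHAP DataSet objects keyed by location and time period.
historic_data = {                 # all features up to the split point
    "period": list(range(15)),            # periods 0..14
    "mean_temperature": [24.0] * 15,
    "disease_cases": [12] * 15,
}
masked_future = {                 # future covariates only -- no disease_cases
    "period": [15, 16, 17],
    "mean_temperature": [24.8, 25.0, 25.3],
}
future_truth = {                  # same periods plus disease_cases, used for scoring
    "period": [15, 16, 17],
    "mean_temperature": [24.8, 25.0, 25.3],
    "disease_cases": [14, 17, 21],
}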
Key Components¶
chap_core/assessment/dataset_splitting.py¶
Handles splitting datasets into train/test portions:
- train_test_generator() -- main function implementing expanding window splits
- train_test_split() -- single split at one time point
- split_test_train_on_period() -- generates splits at multiple split points
- get_split_points_for_data_set() -- computes evenly-spaced split points
chap_core/assessment/prediction_evaluator.py¶
Runs the model and collects predictions:
- backtest() -- trains model once, yields predictions for each split
- evaluate_model() -- full evaluation with GluonTS metrics and PDF report
chap_core/assessment/evaluation.py¶
High-level evaluation abstraction:
- Evaluation.create() -- end-to-end factory: runs backtest and wraps results
- Evaluation.from_samples_with_truth() -- builds evaluation from raw prediction results
- Evaluation.to_file() / from_file() -- NetCDF serialization for sharing results
Code Flow: Evaluation.create()¶
Step-by-step walkthrough of what happens when Evaluation.create() is called (e.g. from the CLI chap evaluate command):
- backtest() is called with the estimator and dataset
- Inside backtest(), train_test_generator() computes the split index and creates:
  - A training set (data up to the first split point)
  - An iterator of (historic, masked_future, future_truth) tuples
- The estimator is trained once on the training set, producing a predictor
- For each test split, the predictor generates samples and they are merged with ground truth into SamplesWithTruth objects
- Back in create(), train_test_generator() is called again to determine the last training period
- from_samples_with_truth() assembles an Evaluation object containing:
  - BackTest with all forecasts and observations
  - Historical observations for plotting context
- The Evaluation can then be exported to NetCDF, used for metric computation, or visualized
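Putting it together, programmatic usage might look like the sketch below; the argument names and file handling are assumptions based on the components described above, not verbatim chap_core code:
# Hypothetical end-to-end usage; argument names and exact signatures are
# assumptions based on the components described in this document.
evaluation = Evaluation.create(
    estimator,           # any CHAP-compatible model estimator
    dataset,             # historical DataSet with covariates and disease_cases
    prediction_length=3,
    n_test_sets=3,
)
evaluation.to_file("backtest_results.nc")              # NetCDF export for sharing
restored = Evaluation.from_file("backtest_results.nc")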