Progressive Effects Walkthrough

This walkthrough shows how to progressively add modeling effects to a simple linear regression. Each step adds one new type of feature, and we measure the resulting improvement with backtesting.

By the end, you will have built a model with location-specific offsets, seasonal patterns, climate covariates, and lagged disease cases.

1. Loading the Data

from chap_core.spatio_temporal_data.temporal_dataclass import DataSet

dataset = DataSet.from_csv("example_data/laos_subset.csv")

print("Locations:", list(dataset.keys()))
print("Period range:", dataset.period_range)
print("Number of periods:", len(dataset.period_range))

Locations: ['Bokeo', 'Savannakhet', 'Vientiane[prefecture]']
Period range: PeriodRange(Month(2010-1)..Month(2012-12))
Number of periods: 36

2. A Basic Estimator

We define a BasicEstimator that takes a feature extraction function. Different feature functions produce different models, while the estimator handles the training and prediction boilerplate.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from chap_core.datatypes import Samples


class BasicEstimator:
    """Linear regression on features produced by a user-supplied function."""

    def __init__(self, extract_features):
        # extract_features maps a DataFrame to a DataFrame of numeric features
        self.extract_features = extract_features

    def train(self, data):
        df = data.to_pandas()
        X = self.extract_features(df)
        y = df["disease_cases"].values
        # Keep only rows with a finite target and finite values in every feature
        mask = np.isfinite(y) & np.all(np.isfinite(X.values), axis=1)
        self.model = LinearRegression().fit(X[mask], y[mask])
        return self

    def predict(self, historic_data, future_data):
        parts, future_mask = [], []
        for location in future_data.keys():
            hist = historic_data[location].to_pandas().assign(location=location)
            fut = future_data[location].to_pandas().assign(location=location)
            # Future cases are unknown; add the column so lag features can be computed
            if "disease_cases" not in fut.columns:
                fut["disease_cases"] = np.nan
            parts.append(pd.concat([hist, fut], ignore_index=True))
            future_mask += [False] * len(hist) + [True] * len(fut)
        # Extract features on all locations at once, then keep only the future rows
        combined = pd.concat(parts, ignore_index=True)
        X = self.extract_features(combined).fillna(0)
        # Case counts cannot be negative, so clip predictions at zero
        pred = np.clip(self.model.predict(X[future_mask]), 0, None)
        results, i = {}, 0
        # Predictions are ordered by location, so slice them back out in the same order
        for location in future_data.keys():
            n = len(future_data[location])
            results[location] = Samples(
                future_data[location].time_period, pred[i : i + n].reshape(-1, 1)
            )
            i += n
        return DataSet(results)

The predict method combines all locations' historic and future data into a single DataFrame before extracting features. This ensures feature columns (like location dummies) stay consistent between training and prediction, and allows lag-based features to look back into the historic window.
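The effect of encoding the combined frame can be seen in a minimal sketch. The frame below is made up for illustration (only the location names come from the dataset above, and the `is_future` column is a hypothetical helper). Encoding one location's rows on their own yields a single indicator column, while encoding everything first and selecting rows afterwards keeps one column per location:

```python
import pandas as pd

# Hypothetical toy frame; only the location names come from the walkthrough.
df = pd.DataFrame({
    "location": ["Bokeo", "Bokeo", "Savannakhet", "Savannakhet"],
    "is_future": [False, True, False, True],
})

# Encoding a subset on its own only produces columns for the locations it contains.
subset_only = pd.get_dummies(df.loc[df["location"] == "Bokeo", "location"], dtype=float)

# Encoding the combined frame keeps one column per location;
# the future rows are selected afterwards with a boolean mask.
combined = pd.get_dummies(df["location"], dtype=float)
future_rows = combined[df["is_future"].to_numpy()]

print(list(subset_only.columns))
print(list(future_rows.columns))
```

A model trained on two location columns could not be applied to `subset_only`, whereas `future_rows` has the same columns as the training features.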

3. Evaluation Helper

We use backtest to run expanding-window cross-validation and compute mean absolute error (MAE) for each model variant:

from chap_core.assessment.prediction_evaluator import backtest


def evaluate(estimator, dataset, prediction_length=3, n_test_sets=4):
    """Return the MAE over all test windows, locations, and forecast steps."""
    results = list(backtest(
        estimator, dataset,
        prediction_length=prediction_length, n_test_sets=n_test_sets,
    ))
    errors = []
    for result in results:
        for location in result.keys():
            truth = result[location].disease_cases
            # BasicEstimator returns a single sample column, so column 0 is the point forecast
            predicted = result[location].samples[:, 0]
            errors.extend(np.abs(truth - predicted))
    return np.mean(errors)
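For intuition about expanding-window evaluation in general, the following sketch lists train/test index splits for 36 periods. It is a generic illustration and not chap_core's implementation: the function name `expanding_window_splits` and the `stride` parameter are hypothetical, and the positions that `backtest` actually chooses may differ.

```python
def expanding_window_splits(n_periods, prediction_length=3, n_test_sets=4, stride=1):
    """Return (train_end, test_indices) pairs; training uses periods [0, train_end)."""
    splits = []
    # The last test window ends at the final period; earlier windows
    # start `stride` periods earlier each (stride is an assumption here).
    last_start = n_periods - prediction_length
    for k in range(n_test_sets):
        start = last_start - (n_test_sets - 1 - k) * stride
        splits.append((start, list(range(start, start + prediction_length))))
    return splits


splits = expanding_window_splits(n_periods=36)
for train_end, test_idx in splits:
    print(f"train on periods 0..{train_end - 1}, test on {test_idx}")
```

Each successive split trains on a longer history, which is what distinguishes an expanding window from a fixed-size rolling window.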

4. Location-Specific Offset

The simplest region-aware feature is one indicator variable per location, which lets the model learn a different baseline for each region.

def location_offset(df):
    return pd.get_dummies(df["location"], dtype=float)


mae = evaluate(BasicEstimator(location_offset), dataset)
print(f"Location offset MAE: {mae:.1f}")

Location offset MAE: 140.8
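With location indicators as the only features, a least-squares fit reduces to predicting each location's mean case count. The sketch below checks this on made-up numbers, using plain NumPy least squares with an explicit intercept column as a stand-in for LinearRegression:

```python
import numpy as np
import pandas as pd

# Made-up case counts for two locations.
df = pd.DataFrame({
    "location": ["Bokeo"] * 3 + ["Savannakhet"] * 3,
    "disease_cases": [10.0, 20.0, 30.0, 100.0, 140.0, 120.0],
})

X = pd.get_dummies(df["location"], dtype=float)
# Explicit intercept column plus one indicator per location.
A = np.column_stack([np.ones(len(X)), X.to_numpy()])
coef, *_ = np.linalg.lstsq(A, df["disease_cases"].to_numpy(), rcond=None)

fitted = A @ coef
group_means = df.groupby("location")["disease_cases"].transform("mean").to_numpy()
print(np.round(fitted, 6))
print(group_means)
```

The intercept and the full set of indicators are collinear, so the individual coefficients are not unique, but the fitted values are: they equal the per-location means.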

5. Seasonal Effect

Disease incidence often follows seasonal patterns. Adding month-of-year indicators captures periodic variation:

def location_and_season(df):
    location = pd.get_dummies(df["location"], dtype=float)
    month = pd.get_dummies(df["time_period"].dt.month, prefix="month", dtype=float)
    return pd.concat([location, month], axis=1)


mae = evaluate(BasicEstimator(location_and_season), dataset)
print(f"Location + season MAE: {mae:.1f}")

Location + season MAE: 130.1
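As a small illustration of the month encoding, the sketch below builds the indicators for two years of made-up monthly timestamps. It assumes the `time_period` column is datetime-like, which the `.dt` accessor requires:

```python
import pandas as pd

# Two years of monthly timestamps (hypothetical data, month-start frequency).
df = pd.DataFrame({"time_period": pd.date_range("2010-01-01", periods=24, freq="MS")})

month = pd.get_dummies(df["time_period"].dt.month, prefix="month", dtype=float)

print(month.shape)  # (24, 12)
# Rows that fall in January, one per year.
january_rows = month["month_1"].to_numpy().nonzero()[0].tolist()
print(january_rows)  # [0, 12]
```

Every row has exactly one active indicator, and the same calendar month in different years shares a column, so the model learns one offset per month of the year.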

6. Climate Covariates

CHAP provides future climate data (rainfall, temperature) at prediction time, so we can use these as features directly. This captures the relationship between climate conditions and disease incidence:

def location_season_climate(df):
    location = pd.get_dummies(df["location"], dtype=float)
    month = pd.get_dummies(df["time_period"].dt.month, prefix="month", dtype=float)
    climate = df[["rainfall", "mean_temperature"]].copy()
    return pd.concat([location, month, climate], axis=1)


mae = evaluate(BasicEstimator(location_season_climate), dataset)
print(f"Location + season + climate MAE: {mae:.1f}")

Location + season + climate MAE: 119.7

In practice, climate effects on disease are often delayed (e.g. rainfall affects mosquito breeding over weeks). You can also add lagged climate features using df.groupby("location")["rainfall"].shift(lag), but with limited data, adding many lag features risks overfitting.
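A minimal sketch of such lagged climate features follows. The helper name `add_climate_lags` and the toy values are hypothetical; the column names match those used above, and the frame is assumed to be sorted by time within each location:

```python
import pandas as pd


def add_climate_lags(df, columns=("rainfall", "mean_temperature"), lags=(1, 2)):
    """Return lagged copies of the climate columns, shifted within each location."""
    out = pd.DataFrame(index=df.index)
    for col in columns:
        for lag in lags:
            # groupby keeps values from one location out of another's lags
            out[f"{col}_lag{lag}"] = df.groupby("location")[col].shift(lag)
    return out


# Made-up values for two locations, three months each.
df = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "rainfall": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "mean_temperature": [25.0, 26.0, 27.0, 20.0, 21.0, 22.0],
})
lagged = add_climate_lags(df, lags=(1,))
print(lagged)
```

The first `lag` rows of every location are NaN. In BasicEstimator these rows are dropped by the finite-value mask during training, so each extra lag also shortens the usable training data.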

7. Lagged Target (Disease Cases)

Past disease cases are typically the strongest predictor of future cases. However, a lagged target introduces a technical difficulty: at prediction time, future disease cases are unknown.

The simplest solution is to only use lags at least as long as the forecast horizon. Since we predict up to 3 months ahead, lag 3 is the shortest usable lag: for every forecast month, the value three months earlier falls inside the historic window and is therefore known at prediction time.

def all_features(df):
    location = pd.get_dummies(df["location"], dtype=float)
    month = pd.get_dummies(df["time_period"].dt.month, prefix="month", dtype=float)
    climate = df[["rainfall", "mean_temperature"]].copy()
    lags = pd.DataFrame(index=df.index)
    lags["cases_lag3"] = df.groupby("location")["disease_cases"].shift(3)
    return pd.concat([location, month, climate, lags], axis=1)


mae = evaluate(BasicEstimator(all_features), dataset)
print(f"All features MAE: {mae:.1f}")

All features MAE: 113.5
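The horizon argument can be checked on a toy frame. The sketch below uses made-up counts for one location, five observed months followed by three future months with unknown cases, and counts how many future rows lack a lagged value:

```python
import numpy as np
import pandas as pd

horizon = 3
# Five observed months, then three future months with unknown cases.
combined = pd.DataFrame({
    "location": ["Bokeo"] * 8,
    "disease_cases": [5.0, 7.0, 9.0, 11.0, 13.0, np.nan, np.nan, np.nan],
})
is_future = combined["disease_cases"].isna()

missing = {}
for lag in (1, 3):
    lagged = combined.groupby("location")["disease_cases"].shift(lag)
    # Count future rows whose lagged value would itself be a future (unknown) case count
    missing[lag] = int(lagged[is_future].isna().sum())
    print(f"lag {lag}: {missing[lag]} of {horizon} future rows lack a value")
```

With lag 1, only the first forecast month has a known input. With lag 3, all three forecast months do.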

Using shorter lags (e.g. lag 1 or 2) would require recursive forecasting: predicting one step ahead, feeding that prediction back as input, then predicting the next step. This is more complex to implement and can accumulate errors across steps.
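A minimal sketch of recursive forecasting with a single lag-1 feature is shown below. It is a standalone illustration on made-up data, not part of BasicEstimator: the function name `recursive_forecast` is hypothetical, and the model is a plain least-squares fit of y[t] = a + b * y[t-1].

```python
import numpy as np


def recursive_forecast(history, horizon):
    """Fit y[t] = a + b * y[t-1] by least squares, then roll forward `horizon` steps."""
    y = np.asarray(history, dtype=float)
    # Design matrix: intercept column and the lag-1 value
    A = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    (a, b), *_ = np.linalg.lstsq(A, y[1:], rcond=None)
    preds, last = [], y[-1]
    for _ in range(horizon):
        # Each prediction becomes the lag-1 input for the next step
        last = a + b * last
        preds.append(last)
    return np.array(preds)


history = [10.0, 12.0, 14.0, 16.0, 18.0]
forecast = recursive_forecast(history, horizon=3)
print(forecast)
```

Because every step after the first consumes a predicted value rather than an observed one, any error in an early step propagates into the later ones.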