
Learning statistical and machine learning modelling - with climate-sensitive disease forecasting as focus and case

This material will gradually introduce you to important concepts from statistical modelling and machine learning, focusing on what you will need to understand in order to forecast climate-sensitive disease. It thus selects the set of topics needed for disease forecasting, while mostly introducing the concepts in their general form.

The material is organised around hands-on tutorials, where you will learn how to practically develop models while learning the theoretical underpinnings.

Prerequisites

  • Programming: you must know some programming, in either Python or R, to be able to follow our exercises and tutorials.
  • Data science: You should know basic data science and statistics or machine learning. Very little is presumed, as much can be learnt along the way, but you should know the most basic concepts and terminology.
  • GitHub: Central to our approach is that you follow various practical tutorials along the way. These tutorials are available on GitHub, so you need to know at least the absolute basics of how to get code from GitHub to your local machine - if not, please ask us for tips on GitHub tutorials.

Background

Climate change is rapidly reshaping the patterns and spread of disease, posing urgent challenges for health systems. To respond effectively, health systems must become more adaptive and data-driven. Early warning and response systems (EWS) that leverage disease incidence forecasting offer a promising way to prioritize interventions where and when they are most needed.

At the core of early warning is forecasting disease incidence forward in time. This relies on learning a statistical/machine learning model of how disease progresses over time, based on the various data available.

If you have limited prior experience with statistics or machine learning, please read our brief intro in the expandable box below:

A gentle introduction to statistical and time series modelling

### 1. Statistical or Machine Learning Model

A **model** is a rule or formula we create to describe how some outcome depends on information we already have.

- The outcome we want to understand or predict is often called the **target**.
- The information we use to make that prediction is called **predictors** (or **features**).

A model tries to capture patterns in data in a simplified way. You can think of a model as a machine:

> **input (predictors)** → *model learns a pattern* → **output (prediction)**

The goal is either to **explain** something ("What affects sales?") or to **predict** something ("What will sales be tomorrow?").

---

### 2. Predictors (Features)

A **predictor** is any variable that provides information helpful for predicting the target. Examples:

- Temperature when predicting ice cream sales
- Age when predicting income
- Yesterday's stock price when predicting today's

Predictors are the model's *inputs*. We usually write them as numbers:

- A single predictor as *x*
- Several predictors as *x₁, x₂, x₃,* …

The model learns how each predictor is related to the target.

---

### 3. Linear Regression

**Linear regression** is one of the simplest and most widely used models. It assumes that the target is approximately a **straight-line combination** of its predictors. With one predictor *x*, the model is:

> **prediction = a + b·x**

- **a** is the model's baseline (what we predict when *x = 0*)
- **b** tells us how much the prediction changes when *x* increases by 1 unit

With multiple predictors *x₁, x₂, …*, we extend the same idea:

> **prediction = a + b₁·x₁ + b₂·x₂ + …**

You don't need to imagine shapes in many dimensions, just think of it as a recipe where each predictor gets a weight (**b**) that shows how important it is. The model "learns" values of *a, b₁, b₂, …* by choosing them so that predictions are as close as possible to the observed data.

---

### 4. Time Series

A **time series** is a sequence of data points collected over time, in order:

> value at time 1, value at time 2, value at time 3, …

Examples:

- Daily temperatures
- Hourly website traffic
- Monthly number of customers

What makes time series special is that:

- **The order matters**
- **Past values can influence future values**
- Data may show **patterns** such as trends (general increase/decrease over time) or seasonality (repeating patterns, like higher electricity use every winter)

---

### 5. Time Series Forecasting

**Forecasting** means using past observations to predict future ones. Unlike models that treat each data point separately, forecasting models learn ideas like:

- how the series tends to move (trend)
- whether it repeats patterns (seasonality)
- how strongly the recent past influences the next value

A simple forecasting idea is to predict the next value using a weighted average of recent past values. More advanced methods learn more complex patterns automatically.

---

### 6. Evaluation of Predictions

Once a model makes predictions, we need to measure how good they are. This means comparing the model's predictions to the actual values. Let:

- **actual** value = *y*
- **predicted** value = *ŷ* (read as "y-hat")

The **error** is:

> **error = actual − predicted = y − ŷ**

Common ways to summarize how large the errors are:

- **MAE (Mean Absolute Error):** average of |y − ŷ| (the average size of the mistakes)
- **MSE (Mean Squared Error):** average of (y − ŷ)² (large mistakes count extra)
- **RMSE (Root Mean Squared Error):** the square root of MSE (in the same units as the data)
- **MAPE (Mean Absolute Percentage Error):** how large the errors are *relative* to the actual values, in %

These measures help us compare models and choose the one that predicts best.

For a bit more in-depth introduction, please also consider the following general papers:

- https://pmc.ncbi.nlm.nih.gov/articles/PMC5905345/
- https://www.nature.com/articles/nmeth.3627
- https://www.nature.com/articles/nmeth.3665
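To make these measures concrete, here is a minimal sketch in Python (using numpy); the numbers are made up purely for illustration:

```python
import numpy as np

# Observed and predicted values (illustrative numbers only)
y = np.array([120, 95, 140, 160, 110], dtype=float)       # actual case counts
y_hat = np.array([100, 90, 150, 170, 130], dtype=float)   # model predictions

errors = y - y_hat

mae = np.mean(np.abs(errors))                 # Mean Absolute Error
mse = np.mean(errors ** 2)                    # Mean Squared Error
rmse = np.sqrt(mse)                           # Root Mean Squared Error
mape = np.mean(np.abs(errors / y)) * 100      # Mean Absolute Percentage Error (in %)

print(f"MAE={mae:.1f}, MSE={mse:.1f}, RMSE={rmse:.1f}, MAPE={mape:.1f}%")
```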

Motivation

Our tutorial aims to introduce aspects of statistical modelling and machine learning that are useful specifically for developing, evaluating and later operationalising forecasting models. Our pedagogical approach is to begin by introducing a very simple model in a simple setting, and then to expand both the model and the setting in a stepwise fashion. We emphasize interoperability and rigorous evaluation of models right from the start, as a way of guiding the development of more sophisticated models. In doing this, we follow a philosophy resembling what is known as agile development in computer science [ref]. To facilitate interoperability and evaluation of models, we rely on the Chap platform [https://dhis2-chap.github.io/chap-core], which enforces standards of interoperability already from the first, simple model. This interoperability allows models to be run on a broad collection of data and rigorously evaluated, with rich visualisations of data and predictions, already from the early phases of development.

Making your first model and getting it into Chap

Disease forecasting is a type of problem within what is known as spatiotemporal modelling [ref] in the field of statistics/ML. What this means is that the data have both a temporal and a spatial reference (i.e. disease incidence data are available at multiple subsequent time points, for different regions of a country), and observations that are close in time or space tend to be more similar than observations that are far apart. In our case, we also have data both on disease and on various climate variables that may influence disease incidence.

Before going into the many challenges of spatiotemporal modelling, we recommend that you get the technical setup in place to allow efficient development and learning for the remainder of this tutorial. Although this can be a bit of a technical nuisance right now, it allows you to run your model on a variety of data inputs with rich evaluation from the start, and to progress efficiently with very few technical wrinkles as the statistical and ML aspects become more advanced. To do this, please follow our minimalist example tutorial, which introduces an extremely oversimplified statistical model (linear regression of immediate climate effects on disease only), but shows you how to get it running in Chap. This minimalist tutorial is available both as Python and as R code:

* https://github.com/dhis2-chap/minimalist_example (Python)
* https://github.com/dhis2-chap/minimalist_example_r (R)
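As a rough idea of what such an oversimplified model looks like, here is a hedged sketch in Python; the column names (rainfall, mean_temperature, disease_cases) and the data are illustrative assumptions, not the exact format used by the tutorials or Chap:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative training data: one row per time period
train = pd.DataFrame({
    "rainfall":         [10, 120, 200, 80, 5, 60],
    "mean_temperature": [24, 27, 29, 26, 23, 25],
    "disease_cases":    [15, 40, 85, 30, 10, 25],
})

# Regress disease cases directly on same-period climate (deliberately oversimplified)
model = LinearRegression()
model.fit(train[["rainfall", "mean_temperature"]], train["disease_cases"])

# "Forecast" for a future period where only climate values are available
future = pd.DataFrame({"rainfall": [90], "mean_temperature": [28]})
print(model.predict(future))
```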

Evaluating a model

The purpose of spatiotemporal modelling is to learn generalisable patterns that can be used to reason about unseen regions or about the future. Since our use case is an early warning system, our focus is on the latter, i.e. forecasting disease incidence ahead in time based on historic data for a given region. Therefore, we will focus on evaluating a model through its forecasting skill into the future.

A straightforward way to assess a forecasting model is to create and record forecasts for future disease development, wait to see how disease truly develops, and then afterwards compare the observed numbers to what was forecasted. This approach has two main limitations, though:

  • it requires waiting through the forecast period to see what the observations turn out to be
  • it only shows the prediction skill of the forecast model at a single snapshot in time, leaving large uncertainty about how the system may be expected to behave if used to forecast new future periods

A popular and powerful alternative is thus to perform what is called backtesting or hindcasting: one pretends to be at a past point in time, provides the model exclusively with data that was available before this pretended point in time, makes forecasts beyond that time point (for which no information was available to the model), and then assesses how close these forecasts come to what actually happened after the pretended time point. When performed correctly, this resolves both of the issues mentioned above for forecasts truly made into the future: since observations after the pretended time point are already available in historic records, assessment can be performed instantly, and one can choose several different pretended time points, reducing the uncertainty of the estimated prediction skill and also revealing how prediction skill varies across time.
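As a minimal sketch of a single backtest split, assuming a plain pandas series of case counts and a deliberately naive "repeat the last value" forecast (purely to illustrate the mechanics):

```python
import numpy as np
import pandas as pd

# Illustrative monthly case counts, ordered by time
cases = pd.Series([12, 18, 25, 40, 60, 55, 35, 20, 15, 22, 30, 50])

split = 9                          # the pretended "current" time point
train = cases.iloc[:split]         # only what was known before that point
test = cases.iloc[split:]          # what actually happened afterwards (hidden from the model)

# A deliberately naive forecast: repeat the last observed value over the whole horizon
forecast = np.repeat(train.iloc[-1], len(test))

mae = np.mean(np.abs(test.values - forecast))
print(f"MAE over the pretended future: {mae:.1f}")
```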

To be truly representative of future use, it is crucial that the pretended time point for forecasting realistically reflects a situation where the future is not known. There are a myriad of pitfalls in the assessment setup that can let future information (beyond the pretended time point) leak into the models. This is discussed in more detail here: https://docs.google.com/document/d/1Hr7Wz4Yc4ZKZ6fsFJI_lpLO8d1SSxtfTeqByWZrQqFo/edit?tab=t.0

Prediction skill can be measured in different ways. One simple way to measure this is to look at how far off the predictions are, on average, from the true values (known as mean absolute error, MAE). Other common measures are discussed later in this tutorial, after we have introduced further aspects of modelling. To make the most of the data we have, we often use a method called cross-validation. This means we repeatedly split the data into “past” and “future” parts at different time points. We then make forecasts for each split and check how accurate those forecasts are. This helps us see how well the model performs across different periods of time. To learn more, Wikipedia has a broad introduction, including specifics for time series models: https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Cross_validation_for_time-series,_spatial_and_spatiotemporal_models
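The sketch below (again with a naive placeholder forecast) shows the mechanics of repeating the split at several pretended time points, a rolling-origin style of time-series cross-validation; the series and horizon are illustrative:

```python
import numpy as np
import pandas as pd

cases = pd.Series([12, 18, 25, 40, 60, 55, 35, 20, 15, 22, 30, 50, 70, 65, 45, 28])
horizon = 3                                        # forecast 3 periods ahead at each split

maes = []
for split in range(8, len(cases) - horizon + 1):   # several pretended "current" time points
    train = cases.iloc[:split]
    test = cases.iloc[split:split + horizon]
    forecast = np.repeat(train.iloc[-1], horizon)  # placeholder model: repeat the last value
    maes.append(np.mean(np.abs(test.values - forecast)))

print(f"MAE per split: {np.round(maes, 1)}")
print(f"Average MAE across splits: {np.mean(maes):.1f}")
```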

Since we are running our models through Chap, we can lean on an already implemented solution to avoid pitfalls and ensure a data-efficient and honest evaluation of the models we develop. Chap also includes several metrics, including MAE, and offers several convenient visualisations to provide insight into prediction skill. To get a feeling for this, please follow the code-along tutorials on assessment with Chap. We recommend starting with a simple code-along tutorial on how to split data and compute MAE on the pretended future: https://github.com/dhis2-chap/Assessment_example_singlepred

After getting a feeling for assessment, please continue with our code tutorial on how to perform more sophisticated evaluation of any model using the built-in Chap evaluation functionality (rather than implementing your own simple evaluation from scratch, as in the previous code-along tutorial): https://github.com/dhis2-chap/Assessment_example_chap_compatible

With this evaluation setup in place, you should be ready to start looking at more sophisticated modelling approaches. For each new version of your model, evaluation helps you check that the new version is actually an improvement (and if so, in which sense: for instance short-term vs long-term forecasts, accuracy vs calibration, large vs small data sets).

Expanding your model to make it more meaningful

Multiple regions

While it may in some settings be useful to forecast disease at the national level, it is often more operationally relevant to create forecasts for smaller regions within the country, for instance at district level. A disease forecasting approach therefore needs to handle multiple districts in the code, creating forecasts per district. If a single model is trained across district data (ignoring the fact that there are different districts) and used to forecast disease directly, without taking differences between districts into account in any way, it would forecast similar levels of disease across districts regardless of the disease prevalence in each particular district. To see this more concretely, please follow this tutorial, which shows the errors that the minimalist_example model (which ignores districts) makes for two districts, D1 and D2, with respectively high and low prevalence: https://github.com/dhis2-chap/Assessment_example_multiregion

The simplest approach to creating meaningful region-level forecasts is simply to train and forecast with a separate model per region. Please follow the tutorial below (in its Python or R version) to see an easy way of doing this in code:

* https://github.com/dhis2-chap/minimalist_multiregion (Python)
* https://github.com/dhis2-chap/minimalist_multiregion_r (R)
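The tutorials contain the actual code; as a hedged sketch of the general idea, assuming a long-format table with illustrative column names (region, rainfall, disease_cases):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative data: the same time periods for two regions with very different levels
df = pd.DataFrame({
    "region":        ["D1"] * 4 + ["D2"] * 4,
    "rainfall":      [10, 120, 200, 80, 10, 120, 200, 80],
    "disease_cases": [150, 400, 850, 300, 5, 12, 20, 9],
})

# Train one independent model per region
models = {}
for region, group in df.groupby("region"):
    m = LinearRegression()
    m.fit(group[["rainfall"]], group["disease_cases"])
    models[region] = m

# Forecast for each region with its own model
future = pd.DataFrame({"rainfall": [90]})
for region, m in models.items():
    print(region, m.predict(future))
```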

However, such independent consideration of each region also has several weaknesses. A main weakness is connected to the amount of data available to learn a good model. When learning a separate model per region, there may be a scarcity of data to fit a model of the desired complexity. This relates to a general principle in statistics concerning the amount of data available to fit a model versus the number of parameters in the model (see e.g. https://www.researchgate.net/publication/307526307_Points_of_Significance_Model_selection_and_overfitting). More available data allows for a larger number of parameters and a more complex model. Compared to the case with separate models per region, simply combining data across all regions into a single model where the parameters are completely independent between districts does not change the ratio of available data versus parameters to be estimated. However, if the parameters are dependent (for example due to a spatial structure, i.e. similarities between regions), the effective number of parameters will be lower than the actual number. There is a trade-off between having completely independent parameters on the one hand and forcing parameters to be equal across regions on the other. This is often described as "borrowing strength" between regions. It is also related to the concept of the bias-variance tradeoff in statistics and machine learning (see e.g. https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff ), where dependency between parameter values across regions introduces a small bias in each region towards the mean across regions, while reducing the variance of the parameter estimates due to more efficient use of data.

As the disease context (including the number of disease cases) can vary greatly between regions, the same type of model is not necessarily suited to all regions. Taking this into account can, however, be complex, especially if one wants to combine model flexibility with the concept of borrowing strength mentioned above. It is thus often more practical to use a single model across regions, but to ensure that this model is flexible enough to handle such heterogeneity across regions.
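One simple way (among several) to let a single model adapt to differences between regions is to include the region itself as a one-hot encoded predictor, so that each region at least gets its own baseline level; a minimal sketch with the same illustrative columns as above:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "region":        ["D1"] * 4 + ["D2"] * 4,
    "rainfall":      [10, 120, 200, 80, 10, 120, 200, 80],
    "disease_cases": [150, 400, 850, 300, 5, 12, 20, 9],
})

# One-hot encode the region so a single model can learn region-specific baselines
X = pd.get_dummies(df[["region", "rainfall"]], columns=["region"])
model = LinearRegression().fit(X, df["disease_cases"])

# Predict for region D2 at 90 mm rainfall (columns must line up with the training matrix)
new = pd.DataFrame({"rainfall": [90], "region_D1": [0], "region_D2": [1]})
print(model.predict(new[X.columns]))
```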

Lags and autoregressive effects

The incidence of disease today is usually affected by the incidence of disease in the recent past. This is, for instance, almost always the case for infectious diseases, whether transmission is directly human-to-human or via a vector like mosquitoes (e.g. malaria and dengue). It is thus usually advisable to include past cases of disease as a predictor in a model. This is typically referred to as including autoregressive effects in the model.

Additionally, climate variables such as rainfall and temperature usually do not have an instantaneous effect. Typically, there is no way that (for example) rainfall today would influence cases today or in the next few days. Instead, heavy rainfall today could alter standing water, affecting mosquito development and behavior, with effects on reported malaria cases appearing several weeks later. This means that models should typically make use of past climatic data to predict disease incidence ahead. The period between the time point that a given data value reflects and the time point of its effect is referred to as a lag. The effect of a predictor on disease may vary, and may even be opposite, when considered at different lags. A model should usually include predictors at several different time lags, and aim to learn which lags are important and what the effect of a given predictor is at each such lag.

Available data can come at different time resolutions. In some contexts, reported disease cases are available with specific time stamps, but more often what is collected is aggregated and made available at weekly or monthly resolution. The available data resolution influences how precisely variation in predictor effects across lag times can be represented.

Basically, the influence of each predictor at each time lag will be represented by a separate parameter in the model. As discussed in the previous section on multiple regions, this can lead to too many parameters to estimate effectively. It is thus common to employ some form of smoothing of a predictor's effects across lags, following a reasoning similar to that of borrowing strength across regions (here, borrowing strength across lag times instead).

At a practical level, model training often assumes that all predictors to be used for a given prediction (including lagged variables) are available in the same row of the input data table. It is thus common to modify the data tables so that lagged predictors line up. To see a simple way of doing this in code, follow this tutorial in Python or R:

* https://github.com/dhis2-chap/minimalist_example_lag (Python)
* https://github.com/dhis2-chap/minimalist_example_lag_r (R)
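A minimal sketch of this lining-up step in Python, using pandas shift on illustrative columns (the tutorials show the full details):

```python
import pandas as pd

df = pd.DataFrame({
    "rainfall":      [10, 120, 200, 80, 5, 60, 150, 90],
    "disease_cases": [15, 20, 45, 90, 60, 30, 25, 55],
})

# Shift predictors so each row also holds values from 1, 2 and 3 periods back,
# plus last period's cases as a simple autoregressive term
for lag in (1, 2, 3):
    df[f"rainfall_lag{lag}"] = df["rainfall"].shift(lag)
df["cases_lag1"] = df["disease_cases"].shift(1)

# The first rows have no complete lag history, so they are dropped before model fitting
df_lagged = df.dropna()
print(df_lagged)
```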

Expanding your model with uncertainty

  • What uncertainty is and why it matters
  • How to generically represent uncertainty: provide an empirical distribution (samples)
  • Add this by adding a simple stochastic noise term to a single deterministic prediction from a model (see the sketch after this list)
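A minimal sketch of this last idea, assuming a Gaussian noise term with a hand-picked standard deviation (a real model should of course estimate the spread from data):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

point_forecast = 42.0        # a single deterministic prediction from some model
n_samples = 1000             # size of the empirical distribution we hand over

# Turn the point forecast into samples by adding a simple stochastic noise term
samples = point_forecast + rng.normal(loc=0.0, scale=10.0, size=n_samples)

# Summaries of the empirical distribution, e.g. a 90% prediction interval
low, high = np.percentile(samples, [5, 95])
print(f"Median: {np.median(samples):.1f}, 90% interval: [{low:.1f}, {high:.1f}]")
```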

Evaluating predicted uncertainty

  • The concept of uncertainty calibration (and why it is operationally important)

Honest and sophisticated evaluation in more detail

  • (see also separate manuscript about this)
  • Time series cross-validation (growing/rolling window)

Relating to data quality issues, including missing data

More realistic (sophisticated) models

  • Bayesian hierarchical modelling, including INLA, mentioning Stan
  • Autoregressive ML
  • ARIMA and variants
  • Deep learning

Systematic evaluation on simulated data for debugging, understanding and stress-testing models

  • Simulating from the model or something presumed close to the model - see if the model behaves as expected and how much data is needed to reach the desired behavior (see the sketch after this list)
  • Simulate specific scenarios to stress-test models
  • Develop more realistic simulations
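A minimal sketch of the first idea: simulate data from a known, simple "truth", fit a model to the simulated data, and check how much data is needed before the model recovers the truth reasonably well (all numbers are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=0)
true_intercept, true_slope = 20.0, 0.5      # the "ground truth" we simulate from

for n in (20, 200, 2000):                   # how much data does the model need?
    rainfall = rng.uniform(0, 200, size=n)
    cases = true_intercept + true_slope * rainfall + rng.normal(0, 10, size=n)

    fit = LinearRegression().fit(rainfall.reshape(-1, 1), cases)
    print(f"n={n:5d}  estimated slope={fit.coef_[0]:.3f}  (true {true_slope})")
```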

Selecting meaningful features (covariates) and data

Further resources

Some general resources - not properly screened for now:

* https://tjfisher19.github.io/introStatModeling/
* https://www.statlearning.com/
* https://www.kaggle.com/code/iamleonie/intro-to-time-series-forecasting
* https://www.kaggle.com/code/iamleonie/time-series-forecasting-building-intuition
* https://www.nature.com/articles/nmeth.4014
* https://www.nature.com/articles/nmeth.2613