This tutorial walks you through building a Chap-compatible model step-by-step, using a validation-first approach to ensure your model works correctly before deploying it.

Setup

library(chapr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Step 1: Understand the Function Interface

Every Chap model requires two functions:

Training function:

train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
  # training_data: tsibble with time_period index, location key,
  #                disease_cases, and covariates
  # model_configuration: optional list of parameters
  # run_info: runtime info from Chap (prediction_length, additional_continuous_covariates)
  # Returns: any model object (list, fitted model, etc.)
}

Prediction function:

predict_fn <- function(historic_data, future_data, saved_model,
                       model_configuration = list(), run_info = list()) {
  # historic_data: tsibble with historical observations
  # future_data: tsibble with time periods to predict
  # saved_model: the object returned by train_fn
  # run_info: runtime info from Chap
  # Returns: tibble with samples list-column containing numeric vectors
  #   - For deterministic models: single sample per row (e.g., samples = list(c(42)))
  #   - For probabilistic models: multiple samples per row (e.g., 1000 samples)
  #
  # IMPORTANT: historic_data may contain more recent data than training_data.
  # For time series models, you should refit to historic_data before forecasting.
  # Use saved_model for hyperparameters/structure, not the fitted model itself.
  # See examples/arima_model/ for a demonstration of this pattern.
}
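The required return shape can be illustrated without any model at all. Here is a minimal base-R sketch (the values are placeholders; only the column names `time_period`, `location`, and `samples` come from the interface above):

```r
# A one-row prediction in the required shape: a data frame whose
# `samples` column is a list of numeric vectors.
preds <- data.frame(
  time_period = "2013-04",
  location    = "Bokeo",
  stringsAsFactors = FALSE
)
preds$samples <- list(c(42))  # deterministic: a single sample per row

is.list(preds$samples)          # samples is a list-column
is.numeric(preds$samples[[1]])  # each element is a numeric vector
```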

Step 2: Explore the Example Data

The SDK provides example datasets for testing. Let’s examine the Laos monthly data:

data <- get_example_data('laos', 'M')
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
names(data)
#> [1] "training_data" "historic_data" "future_data"   "predictions"

The example data contains four tsibbles. Each has time_period as the index and location as the key:

Training data - what your model learns from:

data$training_data
#> # A tsibble: 1,057 x 12 [1M]
#> # Key:       location [7]
#>    time_period rainfall mean_temperature disease_cases population parent
#>          <mth>    <dbl>            <dbl>         <dbl>      <dbl> <chr> 
#>  1    2000 Jul   430.               23.4             0     58503. -     
#>  2    2000 Aug   322.               23.8             0     58503. -     
#>  3    2000 Sep   265.               22.7             0     58503. -     
#>  4    2000 Oct   103.               22.6             0     58503. -     
#>  5    2000 Nov    19.7              20.3             0     58503. -     
#>  6    2000 Dec    26.0              19.1             0     58503. -     
#>  7    2001 Jan    17.6              19.8             0     60157. -     
#>  8    2001 Feb     7.28             22.0             0     60157. -     
#>  9    2001 Mar   123.               22.6             0     60157. -     
#> 10    2001 Apr    29.6              27.5             0     60157. -     
#> # ℹ 1,047 more rows
#> # ℹ 6 more variables: location <chr>, Cases <dbl>, E <dbl>, month <dbl>,
#> #   ID_year <dbl>, ID_spat <chr>

Historic data - observations available at prediction time:

data$historic_data
#> # A tsibble: 1,071 x 12 [1M]
#> # Key:       location [7]
#>    time_period rainfall mean_temperature disease_cases population parent
#>          <mth>    <dbl>            <dbl>         <dbl>      <dbl> <chr> 
#>  1    2000 Jul   430.               23.4             0     58503. -     
#>  2    2000 Aug   322.               23.8             0     58503. -     
#>  3    2000 Sep   265.               22.7             0     58503. -     
#>  4    2000 Oct   103.               22.6             0     58503. -     
#>  5    2000 Nov    19.7              20.3             0     58503. -     
#>  6    2000 Dec    26.0              19.1             0     58503. -     
#>  7    2001 Jan    17.6              19.8             0     60157. -     
#>  8    2001 Feb     7.28             22.0             0     60157. -     
#>  9    2001 Mar   123.               22.6             0     60157. -     
#> 10    2001 Apr    29.6              27.5             0     60157. -     
#> # ℹ 1,061 more rows
#> # ℹ 6 more variables: location <chr>, Cases <dbl>, E <dbl>, month <dbl>,
#> #   ID_year <dbl>, ID_spat <chr>

Notice that historic_data extends beyond training_data:

# Training data ends at:
max(data$training_data$time_period)
#> <yearmonth[1]>
#> [1] "2013 Jan"

# Historic data ends at:
max(data$historic_data$time_period)
#> <yearmonth[1]>
#> [1] "2013 Mar"

This is a key concept: when Chap calls your prediction function, historic_data may contain more recent observations than what the model was trained on. For time series models (ARIMA, exponential smoothing, etc.), you should refit the model to historic_data before forecasting. See examples/arima_model/ for a demonstration of this pattern.
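The refit pattern can be sketched with base R's `stats::arima` standing in for whatever model your training function chose (the AR(1) order and simulated series here are illustrative, not part of the SDK):

```r
# train stores only structure/hyperparameters, not a fitted object.
train_sketch <- function(training_series) {
  list(order = c(1, 0, 0))  # e.g. an AR(1) specification
}

# predict refits that structure to the (possibly newer) historic data.
predict_sketch <- function(historic_series, saved_model, h) {
  refit <- stats::arima(historic_series, order = saved_model$order)
  as.numeric(predict(refit, n.ahead = h)$pred)
}

set.seed(42)
saved <- train_sketch(NULL)
fc <- predict_sketch(arima.sim(list(ar = 0.5), n = 120), saved, h = 3)
length(fc)  # one point forecast per horizon step
```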

Future data - time periods to predict (no disease_cases):

data$future_data
#> # A tsibble: 21 x 10 [1M]
#> # Key:       location [7]
#>    time_period rainfall mean_temperature population parent location      E month
#>          <mth>    <dbl>            <dbl>      <dbl> <chr>  <chr>     <dbl> <dbl>
#>  1    2013 Apr     39.5             26.8     80014. -      Bokeo    8.00e4     4
#>  2    2013 May    170.              25.8     80014. -      Bokeo    8.00e4     5
#>  3    2013 Jun    231.              24.7     80014. -      Bokeo    8.00e4     6
#>  4    2013 Apr    152.              27.2    731598. -      Champas… 7.32e5     4
#>  5    2013 May    236.              26.3    731598. -      Champas… 7.32e5     5
#>  6    2013 Jun    327.              25.1    731598. -      Champas… 7.32e5     6
#>  7    2013 Apr     58.8             25.1    124396. -      LouangN… 1.24e5     4
#>  8    2013 May    162.              24.7    124396. -      LouangN… 1.24e5     5
#>  9    2013 Jun    184.              24.0    124396. -      LouangN… 1.24e5     6
#> 10    2013 Apr     72.5             25.0    282683. -      Oudomxai 2.83e5     4
#> # ℹ 11 more rows
#> # ℹ 2 more variables: ID_year <dbl>, ID_spat <chr>

Example predictions - what your model should output:

data$predictions
#> # A tibble: 21 × 3
#>    time_period location     samples      
#>    <chr>       <chr>        <list>       
#>  1 2013-04     Bokeo        <dbl [1,000]>
#>  2 2013-05     Bokeo        <dbl [1,000]>
#>  3 2013-06     Bokeo        <dbl [1,000]>
#>  4 2013-04     Champasak    <dbl [1,000]>
#>  5 2013-05     Champasak    <dbl [1,000]>
#>  6 2013-06     Champasak    <dbl [1,000]>
#>  7 2013-04     LouangNamtha <dbl [1,000]>
#>  8 2013-05     LouangNamtha <dbl [1,000]>
#>  9 2013-06     LouangNamtha <dbl [1,000]>
#> 10 2013-04     Oudomxai     <dbl [1,000]>
#> # ℹ 11 more rows

The predictions tibble has a samples list-column where each element is a numeric vector. Let’s look at the structure:

# Each row has a vector of samples
data$predictions$samples[[1]]
#>    [1]  9  5 46  5  3  8  1  9  9 14  3  7  6  7  3 10 11  2 17 13 12  0  3  1
#>   [25] 10 38 19  7 15  0  1 11  1  8  7 43 11  3  4  8  1  4  3  7 24  4  1  6
#>   [49] 11  3  9  0  0 16  0 11 17  1  3  8 13  5  5 22  2  1  5 26 11 31  4  3
#>   ... (output truncated; the full vector holds 1,000 samples)

For probabilistic models, each vector contains multiple Monte Carlo samples (e.g., 1000). For deterministic models, use a single sample per row: samples = list(c(42)).
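Both shapes are ordinary list-columns; only the vector length differs. A quick base-R illustration (the Poisson rate of 8 is arbitrary):

```r
set.seed(1)
# Deterministic: one sample per row.
det_samples  <- list(c(42))
# Probabilistic: many Monte Carlo draws per row.
prob_samples <- list(rpois(1000, lambda = 8))

lengths(det_samples)   # 1
lengths(prob_samples)  # 1000
```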

Step 3: Validate Before Implementing

Before writing any model logic, let’s see what the validation expects. Start with stub functions:

train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
  list(dummy = 1)
}

predict_fn <- function(historic_data, future_data, saved_model,
                       model_configuration = list(), run_info = list()) {
  future_data
}

result <- validate_model_io(train_fn, predict_fn, data)
result$success
#> [1] FALSE
result$errors
#> [1] "Predictions must have a 'samples' list-column containing numeric vectors"

The validation tells us exactly what’s missing: the samples list-column in predictions.

Step 4: Implement a Simple Mean Model

Now let’s implement a minimal model that predicts the historical mean for each location. Since all models must return a samples list-column, we wrap the single prediction value in a list:

train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
  means <- training_data |>
    as_tibble() |>
    summarise(mean_cases = mean(disease_cases, na.rm = TRUE), .by = location)
  list(means = means)
}

predict_fn <- function(historic_data, future_data, saved_model,
                       model_configuration = list(), run_info = list()) {
  future_data |>
    left_join(saved_model$means, by = "location") |>
    mutate(samples = purrr::map(mean_cases, ~c(.x))) |>
    select(-mean_cases)
}

Note: We call as_tibble() in the training function because summarising a tsibble preserves its time index; dropping to a plain tibble first lets summarise(.by = location) collapse across time. The prediction function works directly on tsibbles, since left_join() and mutate() preserve the tsibble structure.

Step 5: Validate the Implementation

result <- validate_model_io(train_fn, predict_fn, data)
result$success
#> [1] TRUE
result$n_predictions
#> [1] 21

The validation passes, and the model produced 21 predictions (3 months × 7 locations).

Step 6: Validate Against All Datasets

The SDK can validate against all available example datasets:

result <- validate_model_io_all(train_fn, predict_fn)
result$success
#> [1] TRUE
names(result$results)
#> [1] "laos_M"

Step 7: Create the CLI

Once validation passes, wrap your model in a CLI.

First, create a new directory for your model project:

In terminal:

mkdir my_model
cd my_model

Then create a file called model.R inside this directory:

library(chapr)
library(dplyr)

train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
  means <- training_data |>
    as_tibble() |>
    summarise(mean_cases = mean(disease_cases, na.rm = TRUE), .by = location)
  list(means = means)
}

predict_fn <- function(historic_data, future_data, saved_model,
                       model_configuration = list(), run_info = list()) {
  future_data |>
    left_join(saved_model$means, by = "location") |>
    mutate(samples = purrr::map(mean_cases, ~c(.x))) |>
    select(-mean_cases)
}

if (!interactive()) {
  create_chap_cli(train_fn, predict_fn)
}

Step 8: Use the CLI

Your model is now ready for command-line use. First, export the example data to CSV files for testing.

In R:

# Export example data for testing
data <- get_example_data('laos', 'M')
write.csv(as.data.frame(data$training_data), "training_data.csv", row.names = FALSE)
write.csv(as.data.frame(data$historic_data), "historic.csv", row.names = FALSE)
write.csv(as.data.frame(data$future_data), "future.csv", row.names = FALSE)

Now you can test the CLI.

In terminal:

# Train the model
Rscript model.R train --data training_data.csv

# Generate predictions
Rscript model.R predict --historic historic.csv --future future.csv \
    --output predictions.csv

# Display model info
Rscript model.R info

The CLI automatically handles:

  • Loading CSV files
  • Converting to tsibbles
  • Saving the model as RDS
  • Writing predictions to CSV (converting nested samples to wide format)

Step 9: Create MLproject for chap-core

To run your model with chap-core, you need an MLproject file. The SDK can generate this automatically using generate_mlproject().

Prerequisites

First, set up renv for reproducible dependencies:

renv::init()
renv::install("dhis2-chap/chap_r_sdk")
renv::snapshot()

Generate the MLproject File

library(chapr)

generate_mlproject(model_name = "my_mean_model")

This creates an MLproject file like:

name: my_mean_model
renv_env: renv.lock
entry_points:
  train:
    parameters:
      train_data: str
      model: str
    command: Rscript model.R train --data {train_data} --model {model}
  predict:
    parameters:
      historic_data: str
      future_data: str
      model: str
      out_file: str
    command: Rscript model.R predict --historic {historic_data} --future {future_data}
      --model {model} --output {out_file}

Run with chap-core

Once you have the MLproject and renv.lock files, run your model with chap-core:

chap evaluate --model-name ./my_model_directory --dataset-csv data.csv

See examples/arima_model/ for a complete example with renv and MLproject.

Probabilistic Models

For probabilistic forecasting, include multiple Monte Carlo samples instead of a single value:

predict_fn <- function(historic_data, future_data, saved_model,
                       model_configuration = list(), run_info = list()) {
  n_samples <- 1000

  future_data |>
    left_join(saved_model$means, by = "location") |>
    rowwise() |>
    mutate(
      # Generate 1000 samples from Poisson distribution
      samples = list(rpois(n_samples, lambda = mean_cases))
    ) |>
    ungroup() |>
    select(-mean_cases)
}

The samples column is a list-column where each element is a numeric vector. The CLI automatically converts this to wide CSV format (sample_0, sample_1, …) for Chap.
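That wide layout can be sketched in base R (column names follow the sample_0, sample_1, … convention just described; this helper is illustrative, not the SDK's implementation):

```r
# One prediction row with three samples, converted to wide format.
preds <- data.frame(time_period = "2013-04", location = "Bokeo")
preds$samples <- list(c(9, 5, 46))

sample_mat <- do.call(rbind, preds$samples)  # one row per prediction
colnames(sample_mat) <- paste0("sample_", seq_len(ncol(sample_mat)) - 1)
wide <- cbind(preds[c("time_period", "location")], as.data.frame(sample_mat))

names(wide)  # time_period, location, sample_0, sample_1, sample_2
```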

Working with Samples

The SDK provides utility functions for working with sample-based predictions:

# Convert nested samples to wide format
wide_preds <- predictions_to_wide(nested_preds)

# Convert to long format for scoringutils
long_preds <- predictions_to_long(nested_preds)

# Compute quantiles for hub submissions
quantile_preds <- predictions_to_quantiles(nested_preds)

# Add summary statistics (mean, median, CIs)
preds_with_summary <- predictions_summary(nested_preds)

Configuration Schemas

You can define a configuration schema to validate user-provided settings and provide default values. The SDK uses JSON Schema (draft-07) for validation.

Defining a Schema

Use the schema helper functions to define type-safe configuration options:

my_schema <- create_config_schema(
  title = "My Model Configuration",
  description = "Configuration options for my forecasting model",
  properties = list(
    # Integer with range constraints
    n_samples = schema_integer(
      description = "Number of Monte Carlo samples for predictions",
      default = 100L,
      minimum = 1L,
      maximum = 10000L
    ),
    # Number (float) with bounds
    learning_rate = schema_number(
      description = "Learning rate for optimization",
      default = 0.01,
      minimum = 0,
      maximum = 1
    ),
    # Enum: one of a fixed set of values
    method = schema_enum(
      values = c("arima", "ets", "prophet"),
      description = "Forecasting method to use",
      default = "arima"
    ),
    # Boolean flag
    use_covariates = schema_boolean(
      description = "Whether to include covariates in the model",
      default = TRUE
    ),
    # String with optional pattern constraint
    date_format = schema_string(
      description = "Date format for output",
      default = "%Y-%m-%d"
    ),
    # Array of values
    lag_values = schema_array(
      items = list(type = "integer"),
      description = "Lag periods to include",
      default = list(1L, 2L, 3L)
    )
  ),
  required = c("n_samples")  # Mark required fields
)

# View the schema
print(my_schema)
#> Chap Configuration Schema
#> =========================
#> 
#> Title: My Model Configuration 
#> Description: Configuration options for my forecasting model 
#> 
#> Properties:
#>   n_samples * (integer) [default: 100]
#>     Number of Monte Carlo samples for predictions
#>   learning_rate (number) [default: 0.01]
#>     Learning rate for optimization
#>   method (enum(arima, ets, prophet)) [default: "arima"]
#>     Forecasting method to use
#>   use_covariates (boolean) [default: true]
#>     Whether to include covariates in the model
#>   date_format (string) [default: "%Y-%m-%d"]
#>     Date format for output
#>   lag_values (array) [default: [1,2,3]]
#>     Lag periods to include
#> 
#> * = required

Using the Schema with CLI

Pass the schema to create_chap_cli() to enable automatic validation:

if (!interactive()) {
  create_chap_cli(train_fn, predict_fn, model_config_schema = my_schema)
}

When a user provides a configuration file (YAML or JSON), the CLI will:

  1. Validate the config against the schema (type checking, range constraints, enum values)
  2. Apply defaults for any missing optional parameters
  3. Report errors with clear messages if validation fails

Example config.yaml:

n_samples: 500
learning_rate: 0.05
method: ets

Manual Validation

You can also validate configurations manually:

# Valid configuration
config <- list(n_samples = 500L, method = "ets")
result <- validate_config(config, my_schema)
result$valid
#> [1] TRUE

# Invalid configuration (value out of range)
bad_config <- list(n_samples = -5L)
result <- validate_config(bad_config, my_schema)
result$valid
#> [1] FALSE
result$errors
#> [1] "/n_samples: must be >= 1"

# Apply defaults to fill in missing values
partial_config <- list(n_samples = 200L)
full_config <- apply_config_defaults(partial_config, my_schema)
full_config$n_samples      # User value preserved
#> [1] 200
full_config$learning_rate  # Default applied
#> [1] 0.01
full_config$method         # Default applied
#> [1] "arima"

Available Schema Types

Function           Description               Key Options
schema_integer()   Integer values            minimum, maximum, default
schema_number()    Numeric (float) values    minimum, maximum, default
schema_string()    String values             min_length, max_length, pattern, default
schema_boolean()   TRUE/FALSE values         default
schema_enum()      One of fixed choices      values (required), default
schema_array()     Arrays/lists              items, min_items, max_items, default

Adding Config Schema to MLproject

Once you have defined a configuration schema, you can include it in your MLproject file. This exposes your model’s configuration options to chap-core users:

generate_mlproject(
  model_name = "my_mean_model",
  config_schema = my_schema
)

This adds a user_options section to the MLproject file:

name: my_mean_model
renv_env: renv.lock
user_options:
  n_samples:
    type: integer
    description: Number of Monte Carlo samples for predictions
    default: 100
  learning_rate:
    type: float
    description: Learning rate for optimization
    default: 0.01
  method:
    type: str
    description: Forecasting method to use
    default: arima
entry_points:
  train:
    parameters:
      train_data: str
      model: str
    command: Rscript model.R train --data {train_data} --model {model}
  predict:
    parameters:
      historic_data: str
      future_data: str
      model: str
      out_file: str
    command: Rscript model.R predict --historic {historic_data} --future {future_data}
      --model {model} --output {out_file}

The user_options section allows chap-core to present configuration options to users and pass them to your model via a config file.

Summary

The development workflow is:

  1. Explore example data with get_example_data()
  2. Validate with stubs using validate_model_io() to understand requirements
  3. Implement your train and predict functions
  4. Validate the implementation
  5. Test against all datasets with validate_model_io_all()
  6. Deploy with create_chap_cli()

Next Steps

  • See examples/ewars_model/ for a more complex example with configuration
  • See examples/arima_model/ for a complete example with renv and MLproject integration
  • Read about MLproject generation in ?generate_mlproject
  • Read about configuration schemas in ?create_config_schema
  • Explore spatial-temporal utilities in ?aggregate_temporal