This tutorial walks you through building a Chap-compatible model step-by-step, using a validation-first approach to ensure your model works correctly before deploying it.

Setup

library(chapr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Step 1: Understand the Function Interface

Every Chap model requires two functions:

Training function:

train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
  # training_data: tsibble with time_period index, location key,
  #                disease_cases, and covariates
  # model_configuration: optional list of parameters
  # run_info: runtime info from Chap (prediction_length, additional_continuous_covariates)
  # Returns: any model object (list, fitted model, etc.)
}

Prediction function:

predict_fn <- function(historic_data, future_data, saved_model,
                       model_configuration = list(), run_info = list()) {
  # historic_data: tsibble with historical observations
  # future_data: tsibble with time periods to predict
  # saved_model: the object returned by train_fn
  # run_info: runtime info from Chap
  # Returns: tibble with samples list-column containing numeric vectors
  #   - For deterministic models: single sample per row (e.g., samples = list(c(42)))
  #   - For probabilistic models: multiple samples per row (e.g., 1000 samples)
  #
  # IMPORTANT: historic_data may contain more recent data than training_data.
  # For time series models, you should refit to historic_data before forecasting.
  # Use saved_model for hyperparameters/structure, not the fitted model itself.
  # See examples/arima_model/ for a demonstration of this pattern.
}
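The required return shape can be illustrated without any model at all. Here is a minimal base-R sketch (the values are placeholders; only the column names `time_period`, `location`, and `samples` come from the interface above):

```r
# A one-row prediction in the required shape: a data frame whose
# `samples` column is a list of numeric vectors.
preds <- data.frame(
  time_period = "2013-04",
  location    = "Bokeo",
  stringsAsFactors = FALSE
)
preds$samples <- list(c(42))  # deterministic: a single sample per row

is.list(preds$samples)          # samples is a list-column
is.numeric(preds$samples[[1]])  # each element is a numeric vector
```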

Step 2: Explore the Example Data

The SDK provides example datasets for testing. Let’s examine the Laos monthly data:

data <- get_example_data('laos', 'M')
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
names(data)
#> [1] "training_data" "historic_data" "future_data"   "predictions"

The example data contains four tsibbles. Each has time_period as the index and location as the key:

Training data - what your model learns from:

data$training_data
#> # A tsibble: 1,057 x 12 [1M]
#> # Key:       location [7]
#>    time_period rainfall mean_temperature disease_cases population parent
#>          <mth>    <dbl>            <dbl>         <dbl>      <dbl> <chr> 
#>  1    2000 Jul   430.               23.4             0     58503. -     
#>  2    2000 Aug   322.               23.8             0     58503. -     
#>  3    2000 Sep   265.               22.7             0     58503. -     
#>  4    2000 Oct   103.               22.6             0     58503. -     
#>  5    2000 Nov    19.7              20.3             0     58503. -     
#>  6    2000 Dec    26.0              19.1             0     58503. -     
#>  7    2001 Jan    17.6              19.8             0     60157. -     
#>  8    2001 Feb     7.28             22.0             0     60157. -     
#>  9    2001 Mar   123.               22.6             0     60157. -     
#> 10    2001 Apr    29.6              27.5             0     60157. -     
#> # ℹ 1,047 more rows
#> # ℹ 6 more variables: location <chr>, Cases <dbl>, E <dbl>, month <dbl>,
#> #   ID_year <dbl>, ID_spat <chr>

Historic data - observations available at prediction time:

data$historic_data
#> # A tsibble: 1,071 x 12 [1M]
#> # Key:       location [7]
#>    time_period rainfall mean_temperature disease_cases population parent
#>          <mth>    <dbl>            <dbl>         <dbl>      <dbl> <chr> 
#>  1    2000 Jul   430.               23.4             0     58503. -     
#>  2    2000 Aug   322.               23.8             0     58503. -     
#>  3    2000 Sep   265.               22.7             0     58503. -     
#>  4    2000 Oct   103.               22.6             0     58503. -     
#>  5    2000 Nov    19.7              20.3             0     58503. -     
#>  6    2000 Dec    26.0              19.1             0     58503. -     
#>  7    2001 Jan    17.6              19.8             0     60157. -     
#>  8    2001 Feb     7.28             22.0             0     60157. -     
#>  9    2001 Mar   123.               22.6             0     60157. -     
#> 10    2001 Apr    29.6              27.5             0     60157. -     
#> # ℹ 1,061 more rows
#> # ℹ 6 more variables: location <chr>, Cases <dbl>, E <dbl>, month <dbl>,
#> #   ID_year <dbl>, ID_spat <chr>

Notice that historic_data extends beyond training_data:

# Training data ends at:
max(data$training_data$time_period)
#> <yearmonth[1]>
#> [1] "2013 Jan"

# Historic data ends at:
max(data$historic_data$time_period)
#> <yearmonth[1]>
#> [1] "2013 Mar"

This is a key concept: when Chap calls your prediction function, historic_data may contain more recent observations than what the model was trained on. For time series models (ARIMA, exponential smoothing, etc.), you should refit the model to historic_data before forecasting. See examples/arima_model/ for a demonstration of this pattern.
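The refit pattern can be sketched with base R's `stats::arima` standing in for whatever model your training function chose (the AR(1) order and simulated series here are illustrative, not part of the SDK):

```r
# train stores only structure/hyperparameters, not a fitted object.
train_sketch <- function(training_series) {
  list(order = c(1, 0, 0))  # e.g. an AR(1) specification
}

# predict refits that structure to the (possibly newer) historic data.
predict_sketch <- function(historic_series, saved_model, h) {
  refit <- stats::arima(historic_series, order = saved_model$order)
  as.numeric(predict(refit, n.ahead = h)$pred)
}

set.seed(42)
saved <- train_sketch(NULL)
fc <- predict_sketch(arima.sim(list(ar = 0.5), n = 120), saved, h = 3)
length(fc)  # one point forecast per horizon step
```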

Future data - time periods to predict (no disease_cases):

data$future_data
#> # A tsibble: 21 x 10 [1M]
#> # Key:       location [7]
#>    time_period rainfall mean_temperature population parent location      E month
#>          <mth>    <dbl>            <dbl>      <dbl> <chr>  <chr>     <dbl> <dbl>
#>  1    2013 Apr     39.5             26.8     80014. -      Bokeo    8.00e4     4
#>  2    2013 May    170.              25.8     80014. -      Bokeo    8.00e4     5
#>  3    2013 Jun    231.              24.7     80014. -      Bokeo    8.00e4     6
#>  4    2013 Apr    152.              27.2    731598. -      Champas… 7.32e5     4
#>  5    2013 May    236.              26.3    731598. -      Champas… 7.32e5     5
#>  6    2013 Jun    327.              25.1    731598. -      Champas… 7.32e5     6
#>  7    2013 Apr     58.8             25.1    124396. -      LouangN… 1.24e5     4
#>  8    2013 May    162.              24.7    124396. -      LouangN… 1.24e5     5
#>  9    2013 Jun    184.              24.0    124396. -      LouangN… 1.24e5     6
#> 10    2013 Apr     72.5             25.0    282683. -      Oudomxai 2.83e5     4
#> # ℹ 11 more rows
#> # ℹ 2 more variables: ID_year <dbl>, ID_spat <chr>

Example predictions - what your model should output:

data$predictions
#> # A tibble: 21 × 3
#>    time_period location     samples      
#>    <chr>       <chr>        <list>       
#>  1 2013-04     Bokeo        <dbl [1,000]>
#>  2 2013-05     Bokeo        <dbl [1,000]>
#>  3 2013-06     Bokeo        <dbl [1,000]>
#>  4 2013-04     Champasak    <dbl [1,000]>
#>  5 2013-05     Champasak    <dbl [1,000]>
#>  6 2013-06     Champasak    <dbl [1,000]>
#>  7 2013-04     LouangNamtha <dbl [1,000]>
#>  8 2013-05     LouangNamtha <dbl [1,000]>
#>  9 2013-06     LouangNamtha <dbl [1,000]>
#> 10 2013-04     Oudomxai     <dbl [1,000]>
#> # ℹ 11 more rows

The predictions tibble has a samples list-column where each element is a numeric vector. Let’s look at the structure:

# Each row has a vector of samples
data$predictions$samples[[1]]
#>    [1]  9  5 46  5  3  8  1  9  9 14  3  7  6  7  3 10 11  2 17 13 12  0  3  1
#>   [25] 10 38 19  7 15  0  1 11  1  8  7 43 11  3  4  8  1  4  3  7 24  4  1  6
#>   [49] 11  3  9  0  0 16  0 11 17  1  3  8 13  5  5 22  2  1  5 26 11 31  4  3
#>   ... (output truncated; the full vector holds 1,000 samples)

For probabilistic models, each vector contains multiple Monte Carlo samples (e.g., 1000). For deterministic models, use a single sample per row: samples = list(c(42)).
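Both shapes are ordinary list-columns; only the vector length differs. A quick base-R illustration (the Poisson rate of 8 is arbitrary):

```r
set.seed(1)
# Deterministic: one sample per row.
det_samples  <- list(c(42))
# Probabilistic: many Monte Carlo draws per row.
prob_samples <- list(rpois(1000, lambda = 8))

lengths(det_samples)   # 1
lengths(prob_samples)  # 1000
```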

Step 3: Validate Before Implementing

Before writing any model logic, let’s see what the validation expects. Start with stub functions:

train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
  list(dummy = 1)
}

predict_fn <- function(historic_data, future_data, saved_model,
                       model_configuration = list(), run_info = list()) {
  future_data
}

result <- validate_model_io(train_fn, predict_fn, data)
result$success
#> [1] FALSE
result$errors
#> [1] "Predictions must have a 'samples' list-column containing numeric vectors"

The validation tells us exactly what’s missing: the samples list-column in predictions.

Step 4: Implement a Simple Mean Model

Now let’s implement a minimal model that predicts the historical mean for each location. Since all models must return a samples list-column, we wrap the single prediction value in a list:

train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
  means <- training_data |>
    as_tibble() |>
    summarise(mean_cases = mean(disease_cases, na.rm = TRUE), .by = location)
  list(means = means)
}

predict_fn <- function(historic_data, future_data, saved_model,
                       model_configuration = list(), run_info = list()) {
  future_data |>
    left_join(saved_model$means, by = "location") |>
    mutate(samples = purrr::map(mean_cases, ~c(.x))) |>
    select(-mean_cases)
}

Note: We call as_tibble() in the training function because summarising a tsibble preserves its time index; dropping to a plain tibble first lets summarise(.by = location) collapse across time. The prediction function works directly on tsibbles, since left_join() and mutate() preserve the tsibble structure.

Step 5: Validate the Implementation

result <- validate_model_io(train_fn, predict_fn, data)
result$success
#> [1] TRUE
result$n_predictions
#> [1] 21

The validation passes, and the model produced 21 predictions (3 months × 7 locations).

Step 6: Validate Against All Datasets

The SDK can validate against all available example datasets:

result <- validate_model_io_all(train_fn, predict_fn)
result$success
#> [1] TRUE
names(result$results)
#> [1] "laos_M"

Step 7: Create the CLI

Once validation passes, wrap your model in a CLI.

First, create a new directory for your model project:

In terminal:

mkdir my_model
cd my_model

Then create a file called model.R inside this directory:

library(chapr)
library(dplyr)

train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
  means <- training_data |>
    as_tibble() |>
    summarise(mean_cases = mean(disease_cases, na.rm = TRUE), .by = location)
  list(means = means)
}

predict_fn <- function(historic_data, future_data, saved_model,
                       model_configuration = list(), run_info = list()) {
  future_data |>
    left_join(saved_model$means, by = "location") |>
    mutate(samples = purrr::map(mean_cases, ~c(.x))) |>
    select(-mean_cases)
}

if (!interactive()) {
  create_chap_cli(train_fn, predict_fn)
}

Step 8: Use the CLI

Your model is now ready for command-line use. First, export the example data to CSV files for testing.

In R:

# Export example data for testing
data <- get_example_data('laos', 'M')
write.csv(as.data.frame(data$training_data), "training_data.csv", row.names = FALSE)
write.csv(as.data.frame(data$historic_data), "historic.csv", row.names = FALSE)
write.csv(as.data.frame(data$future_data), "future.csv", row.names = FALSE)

Now you can test the CLI.

In terminal:

# Train the model
Rscript model.R train --data training_data.csv

# Generate predictions
Rscript model.R predict --historic historic.csv --future future.csv \
    --output predictions.csv

# Display model info
Rscript model.R info

The CLI automatically handles:

  • Loading CSV files
  • Converting to tsibbles
  • Saving the model as RDS
  • Writing predictions to CSV (converting nested samples to wide format)

Step 9: Create MLproject for chap-core

To run your model with chap-core, you need an MLproject file. The SDK can generate this automatically using generate_mlproject().

Prerequisites

First, set up renv for reproducible dependencies:

renv::init()
renv::install("dhis2-chap/chap_r_sdk")
renv::snapshot()

Generate the MLproject File

library(chapr)

generate_mlproject(model_name = "my_mean_model")

This creates an MLproject file like:

name: my_mean_model
renv_env: renv.lock
entry_points:
  train:
    parameters:
      train_data: str
      model: str
    command: Rscript model.R train --data {train_data} --model {model}
  predict:
    parameters:
      historic_data: str
      future_data: str
      model: str
      out_file: str
    command: Rscript model.R predict --historic {historic_data} --future {future_data}
      --model {model} --output {out_file}

Run with chap-core

Once you have the MLproject and renv.lock files, run your model with chap-core:

chap evaluate --model-name ./my_model_directory --dataset-csv data.csv

See examples/arima_model/ for a complete example with renv and MLproject.

Probabilistic Models

For probabilistic forecasting, include multiple Monte Carlo samples instead of a single value:

predict_fn <- function(historic_data, future_data, saved_model,
                       model_configuration = list(), run_info = list()) {
  n_samples <- 1000

  future_data |>
    left_join(saved_model$means, by = "location") |>
    rowwise() |>
    mutate(
      # Generate 1000 samples from Poisson distribution
      samples = list(rpois(n_samples, lambda = mean_cases))
    ) |>
    ungroup() |>
    select(-mean_cases)
}

The samples column is a list-column where each element is a numeric vector. The CLI automatically converts this to wide CSV format (sample_0, sample_1, …) for Chap.
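That wide layout can be sketched in base R (column names follow the sample_0, sample_1, … convention just described; this helper is illustrative, not the SDK's implementation):

```r
# One prediction row with three samples, converted to wide format.
preds <- data.frame(time_period = "2013-04", location = "Bokeo")
preds$samples <- list(c(9, 5, 46))

sample_mat <- do.call(rbind, preds$samples)  # one row per prediction
colnames(sample_mat) <- paste0("sample_", seq_len(ncol(sample_mat)) - 1)
wide <- cbind(preds[c("time_period", "location")], as.data.frame(sample_mat))

names(wide)  # time_period, location, sample_0, sample_1, sample_2
```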

Working with Samples

The SDK provides utility functions for working with sample-based predictions:

# Convert nested samples to wide format
wide_preds <- predictions_to_wide(nested_preds)

# Convert to long format for scoringutils
long_preds <- predictions_to_long(nested_preds)

# Compute quantiles for hub submissions
quantile_preds <- predictions_to_quantiles(nested_preds)

# Add summary statistics (mean, median, CIs)
preds_with_summary <- predictions_summary(nested_preds)

Configuration Schemas

You can define a configuration schema to validate user-provided settings and provide default values. The SDK uses JSON Schema (draft-07) for validation.

Defining a Schema

Use the schema helper functions to define type-safe configuration options:

my_schema <- create_config_schema(
  title = "My Model Configuration",
  description = "Configuration options for my forecasting model",
  properties = list(
    # Integer with range constraints
    n_samples = schema_integer(
      description = "Number of Monte Carlo samples for predictions",
      default = 100L,
      minimum = 1L,
      maximum = 10000L
    ),
    # Number (float) with bounds
    learning_rate = schema_number(
      description = "Learning rate for optimization",
      default = 0.01,
      minimum = 0,
      maximum = 1
    ),
    # Enum: one of a fixed set of values
    method = schema_enum(
      values = c("arima", "ets", "prophet"),
      description = "Forecasting method to use",
      default = "arima"
    ),
    # Boolean flag
    use_covariates = schema_boolean(
      description = "Whether to include covariates in the model",
      default = TRUE
    ),
    # String with optional pattern constraint
    date_format = schema_string(
      description = "Date format for output",
      default = "%Y-%m-%d"
    ),
    # Array of values
    lag_values = schema_array(
      items = list(type = "integer"),
      description = "Lag periods to include",
      default = list(1L, 2L, 3L)
    )
  ),
  required = c("n_samples")  # Mark required fields
)

# View the schema
print(my_schema)
#> Chap Configuration Schema
#> =========================
#> 
#> Title: My Model Configuration 
#> Description: Configuration options for my forecasting model 
#> 
#> Properties:
#>   n_samples * (integer) [default: 100]
#>     Number of Monte Carlo samples for predictions
#>   learning_rate (number) [default: 0.01]
#>     Learning rate for optimization
#>   method (enum(arima, ets, prophet)) [default: "arima"]
#>     Forecasting method to use
#>   use_covariates (boolean) [default: true]
#>     Whether to include covariates in the model
#>   date_format (string) [default: "%Y-%m-%d"]
#>     Date format for output
#>   lag_values (array) [default: [1,2,3]]
#>     Lag periods to include
#> 
#> * = required

Using the Schema with CLI

Pass the schema to create_chap_cli() to enable automatic validation:

if (!interactive()) {
  create_chap_cli(train_fn, predict_fn, model_config_schema = my_schema)
}

When a user provides a configuration file (YAML or JSON), the CLI will:

  1. Validate the config against the schema (type checking, range constraints, enum values)
  2. Apply defaults for any missing optional parameters
  3. Report errors with clear messages if validation fails

Example config.yaml:

n_samples: 500
learning_rate: 0.05
method: ets

Manual Validation

You can also validate configurations manually:

# Valid configuration
config <- list(n_samples = 500L, method = "ets")
result <- validate_config(config, my_schema)
result$valid
#> [1] TRUE

# Invalid configuration (value out of range)
bad_config <- list(n_samples = -5L)
result <- validate_config(bad_config, my_schema)
result$valid
#> [1] FALSE
result$errors
#> [1] "/n_samples: must be >= 1"

# Apply defaults to fill in missing values
partial_config <- list(n_samples = 200L)
full_config <- apply_config_defaults(partial_config, my_schema)
full_config$n_samples      # User value preserved
#> [1] 200
full_config$learning_rate  # Default applied
#> [1] 0.01
full_config$method         # Default applied
#> [1] "arima"

Available Schema Types

Function           Description               Key Options
schema_integer()   Integer values            minimum, maximum, default
schema_number()    Numeric (float) values    minimum, maximum, default
schema_string()    String values             min_length, max_length, pattern, default
schema_boolean()   TRUE/FALSE values         default
schema_enum()      One of fixed choices      values (required), default
schema_array()     Arrays/lists              items, min_items, max_items, default

Adding Config Schema to MLproject

Once you have defined a configuration schema, you can include it in your MLproject file. This exposes your model’s configuration options to chap-core users:

generate_mlproject(
  model_name = "my_mean_model",
  config_schema = my_schema
)

This adds a user_options section to the MLproject file:

name: my_mean_model
renv_env: renv.lock
user_options:
  n_samples:
    type: integer
    description: Number of Monte Carlo samples for predictions
    default: 100
  learning_rate:
    type: float
    description: Learning rate for optimization
    default: 0.01
  method:
    type: str
    description: Forecasting method to use
    default: arima
entry_points:
  train:
    parameters:
      train_data: str
      model: str
    command: Rscript model.R train --data {train_data} --model {model}
  predict:
    parameters:
      historic_data: str
      future_data: str
      model: str
      out_file: str
    command: Rscript model.R predict --historic {historic_data} --future {future_data}
      --model {model} --output {out_file}

The user_options section allows chap-core to present configuration options to users and pass them to your model via a config file.

Summary

The development workflow is:

  1. Explore example data with get_example_data()
  2. Validate with stubs using validate_model_io() to understand requirements
  3. Implement your train and predict functions
  4. Validate the implementation
  5. Test against all datasets with validate_model_io_all()
  6. Deploy with create_chap_cli()

Next Steps

  • See examples/ewars_model/ for a more complex example with configuration
  • See examples/arima_model/ for a complete example with renv and MLproject integration
  • Read about MLproject generation in ?generate_mlproject
  • Read about configuration schemas in ?create_config_schema
  • Explore spatial-temporal utilities in ?aggregate_temporal