Getting Started with chapr

Installation

Install the development version from GitHub:

# install.packages("remotes")
remotes::install_github("dhis2-chap/chap_r_sdk")

Note: If prompted for GitHub authentication, you can skip it by pressing Enter (the repository is public).

Load the package:

library(chapr)

What is the Chap R SDK?

The chapr package provides infrastructure for developing disease forecasting models compatible with the Chap platform. Chap (Climate Health Analytics Platform) enables health ministries to run predictive models for disease surveillance.

This SDK simplifies model development by handling:

CLI creation: Command-line interfaces for train/predict workflows
File I/O: Automatic CSV loading, tsibble conversion, output formatting
Configuration: YAML/JSON config parsing with schema validation
Validation: Test suites to verify Chap compatibility

Quick Start

The recommended pattern uses create_chap_cli() to create a complete command-line interface:

library(chapr)
library(dplyr)

# Define training function - receives loaded tsibble, not file paths
train_my_model <- function(training_data, model_configuration = list(), run_info = list()) {
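  # Convert to a plain tibble first: summarise() on a tsibble keeps the
  # time index, which would give one mean per period rather than per location
  means <- training_data |>
    as_tibble() |>
    group_by(location) |>
    summarise(mean_cases = mean(disease_cases, na.rm = TRUE))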

  return(list(means = means))
}

# Define prediction function - all inputs already loaded
predict_my_model <- function(historic_data, future_data, saved_model,
                              model_configuration = list(), run_info = list()) {
  predictions <- future_data |>
    left_join(saved_model$means, by = "location") |>
    mutate(samples = purrr::map(mean_cases, ~c(.x))) |>
    select(-mean_cases)

  return(predictions)
}

# Enable CLI with one function call
if (!interactive()) {
  create_chap_cli(train_my_model, predict_my_model)
}

Command Line Usage

Save the code above as model.R, then run it from the command line.

In terminal:

# Train the model
Rscript model.R train --data training_data.csv

# Generate predictions
Rscript model.R predict --historic historic.csv --future future.csv \
    --output predictions.csv

# Display model information
Rscript model.R info

Model Function Interface

Your model needs two functions:

Training Function

train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
  # training_data: tsibble with time_period index, location key, disease_cases
  # model_configuration: optional list of parameters from config file
  # run_info: runtime info from Chap (prediction_length, additional_continuous_covariates, etc.)
  # Returns: model object (saved as RDS)
}

Prediction Function

predict_fn <- function(historic_data, future_data, saved_model,
                       model_configuration = list(), run_info = list()) {
  # historic_data: tsibble with historical observations
  # future_data: tsibble with time periods to predict (no disease_cases)
  # saved_model: object returned by train_fn
  # model_configuration: optional list of parameters from config file
  # run_info: runtime info from Chap
  # Returns: tibble with samples list-column
}

Important: historic_data may contain more recent observations than the data the model was trained on, so time series models should refit to historic_data before forecasting.
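
One way to follow this advice, sketched in base R for the mean model from the Quick Start: the prediction function recomputes the per-location means from historic_data and ignores the values stored in saved_model. The name predict_with_refit is hypothetical, and plain data frames stand in for the tsibble and tibble objects the SDK works with.

```r
# Hypothetical prediction function that refits on historic_data.
# Assumes historic_data has `location` and `disease_cases` columns.
predict_with_refit <- function(historic_data, future_data, saved_model,
                               model_configuration = list(), run_info = list()) {
  historic <- as.data.frame(historic_data)

  # Refit: per-location means from the most recent observations,
  # not the (possibly stale) means stored in saved_model
  cases_by_location <- split(historic$disease_cases, as.character(historic$location))
  means <- vapply(cases_by_location, mean, numeric(1), na.rm = TRUE)

  predictions <- as.data.frame(future_data)
  # One single-value sample vector per row; NA if a location has no history
  predictions$samples <- lapply(
    as.character(predictions$location),
    function(loc) unname(means[loc])
  )
  predictions
}

historic <- data.frame(
  time_period = c("2023-01", "2023-02", "2023-03"),
  location = "LocationA",
  disease_cases = c(45, 52, 59)
)
future <- data.frame(time_period = "2023-04", location = "LocationA")
stale_model <- list(means = data.frame(location = "LocationA", mean_cases = 48.5))

result <- predict_with_refit(historic, future, stale_model)
result$samples[[1]]
#> [1] 52
```

The forecast reflects the three observations in historic, not the 48.5 stored at training time.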

What the SDK Handles

You don’t need to write code for:

Task	How the SDK handles it
Loading CSV files	`readr::read_csv()`
Converting to tsibbles	`tsibble::as_tsibble()`
Detecting time columns	Finds `time_period`, `date`, `week`, etc.
Detecting key columns	Finds `location`, `region`, etc.
Loading/saving models	`readRDS()` / `saveRDS()`
Parsing configs	`yaml::yaml.load_file()`

Your functions contain only business logic, with no file I/O boilerplate.
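
For a sense of what that boilerplate looks like, here is a rough base-R sketch of the steps around a training call: read the CSV, call the function, save the model, and load it again before prediction. It is an illustration only. The SDK itself uses the functions listed in the table, and train_mean_model and the temporary file path are made up for this example.

```r
# Hypothetical stand-in for a user training function (base R only)
train_mean_model <- function(training_data, model_configuration = list(), run_info = list()) {
  means <- aggregate(disease_cases ~ location, data = training_data, FUN = mean)
  names(means)[names(means) == "disease_cases"] <- "mean_cases"
  list(means = means)
}

# 1. Load the CSV (inline text here instead of a file on disk)
csv_text <- "time_period,location,disease_cases
2023-01,LocationA,45
2023-02,LocationA,52
2023-01,LocationB,78"
training_data <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# 2. Call the user function with the loaded data
model <- train_mean_model(training_data)

# 3. Persist the model between the train and predict commands
model_path <- tempfile(fileext = ".rds")
saveRDS(model, model_path)

# 4. Reload it when predicting
saved_model <- readRDS(model_path)
print(saved_model$means)
```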

Data Format

Training/Historic Data

CSV with time, location, target, and covariates:

time_period,location,disease_cases,population,rainfall
2023-01,LocationA,45,10000,120.5
2023-02,LocationA,52,10000,85.2
2023-01,LocationB,78,15000,130.1

Future Data

Same structure without the target variable:

time_period,location,population,rainfall
2023-05,LocationA,10000,95.0
2023-06,LocationA,10000,110.3
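
A quick way to check a pair of files against this layout is to compare their column names. The sketch below uses base R and a hypothetical helper, check_future_columns(). It is not part of chapr, and it assumes the target column is named disease_cases as in the examples above.

```r
# Hypothetical helper: future data should have the training columns
# minus the target, and must not contain the target itself.
check_future_columns <- function(training_data, future_data, target = "disease_cases") {
  expected <- setdiff(names(training_data), target)
  missing_cols <- setdiff(expected, names(future_data))
  if (target %in% names(future_data)) {
    stop("future data must not contain the target column: ", target)
  }
  if (length(missing_cols) > 0) {
    stop("future data is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

training_data <- read.csv(text = "time_period,location,disease_cases,population,rainfall
2023-01,LocationA,45,10000,120.5
2023-02,LocationA,52,10000,85.2")

future_data <- read.csv(text = "time_period,location,population,rainfall
2023-05,LocationA,10000,95.0")

check_future_columns(training_data, future_data)  # passes silently
```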

Prediction Output

Tibble with samples list-column containing numeric vectors:

# Deterministic: single value per row
tibble(
  time_period = "2023-05",
  location = "LocationA",
  samples = list(c(42))
)

# Probabilistic: multiple Monte Carlo samples
tibble(
  time_period = "2023-05",
  location = "LocationA",
  samples = list(rpois(1000, lambda = 42))
)
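
Downstream code can reduce each samples vector to summary statistics. The sketch below, in base R, computes a median and a 90% interval per row. A data frame with a list-column stands in for the tibble, and summarise_samples() is a hypothetical helper, not a chapr function.

```r
# Hypothetical helper: median and 90% interval for each row's samples
summarise_samples <- function(predictions) {
  q <- t(vapply(
    predictions$samples,
    function(s) quantile(s, probs = c(0.05, 0.5, 0.95), names = FALSE),
    numeric(3)
  ))
  out <- predictions[setdiff(names(predictions), "samples")]
  out$lower <- q[, 1]
  out$median <- q[, 2]
  out$upper <- q[, 3]
  out
}

set.seed(1)
predictions <- data.frame(
  time_period = c("2023-05", "2023-06"),
  location = "LocationA"
)
# Row 1 is deterministic (one value); row 2 holds 1000 Poisson draws
predictions$samples <- list(c(42), rpois(1000, lambda = 42))

summary_table <- summarise_samples(predictions)
print(summary_table)
```

For the deterministic row, all three statistics collapse to the single value 42; for the probabilistic row they describe the spread of the draws.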

Configuration

Reading Configuration Files

# Via CLI (in terminal): the config is passed automatically to your functions
Rscript model.R train --data data.csv --config config.yaml

# Or read it manually (in R):
config <- read_model_config("config.yaml")

Safe Parameter Extraction

config <- list(
  model = list(
    params = list(learning_rate = 0.01, epochs = 100)
  )
)

# Extract nested parameters with defaults
lr <- get_config_param(config, "model", "params", "learning_rate", .default = 0.001)
print(lr)
#> [1] 0.01

# Returns default if path not found
missing <- get_config_param(config, "model", "missing", .default = "default")
print(missing)
#> [1] "default"
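
To make the lookup behaviour concrete, here is a base-R sketch of how a nested lookup with a default can be written. get_nested() is hypothetical and is not chapr's implementation of get_config_param(); it only mirrors the behaviour shown above.

```r
# Hypothetical helper: walk a nested list by name, fall back to a default
get_nested <- function(config, ..., .default = NULL) {
  value <- config
  for (key in c(...)) {
    # Stop as soon as the path leaves the list structure or a key is absent
    if (!is.list(value) || is.null(value[[key]])) {
      return(.default)
    }
    value <- value[[key]]
  }
  value
}

config <- list(
  model = list(
    params = list(learning_rate = 0.01, epochs = 100)
  )
)

get_nested(config, "model", "params", "learning_rate", .default = 0.001)
#> [1] 0.01
get_nested(config, "model", "missing", .default = "default")
#> [1] "default"
```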

Configuration Schema

Define a schema for validation and the info subcommand:

config_schema <- create_config_schema(
  title = "My Model Configuration",
  properties = list(
    n_samples = schema_integer(default = 100L, minimum = 1L),
    learning_rate = schema_number(default = 0.01, minimum = 0, maximum = 1)
  )
)

create_chap_cli(train_fn, predict_fn, model_config_schema = config_schema)
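
Inside a model function, such values would typically be read from model_configuration. The sketch below assumes the configuration arrives as a flat named list whose names match the schema properties, which the text above does not confirm, and falls back to the schema default when n_samples is absent. predict_poisson() is a hypothetical example function and uses base R only.

```r
# Hypothetical prediction function that honours an `n_samples` setting.
# Assumes saved_model$means has `location` and `mean_cases` columns.
predict_poisson <- function(historic_data, future_data, saved_model,
                            model_configuration = list(), run_info = list()) {
  # [[ ]] avoids the partial name matching that $ performs on lists
  n_samples <- model_configuration[["n_samples"]]
  if (is.null(n_samples)) n_samples <- 100L  # same default as the schema

  predictions <- as.data.frame(future_data)
  idx <- match(predictions$location, saved_model$means$location)
  lambdas <- saved_model$means$mean_cases[idx]

  # n_samples Poisson draws per row, stored as a list-column
  predictions$samples <- lapply(lambdas, function(l) rpois(n_samples, lambda = l))
  predictions
}

saved_model <- list(means = data.frame(location = "LocationA", mean_cases = 42))
future <- data.frame(time_period = c("2023-05", "2023-06"), location = "LocationA")

set.seed(1)
default_run <- predict_poisson(NULL, future, saved_model)
custom_run <- predict_poisson(NULL, future, saved_model,
                              model_configuration = list(n_samples = 5L))
lengths(custom_run$samples)
#> [1] 5 5
```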

Next Steps

Building Your First Chap Model: Step-by-step tutorial with validation
Working with Spatio-Temporal Data: Utilities for aggregation and transformation
Function Reference: Complete API documentation

Getting Help

Issues: GitHub Issues
CHAP Platform: github.com/dhis2/chap-core