Building Your First Chap Model
Source: vignettes/model-development-tutorial.Rmd

This tutorial walks you through building a Chap-compatible model step by step, using a validation-first approach to ensure your model works correctly before deploying it.
Step 1: Understand the Function Interface
Every Chap model requires two functions:
Training function:
train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
# training_data: tsibble with time_period index, location key,
# disease_cases, and covariates
# model_configuration: optional list of parameters
# run_info: runtime info from Chap (prediction_length, additional_continuous_covariates)
# Returns: any model object (list, fitted model, etc.)
}

Prediction function:
predict_fn <- function(historic_data, future_data, saved_model,
model_configuration = list(), run_info = list()) {
# historic_data: tsibble with historical observations
# future_data: tsibble with time periods to predict
# saved_model: the object returned by train_fn
# run_info: runtime info from Chap
# Returns: tibble with samples list-column containing numeric vectors
# - For deterministic models: single sample per row (e.g., samples = list(c(42)))
# - For probabilistic models: multiple samples per row (e.g., 1000 samples)
#
# IMPORTANT: historic_data may contain more recent data than training_data.
# For time series models, you should refit to historic_data before forecasting.
# Use saved_model for hyperparameters/structure, not the fitted model itself.
# See examples/arima_model/ for a demonstration of this pattern.
}

Step 2: Explore the Example Data
The SDK provides example datasets for testing. Let’s examine the Laos monthly data:
data <- get_example_data('laos', 'M')
#> Registered S3 method overwritten by 'tsibble':
#> method from
#> as_tibble.grouped_df dplyr
names(data)
#> [1] "training_data" "historic_data" "future_data" "predictions"

The example data contains four tsibbles. Each has
time_period as the index and location as the
key:
Training data - what your model learns from:
data$training_data
#> # A tsibble: 1,057 x 12 [1M]
#> # Key: location [7]
#> time_period rainfall mean_temperature disease_cases population parent
#> <mth> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2000 Jul 430. 23.4 0 58503. -
#> 2 2000 Aug 322. 23.8 0 58503. -
#> 3 2000 Sep 265. 22.7 0 58503. -
#> 4 2000 Oct 103. 22.6 0 58503. -
#> 5 2000 Nov 19.7 20.3 0 58503. -
#> 6 2000 Dec 26.0 19.1 0 58503. -
#> 7 2001 Jan 17.6 19.8 0 60157. -
#> 8 2001 Feb 7.28 22.0 0 60157. -
#> 9 2001 Mar 123. 22.6 0 60157. -
#> 10 2001 Apr 29.6 27.5 0 60157. -
#> # ℹ 1,047 more rows
#> # ℹ 6 more variables: location <chr>, Cases <dbl>, E <dbl>, month <dbl>,
#> #   ID_year <dbl>, ID_spat <chr>

Historic data - observations available at prediction time:
data$historic_data
#> # A tsibble: 1,071 x 12 [1M]
#> # Key: location [7]
#> time_period rainfall mean_temperature disease_cases population parent
#> <mth> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2000 Jul 430. 23.4 0 58503. -
#> 2 2000 Aug 322. 23.8 0 58503. -
#> 3 2000 Sep 265. 22.7 0 58503. -
#> 4 2000 Oct 103. 22.6 0 58503. -
#> 5 2000 Nov 19.7 20.3 0 58503. -
#> 6 2000 Dec 26.0 19.1 0 58503. -
#> 7 2001 Jan 17.6 19.8 0 60157. -
#> 8 2001 Feb 7.28 22.0 0 60157. -
#> 9 2001 Mar 123. 22.6 0 60157. -
#> 10 2001 Apr 29.6 27.5 0 60157. -
#> # ℹ 1,061 more rows
#> # ℹ 6 more variables: location <chr>, Cases <dbl>, E <dbl>, month <dbl>,
#> #   ID_year <dbl>, ID_spat <chr>

Notice that historic_data extends beyond
training_data:
# Training data ends at:
max(data$training_data$time_period)
#> <yearmonth[1]>
#> [1] "2013 Jan"
# Historic data ends at:
max(data$historic_data$time_period)
#> <yearmonth[1]>
#> [1] "2013 Mar"

This is a key concept: when Chap calls your prediction function,
historic_data may contain more recent
observations than what the model was trained on. For time
series models (ARIMA, exponential smoothing, etc.), you should
refit the model to historic_data before
forecasting. See examples/arima_model/ for a demonstration
of this pattern.
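As a sketch of this refit pattern (using base R's stats::arima for illustration rather than the actual examples/arima_model/ code, and assuming a single location for brevity):

```r
# Sketch of the refit pattern: train_fn stores only hyperparameters;
# predict_fn refits on historic_data so the newest observations are used.
train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
  list(order = c(1, 0, 0))  # e.g., an AR(1); in practice, select this during training
}

predict_fn <- function(historic_data, future_data, saved_model,
                       model_configuration = list(), run_info = list()) {
  y <- historic_data$disease_cases                   # includes observations newer than training_data
  fit <- stats::arima(y, order = saved_model$order)  # refit here, don't reuse the old fit
  fc <- predict(fit, n.ahead = nrow(future_data))
  future_data$samples <- lapply(as.numeric(fc$pred), function(x) c(x))
  future_data
}
```

A real model would group by location and draw multiple samples; the point is that saved_model carries the model structure while the fit itself is recomputed from historic_data.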
Future data - time periods to predict (no
disease_cases):
data$future_data
#> # A tsibble: 21 x 10 [1M]
#> # Key: location [7]
#> time_period rainfall mean_temperature population parent location E month
#> <mth> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 2013 Apr 39.5 26.8 80014. - Bokeo 8.00e4 4
#> 2 2013 May 170. 25.8 80014. - Bokeo 8.00e4 5
#> 3 2013 Jun 231. 24.7 80014. - Bokeo 8.00e4 6
#> 4 2013 Apr 152. 27.2 731598. - Champas… 7.32e5 4
#> 5 2013 May 236. 26.3 731598. - Champas… 7.32e5 5
#> 6 2013 Jun 327. 25.1 731598. - Champas… 7.32e5 6
#> 7 2013 Apr 58.8 25.1 124396. - LouangN… 1.24e5 4
#> 8 2013 May 162. 24.7 124396. - LouangN… 1.24e5 5
#> 9 2013 Jun 184. 24.0 124396. - LouangN… 1.24e5 6
#> 10 2013 Apr 72.5 25.0 282683. - Oudomxai 2.83e5 4
#> # ℹ 11 more rows
#> # ℹ 2 more variables: ID_year <dbl>, ID_spat <chr>

Example predictions - what your model should output:
data$predictions
#> # A tibble: 21 × 3
#> time_period location samples
#> <chr> <chr> <list>
#> 1 2013-04 Bokeo <dbl [1,000]>
#> 2 2013-05 Bokeo <dbl [1,000]>
#> 3 2013-06 Bokeo <dbl [1,000]>
#> 4 2013-04 Champasak <dbl [1,000]>
#> 5 2013-05 Champasak <dbl [1,000]>
#> 6 2013-06 Champasak <dbl [1,000]>
#> 7 2013-04 LouangNamtha <dbl [1,000]>
#> 8 2013-05 LouangNamtha <dbl [1,000]>
#> 9 2013-06 LouangNamtha <dbl [1,000]>
#> 10 2013-04 Oudomxai <dbl [1,000]>
#> # ℹ 11 more rows

The predictions tibble has a samples list-column where
each element is a numeric vector. Let’s look at the structure:
# Each row has a vector of samples
length(data$predictions$samples[[1]])
#> [1] 1000
head(data$predictions$samples[[1]], 24)
#>  [1]  9  5 46  5  3  8  1  9  9 14  3  7  6  7  3 10 11  2 17 13 12  0  3  1

For probabilistic models, each vector contains multiple Monte Carlo
samples (e.g., 1000). For deterministic models, use a single sample per
row: samples = list(c(42)).
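To make the two shapes concrete, here is a minimal base-R illustration (the values are arbitrary):

```r
# Deterministic: one value per row
det_samples <- list(c(42))
length(det_samples[[1]])
#> [1] 1

# Probabilistic: many Monte Carlo draws per row
prob_samples <- list(rpois(1000, lambda = 7))
length(prob_samples[[1]])
#> [1] 1000
```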
Step 3: Validate Before Implementing
Before writing any model logic, let’s see what the validation expects. Start with stub functions:
train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
list(dummy = 1)
}
predict_fn <- function(historic_data, future_data, saved_model,
model_configuration = list(), run_info = list()) {
future_data
}
result <- validate_model_io(train_fn, predict_fn, data)
result$success
#> [1] FALSE
result$errors
#> [1] "Predictions must have a 'samples' list-column containing numeric vectors"

The validation tells us exactly what’s missing: the
samples list-column in predictions.
Step 4: Implement a Simple Mean Model
Now let’s implement a minimal model that predicts the historical mean
for each location. Since all models must return a samples
list-column, we wrap the single prediction value in a list:
train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
means <- training_data |>
as_tibble() |>
summarise(mean_cases = mean(disease_cases, na.rm = TRUE), .by = location)
list(means = means)
}
predict_fn <- function(historic_data, future_data, saved_model,
model_configuration = list(), run_info = list()) {
future_data |>
left_join(saved_model$means, by = "location") |>
mutate(samples = purrr::map(mean_cases, ~c(.x))) |>
select(-mean_cases)
}

Note: We use as_tibble() in the training function
because summarise(.by = ...) needs a tibble to collapse
across the time dimension. The prediction function works directly on
tsibbles since left_join() and mutate()
preserve tsibble structure.
Step 5: Validate the Implementation
result <- validate_model_io(train_fn, predict_fn, data)
result$success
#> [1] TRUE
result$n_predictions
#> [1] 21

The validation passes and we generated 21 predictions.
Step 6: Validate Against All Datasets
The SDK can validate against all available example datasets:
result <- validate_model_io_all(train_fn, predict_fn)
result$success
#> [1] TRUE
names(result$results)
#> [1] "laos_M"

Step 7: Create the CLI
Once validation passes, wrap your model in a CLI.
First, create a new directory for your model project:
In terminal (for example):
mkdir my_mean_model
cd my_mean_model
Then create a file called model.R inside this
directory:
library(chapr)
library(dplyr)
train_fn <- function(training_data, model_configuration = list(), run_info = list()) {
means <- training_data |>
as_tibble() |>
summarise(mean_cases = mean(disease_cases, na.rm = TRUE), .by = location)
list(means = means)
}
predict_fn <- function(historic_data, future_data, saved_model,
model_configuration = list(), run_info = list()) {
future_data |>
left_join(saved_model$means, by = "location") |>
mutate(samples = purrr::map(mean_cases, ~c(.x))) |>
select(-mean_cases)
}
if (!interactive()) {
create_chap_cli(train_fn, predict_fn)
}

Step 8: Use the CLI
Your model is now ready for command-line use. First, export the example data to CSV files for testing.
In R:
# Export example data for testing
data <- get_example_data('laos', 'M')
write.csv(as.data.frame(data$training_data), "training_data.csv", row.names = FALSE)
write.csv(as.data.frame(data$historic_data), "historic.csv", row.names = FALSE)
write.csv(as.data.frame(data$future_data), "future.csv", row.names = FALSE)

Now you can test the CLI.
In terminal:
# Train the model
Rscript model.R train --data training_data.csv
# Generate predictions
Rscript model.R predict --historic historic.csv --future future.csv \
--output predictions.csv
# Display model info
Rscript model.R info

The CLI automatically handles:
- Loading CSV files
- Converting to tsibbles
- Saving the model as RDS
- Writing predictions to CSV (converting nested samples to wide format)
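To illustrate what that wide format looks like, here is a hypothetical three-sample row flattened by hand (the SDK's own conversion may differ in details):

```r
# Hypothetical illustration: one row's nested samples become
# sample_0, sample_1, ... columns in the output CSV.
samples <- list(c(9, 5, 46))
wide <- as.data.frame(t(samples[[1]]))
names(wide) <- paste0("sample_", seq_along(wide) - 1)
wide
#>   sample_0 sample_1 sample_2
#> 1        9        5       46
```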
Step 9: Create MLproject for chap-core
To run your model with chap-core, you need an MLproject file. The SDK
can generate this automatically using
generate_mlproject().
Generate the MLproject File
library(chapr)
generate_mlproject(model_name = "my_mean_model")

This creates an MLproject file like:
name: my_mean_model
renv_env: renv.lock
entry_points:
train:
parameters:
train_data: str
model: str
command: Rscript model.R train --data {train_data} --model {model}
predict:
parameters:
historic_data: str
future_data: str
model: str
out_file: str
command: Rscript model.R predict --historic {historic_data} --future {future_data}
--model {model} --output {out_file}

Probabilistic Models
For probabilistic forecasting, include multiple Monte Carlo samples instead of a single value:
predict_fn <- function(historic_data, future_data, saved_model,
model_configuration = list(), run_info = list()) {
n_samples <- 1000
future_data |>
left_join(saved_model$means, by = "location") |>
rowwise() |>
mutate(
# Generate 1000 samples from Poisson distribution
samples = list(rpois(n_samples, lambda = mean_cases))
) |>
ungroup() |>
select(-mean_cases)
}

The samples column is a list-column where each element
is a numeric vector. The CLI automatically converts this to wide CSV
format (sample_0, sample_1, …) for Chap.
Working with Samples
The SDK provides utility functions for working with sample-based predictions:
# Convert nested samples to wide format
wide_preds <- predictions_to_wide(nested_preds)
# Convert to long format for scoringutils
long_preds <- predictions_to_long(nested_preds)
# Compute quantiles for hub submissions
quantile_preds <- predictions_to_quantiles(nested_preds)
# Add summary statistics (mean, median, CIs)
preds_with_summary <- predictions_summary(nested_preds)

Configuration Schemas
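For intuition, the quantile conversion amounts to summarising each samples vector. An equivalent base-R computation for a single row might look like this (illustrative only, not the SDK implementation):

```r
# Derive forecast quantiles from one row's Monte Carlo samples
set.seed(42)
s <- rpois(1000, lambda = 7)  # stand-in for one element of the samples list-column
stats::quantile(s, probs = c(0.025, 0.5, 0.975))
```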
You can define a configuration schema to validate user-provided settings and provide default values. The SDK uses JSON Schema (draft-07) for validation.
Defining a Schema
Use the schema helper functions to define type-safe configuration options:
my_schema <- create_config_schema(
title = "My Model Configuration",
description = "Configuration options for my forecasting model",
properties = list(
# Integer with range constraints
n_samples = schema_integer(
description = "Number of Monte Carlo samples for predictions",
default = 100L,
minimum = 1L,
maximum = 10000L
),
# Number (float) with bounds
learning_rate = schema_number(
description = "Learning rate for optimization",
default = 0.01,
minimum = 0,
maximum = 1
),
# Enum: one of a fixed set of values
method = schema_enum(
values = c("arima", "ets", "prophet"),
description = "Forecasting method to use",
default = "arima"
),
# Boolean flag
use_covariates = schema_boolean(
description = "Whether to include covariates in the model",
default = TRUE
),
# String with optional pattern constraint
date_format = schema_string(
description = "Date format for output",
default = "%Y-%m-%d"
),
# Array of values
lag_values = schema_array(
items = list(type = "integer"),
description = "Lag periods to include",
default = list(1L, 2L, 3L)
)
),
required = c("n_samples") # Mark required fields
)
# View the schema
print(my_schema)
#> Chap Configuration Schema
#> =========================
#>
#> Title: My Model Configuration
#> Description: Configuration options for my forecasting model
#>
#> Properties:
#> n_samples * (integer) [default: 100]
#> Number of Monte Carlo samples for predictions
#> learning_rate (number) [default: 0.01]
#> Learning rate for optimization
#> method (enum(arima, ets, prophet)) [default: "arima"]
#> Forecasting method to use
#> use_covariates (boolean) [default: true]
#> Whether to include covariates in the model
#> date_format (string) [default: "%Y-%m-%d"]
#> Date format for output
#> lag_values (array) [default: [1,2,3]]
#> Lag periods to include
#>
#> * = required

Using the Schema with CLI
Pass the schema to create_chap_cli() to enable automatic
validation:
if (!interactive()) {
create_chap_cli(train_fn, predict_fn, model_config_schema = my_schema)
}

When a user provides a configuration file (YAML or JSON), the CLI will:
- Validate the config against the schema (type checking, range constraints, enum values)
- Apply defaults for any missing optional parameters
- Report errors with clear messages if validation fails
Example config.yaml:
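A plausible config.yaml for the schema defined above (illustrative values; any omitted optional fields get their defaults):

```yaml
n_samples: 500
method: ets
use_covariates: false
```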
Manual Validation
You can also validate configurations manually:
# Valid configuration
config <- list(n_samples = 500L, method = "ets")
result <- validate_config(config, my_schema)
result$valid
#> [1] TRUE
# Invalid configuration (value out of range)
bad_config <- list(n_samples = -5L)
result <- validate_config(bad_config, my_schema)
result$valid
#> [1] FALSE
result$errors
#> [1] "/n_samples: must be >= 1"
# Apply defaults to fill in missing values
partial_config <- list(n_samples = 200L)
full_config <- apply_config_defaults(partial_config, my_schema)
full_config$n_samples # User value preserved
#> [1] 200
full_config$learning_rate # Default applied
#> [1] 0.01
full_config$method # Default applied
#> [1] "arima"

Available Schema Types
| Function | Description | Key Options |
|---|---|---|
| schema_integer() | Integer values | minimum, maximum, default |
| schema_number() | Numeric (float) values | minimum, maximum, default |
| schema_string() | String values | min_length, max_length, pattern, default |
| schema_boolean() | TRUE/FALSE values | default |
| schema_enum() | One of fixed choices | values (required), default |
| schema_array() | Arrays/lists | items, min_items, max_items, default |
Adding Config Schema to MLproject
Once you have defined a configuration schema, you can include it in your MLproject file. This exposes your model’s configuration options to chap-core users:
generate_mlproject(
model_name = "my_mean_model",
config_schema = my_schema
)

This adds a user_options section to the MLproject
file:
name: my_mean_model
renv_env: renv.lock
user_options:
n_samples:
type: integer
description: Number of Monte Carlo samples for predictions
default: 100
learning_rate:
type: float
description: Learning rate for optimization
default: 0.01
method:
type: str
description: Forecasting method to use
default: arima
entry_points:
train:
parameters:
train_data: str
model: str
command: Rscript model.R train --data {train_data} --model {model}
predict:
parameters:
historic_data: str
future_data: str
model: str
out_file: str
command: Rscript model.R predict --historic {historic_data} --future {future_data}
--model {model} --output {out_file}

The user_options section allows chap-core to present
configuration options to users and pass them to your model via a config
file.
Summary
The development workflow is:

- Explore example data with get_example_data()
- Validate with stubs using validate_model_io() to understand requirements
- Implement your train and predict functions
- Validate the implementation
- Test against all datasets with validate_model_io_all()
- Deploy with create_chap_cli()
Next Steps

- See examples/ewars_model/ for a more complex example with configuration
- See examples/arima_model/ for a complete example with renv and MLproject integration
- Read about MLproject generation in ?generate_mlproject
- Read about configuration schemas in ?create_config_schema
- Explore spatial-temporal utilities in ?aggregate_temporal