Developing custom models with CHAP¶
CHAP is designed to allow model developers to easily develop their own models outside CHAP and use CHAP to benchmark/evaluate their models, or to import and use utility functions from CHAP in their own models.
We here provide guides for implementing custom models in Python and R. The recommended flow is slightly different for the two different languages, but the general idea is the same.
Developing custom models in Python¶
Code base structure¶
We recommend that you develop your model through a custom Python project and not inside the CHAP codebase. Your Python code should have command line entry points for training the model and predicting based on a trained model. This could e.g. simply be two Python files that are run with some command line arguments or a command line interface (e.g. built with something like argparse or typer).
Your code base should as a minimum have:
An entry point for training the model (e.g. a file called train.py)
An entry point for predicting based on a trained model (e.g. a file called predict.py)
An MLProject configuration file for your model that specifies the entry points (see the section about integration with CHAP below)
An easy way to get started is to clone our example barebone repository for a Python model, which can be found here. This will give you a train.py and predict.py file that you can use as starting points, as well as an MLProject configuration file.
Step 1: Test/develop your model outside CHAP¶
The following is a suggested workflow for developing and testing your model. For ease of development, we recommend first running your model without fully integrating it with CHAP. This makes it easier to debug and test your model in isolation. You should still make sure your model handles the data formats that CHAP uses; the easiest way is to test directly on example data provided by CHAP. You can find such example data here.
Download the files from that directory, and test that you can train a model using the training_data.csv. You should write your trained model to file using the file name that is provided to your train method.
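If you want a quick sanity check of the data format before writing any model code, something like the following can be used (a minimal sketch; it assumes the example files have been downloaded into a local example_data/v0/ directory, matching the paths used in the examples below):

import pandas as pd

# Load the example training data and inspect its structure
df = pd.read_csv('example_data/v0/training_data.csv')
print(df.columns.tolist())  # expect columns such as time_period, location, rainfall, mean_temperature, disease_cases
print(df.head())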
Here is an example of a train.py file that does a simple linear regression with the provided example data:
import pandas as pd
from sklearn.linear_model import LinearRegression
import joblib


def train(csv_fn, model_fn):
    df = pd.read_csv(csv_fn)
    features = ['rainfall', 'mean_temperature']
    X = df[features]
    Y = df['disease_cases']
    Y = Y.fillna(0)  # set NaNs to zero (not a good solution, just for the example to work)
    model = LinearRegression()
    model.fit(X, Y)
    joblib.dump(model, model_fn)


train('example_data/v0/training_data.csv', 'model.pkl')
Note that a model is written to file. Your predict code needs to take this model as input and use it to make predictions. The prediction entry point needs to take these files as input:
A model file name
A file with historic data (data before the prediction period)
A file with future climate data (for the period we want to predict cases for)
A file name that will be used when writing the predictions
The following shows an example of a prediction script. Note that we don't write point predictions to file; instead we write samples that represent possible outcomes. How predictions are sampled depends on the model in use; here we sample from a normal distribution centered on the predicted values.
import pandas as pd
import numpy as np
import joblib


def predict(model_fn, historic_data_fn, future_climatedata_fn, predictions_fn):
    df = pd.read_csv(future_climatedata_fn)
    cols = ['rainfall', 'mean_temperature']
    X = df[cols]
    model = joblib.load(model_fn)
    predictions = model.predict(X)

    train_data = pd.read_csv(historic_data_fn)
    y_train = train_data['disease_cases'].fillna(0)  # fill NaNs as in train.py so the residual variance is defined
    X_train = train_data[cols]

    # Estimate the residual variance from the training data
    residuals = y_train - model.predict(X_train)
    residual_variance = np.var(residuals)

    # Generate sampled predictions by adding Gaussian noise
    n_samples = 20  # number of samples you want
    for i in range(n_samples):
        noise = np.random.normal(0, np.sqrt(residual_variance), size=predictions.shape)
        # add the samples to the dataframe we write as output
        df[f'sample_{i}'] = predictions + noise

    df.to_csv(predictions_fn, index=False)


predict('model.pkl', 'example_data/v0/historic_data.csv', 'example_data/v0/future_data.csv', 'predictions.csv')
Make sure you are able to train your model and generate prediction samples before moving on to the next step, and that your predictions file contains columns named sample_0, sample_1, and so on.
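A small check like the following can be used to verify the output format (a sketch only; it assumes the predictions were written to predictions.csv as in the example above):

import pandas as pd

preds = pd.read_csv('predictions.csv')
sample_cols = [c for c in preds.columns if c.startswith('sample_')]
assert sample_cols, 'expected columns named sample_0, sample_1, ...'
print(f'found {len(sample_cols)} sample columns')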
Step 2: Running your model through CHAP¶
If your model is able to write prediction samples to a CSV file as shown above, it should be fairly easy to run the model through the CHAP command line interface. Make sure you have chap-core installed before continuing.
The benefit of running your model through CHAP is that you can let CHAP handle all the data processing and evaluation, and you can easily compare your model to other models. To do this, you need to create an MLProject configuration file for your model. This file should specify the entry points for training and predicting, as well as any dependencies your model has. You can then run your model through CHAP using the CHAP CLI.
Here is an example of an MLProject configuration file for the example model above:
name: some_model_name

adapters: {'disease_cases': 'disease_cases',
           'location': 'location',
           'time_period': 'time_period',
           'rainfall': 'rainfall',
           'mean_temperature': 'mean_temperature'}

entry_points:
  train:
    parameters:
      train_data: path
      model: str
    command: "python train.py {train_data} {model}"

  predict:
    parameters:
      historic_data: path
      future_data: path
      model: str
      out_file: path
    command: "python predict.py {model} {historic_data} {future_data} {out_file}"
The important part here is that the entry points for train and predict give commands that work with your model. These will be run inside the directory containing your model code. If you create a train.py and predict.py as shown above, you need to make sure they can be run from the command line. See examples of how this can be done in the minimalist_example repository; a minimal sketch is also shown below.
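For illustration, one way to make train.py callable from the command above is a small __main__ block that reads its arguments from sys.argv (a minimal sketch under that assumption; the minimalist_example repository may instead use argparse or typer):

import sys

if __name__ == "__main__":
    # arguments arrive in the order given by the MLProject command:
    # python train.py {train_data} {model}
    train_data_fn, model_fn = sys.argv[1], sys.argv[2]
    train(train_data_fn, model_fn)

This would replace the hard-coded call at the bottom of the development version of train.py; predict.py can be wrapped the same way with its four arguments.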
Place this MLProject file in the same directory as your model code. You can then run your model through CHAP using the following command:
$ chap evaluate --model-name /path/to/your/model/directory --dataset-name ISIMIP_dengue_harmonized --dataset-country brazil --report-filename report.pdf --ignore-environment --debug
Note the --ignore-environment flag. This means that we don't ask CHAP to use Docker or a Python environment when running the model. Instead the model will be run directly in the environment you are currently in. This usually works fine when developing a model, but requires you to have both chap-core and the dependencies of your model available. The next step shows how to run your model in an isolated environment.
If the above command runs without any error messages, you have successfully evaluated your model through CHAP, and a file report.pdf should have been generated with predictions for various regions.
A folder runs/model_name/latest should also have been generated that contains a copy of your model directory along with the data files used. This can be useful to inspect if something goes wrong.
Step 3: Defining an environment for your model¶
CHAP currently supports specifying a Docker image or a Python environment file that will be used when running your model.
We implement the MLProject standard, as described in the MLflow documentation (except for conda support). Specifying a Python environment requires that you have pyenv installed and available.
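As a rough sketch of what this can look like, the MLProject format described in the MLflow documentation lets a project reference either a python_env.yaml file or a Docker image. The field names below follow MLflow's documented format; the exact fields and behaviour supported by CHAP may differ, so check the CHAP documentation for your version:

# In the MLProject file, reference a Python environment file...
python_env: python_env.yaml

# ...or, alternatively, a Docker image:
# docker_env:
#   image: your-docker-image-name

A python_env.yaml file would then list the Python version and the dependencies your model needs (here pandas, scikit-learn and joblib are simply the packages used by the example model above):

python: "3.11"
dependencies:
  - pandas
  - scikit-learn
  - joblib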