Alerting¶
Alerts are stateful and transition-only: a sustained outage produces one message on entry and one on recovery, not a ping every cron tick. Two built-in transports ship in the box, plus a subclassing hook for anything else.
| Name | Transport | Use when |
|---|---|---|
| slack | Slack Incoming Webhook (Block Kit + colored attachments) | Slack-native ops chat |
| webhook | Generic HTTP POST with a canonical JSON envelope | PagerDuty Events, OpsGenie, internal incident bus, n8n / Make, a custom receiver, ... |
Run chap-checker alerts list for a copy-paste TOML snippet per alerter
(the snippets live on each alerter class, so adding a new transport is
one class + one snippet — no doc-generation step). Run
chap-checker alerts test to fire a synthetic transition through every
configured alerter without waiting for a real outage.
Per-instance opt-in¶
alerts = [...] on an instance lists which configured alerters fire for
it. The default is [] — no opt-in, no dispatch (the transition is
still tracked in the state file, so flipping the opt-in on later doesn't
backfill spurious "first failure" pings for sustained outages).
[instances.prod] # fans out to both alerters on every transition
url = "https://dhis2.example.com"
alerts = ["slack", "ops-webhook"]
[instances.staging] # silent, tracked but never paged
url = "https://staging.dhis2.example.com"
# no `alerts =` line, default is []
The string in alerts = [...] is the TOML section name, not the
transport. alerts = ["slack-prod", "slack-staging"] works fine if you
configure two [alerts.slack-prod] / [alerts.slack-staging] blocks
(see Per-channel routing
below).
Transports¶
Slack¶
Setup¶
- Open https://api.slack.com/apps and Create New App → From scratch (or reuse an existing app).
- Open Features → Incoming Webhooks and toggle activation on.
- Click Add New Webhook to Workspace and pick the channel the alerts
should land in. The channel is baked into the URL Slack gives you back;
chap-checker has no separate channel setting because Slack doesn't accept one on Incoming Webhooks.
- Copy the generated URL. It looks like https://hooks.slack.com/services/T.../B.../....
- Add an [alerts.slack] block:
[alerts.slack]
webhook_url_env = "SLACK_WEBHOOK_URL"
# webhook_url = "https://hooks.slack.com/services/..." # alternative
# notify_on = ["fail", "error", "warn"] # default
# timeout_s = 10.0 # default
Treat the webhook URL as a credential — anyone who has it can post to
the channel. Prefer webhook_url_env; if you must use webhook_url
inline, keep the config chmod 0600. The permission-warning loader
nags you when it isn't.
Slack's own full guide on Incoming Webhooks lives at https://api.slack.com/messaging/webhooks.
TOML options¶
| Field | Type | Default | Notes |
|---|---|---|---|
| webhook_url | string | (none) | Inline URL. Mutually exclusive with webhook_url_env. |
| webhook_url_env | string | (none) | Name of an env var holding the URL. Recommended. Mutually exclusive with webhook_url. |
| notify_on | list of Status | ["fail", "error", "warn"] | Which target statuses fire. Drop "warn" if you want only hard failures to page. "ok" would fire on recovery. |
| timeout_s | float | 10.0 | HTTP timeout for the POST. Must be > 0. |
Exactly one of webhook_url / webhook_url_env must be set. Config load
fails fast if both or neither are present.
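The exactly-one rule can be sketched as a small resolver. This is an illustrative stand-in, not the loader chap-checker ships: the function name `resolve_webhook_url`, the error messages, and the placeholder URL are all assumptions.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def resolve_webhook_url(section: Mapping[str, object],
                        env: Mapping[str, str] | None = None) -> str:
    """Return the webhook URL, insisting on exactly one of the two fields."""
    env = os.environ if env is None else env
    inline = section.get("webhook_url")
    env_name = section.get("webhook_url_env")
    # Both set or both missing: fail fast at config-load time.
    if (inline is None) == (env_name is None):
        raise ValueError("set exactly one of webhook_url / webhook_url_env")
    if inline is not None:
        return str(inline)
    try:
        return env[str(env_name)]
    except KeyError:
        raise ValueError(f"environment variable {env_name} is not set") from None


def is_rejected(section: Mapping[str, object], env: Mapping[str, str]) -> bool:
    """True when the section fails validation."""
    try:
        resolve_webhook_url(section, env)
    except ValueError:
        return True
    return False


# Placeholder URL for illustration only.
fake_env = {"SLACK_WEBHOOK_URL": "https://hooks.example.invalid/services/placeholder"}
resolved = resolve_webhook_url({"webhook_url_env": "SLACK_WEBHOOK_URL"}, fake_env)
print(resolved)
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real process environment.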
Message shape¶
Each Slack post is a Block Kit message wrapped in a legacy attachment with a colored left border so the channel reader sees red / yellow / green at a glance:
| Status | Hex | Slack swatch |
|---|---|---|
| FAIL / ERROR | #E01E5A | red |
| WARN | #ECB22E | yellow |
| recovery (now OK) | #2EB67D | green |
The header summarises the run; one attachment per transition follows with target name, URL, check name, status, and the operator-facing message from the check. Rendered in a Slack client, a two-transition batch looks roughly like:
┌─────────────────────────────────────────────────────────────────┐
│ chap-checker: 1 new failure, 1 recovery │
├─────────────────────────────────────────────────────────────────┤
│ ▍ FAILURE — `prod` │ ← red bar
│ ▍ https://dhis2.example.com │
│ ▍ `dhis2_chap_ping` *FAIL* DHIS2 route returned 502 - │
│ ▍ chap-core did not respond. │
├─────────────────────────────────────────────────────────────────┤
│ ▍ RECOVERY — `staging` │ ← green bar
│ ▍ https://staging.dhis2.example.com │
│ ▍ `dhis2_ping` *OK* /api/me returned 200 with username "ops". │
└─────────────────────────────────────────────────────────────────┘
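To make the message shape concrete, here is a minimal sketch of a payload with a header block and one colored attachment per transition. The `blocks`, `attachments`, and `color` fields are standard Slack Incoming Webhook payload fields; the exact block layout and wording chap-checker emits are not shown in this document, so the function `build_slack_payload` and its text formatting are assumptions.

```python
# Colors from the status table, keyed by the status after the flip.
STATUS_COLORS = {
    "fail": "#E01E5A",
    "error": "#E01E5A",
    "warn": "#ECB22E",
    "ok": "#2EB67D",
}


def _plural(count: int, singular: str, plural: str) -> str:
    return f"{count} {singular if count == 1 else plural}"


def build_slack_payload(transitions: list[dict]) -> dict:
    """Build one Slack message for a whole batch of transitions."""
    failures = sum(1 for t in transitions if t["kind"] == "failure")
    recoveries = len(transitions) - failures
    header = (
        f"chap-checker: {_plural(failures, 'new failure', 'new failures')}, "
        f"{_plural(recoveries, 'recovery', 'recoveries')}"
    )
    attachments = []
    for t in transitions:
        label = "FAILURE" if t["kind"] == "failure" else "RECOVERY"
        text = (
            f"*{label}*: `{t['target_name']}`\n"
            f"{t['target_url']}\n"
            f"`{t['check_name']}` *{t['current_status'].upper()}* {t['message']}"
        )
        attachments.append({
            "color": STATUS_COLORS[t["current_status"]],  # the colored left border
            "blocks": [{"type": "section", "text": {"type": "mrkdwn", "text": text}}],
        })
    return {
        "blocks": [{"type": "header", "text": {"type": "plain_text", "text": header}}],
        "attachments": attachments,
    }


batch = [
    {"kind": "failure", "target_name": "prod", "target_url": "https://dhis2.example.com",
     "check_name": "dhis2_chap_ping", "current_status": "fail",
     "message": "DHIS2 route returned 502 - chap-core did not respond."},
    {"kind": "recovery", "target_name": "staging", "target_url": "https://staging.dhis2.example.com",
     "check_name": "dhis2_ping", "current_status": "ok",
     "message": "/api/me returned 200."},
]
payload = build_slack_payload(batch)
print(payload["blocks"][0]["text"]["text"])
```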
Per-channel routing (fan out or split by channel)¶
The alerts = [...] list takes section names, so two [alerts.slack-*]
blocks let a single deployment page two different channels with
different opt-in rules:
# Loud channel — fires on warn / fail / error.
[alerts.slack-ops]
webhook_url_env = "SLACK_OPS_WEBHOOK_URL"
notify_on = ["fail", "error", "warn"]
# Quiet channel — only the high-severity stuff, for the on-call rotation.
[alerts.slack-oncall]
webhook_url_env = "SLACK_ONCALL_WEBHOOK_URL"
notify_on = ["fail", "error"]
timeout_s = 5.0
[instances.prod]
url = "https://dhis2.example.com"
alerts = ["slack-ops", "slack-oncall"]
[instances.staging]
url = "https://staging.dhis2.example.com"
alerts = ["slack-ops"] # WARN/FAIL hit ops chat, on-call stays quiet
Both blocks register a SlackAlerter; the section name is just the
opt-in key. Same pattern works for webhook (e.g.
[alerts.webhook-pagerduty] and [alerts.webhook-internal-bus]).
Failure modes¶
| Symptom | Most likely cause |
|---|---|
| HTTP 404 from the webhook | Slack revoked the webhook (app deleted, channel archived, workspace token rotated). Generate a new URL and update SLACK_WEBHOOK_URL. |
| HTTP 429 | Slack's per-webhook rate limit. The dispatcher will log the failure and retry on the next refresh tick; no special handling is needed. If sustained, route to a less-busy channel. |
| HTTP 500 / 502 / 503 from hooks.slack.com | Slack incident. Retried on next tick (see Delivery failure retry). |
| httpx.ConnectError / httpx.TimeoutException | Network egress problem from the daemon. Retried on next tick. |
| 2xx but the message never lands in the channel | Wrong webhook URL pointing at a different channel, or the channel was archived. Recreate the webhook against the right channel. |
Rate limits¶
Slack's Incoming Webhooks are rate-limited per webhook URL; the public
guidance is "roughly one message per second per channel" and burst
overshoot returns HTTP 429 (Slack
docs). chap-checker batches
all transitions from one refresh tick into a single Block Kit message,
so under normal operation you're well below the limit — a five-instance
deployment that all flip in the same tick is still one POST. The limit
is only a concern if you set --interval 2 and every instance flaps
hard every few seconds, which usually points at a different problem.
Generic webhook¶
Use this for any receiver that accepts an application/json POST:
PagerDuty Events, OpsGenie, an internal incident bus, n8n / Make
automations, a custom Flask endpoint, etc.
Setup¶
[alerts.webhook]
url_env = "MY_WEBHOOK_URL" # or url = "https://..." (exactly one)
notify_on = ["fail", "error", "warn"] # default
timeout_s = 10.0 # default
headers = { "Authorization" = "Bearer ..." } # optional literal HTTP headers
TOML options¶
| Field | Type | Default | Notes |
|---|---|---|---|
| url | string | (none) | Inline URL. Mutually exclusive with url_env. |
| url_env | string | (none) | Name of an env var holding the URL. Recommended. Mutually exclusive with url. |
| notify_on | list of Status | ["fail", "error", "warn"] | Statuses that fire (same shape as Slack). |
| timeout_s | float | 10.0 | HTTP timeout for the POST. |
| headers | table of {name: value} | {} | Extra HTTP headers. Typical use: Authorization / API-key headers. Values are literal; see Auth and credentials for the env-var caveat. |
HTTP request shape¶
POST {url}
Content-Type: application/json
Authorization: <whatever headers you set> # only if `headers = {...}`
<JSON body, see below>
The alerter raises when the response status is >= 400; the dispatcher
then logs the failure and skips the state save, so the transition is
retried on the next run. Any other status, including a 2xx with an
unexpected body, is treated as delivered.
Canonical JSON envelope (stable across versions)¶
Every POST carries this top-level shape:
| Field | Type | Notes |
|---|---|---|
| checker_version | string | Sender's version, e.g. "0.8.2". Use this to gate against future schema changes. |
| summary.failures | integer | Count of transitions with kind == "failure" in this batch. |
| summary.recoveries | integer | Count of transitions with kind == "recovery" in this batch. |
| transitions | array of Transition | One entry per status flip. Length is always summary.failures + summary.recoveries. |
A single POST may carry multiple transitions when several instances
flip in the same refresh cycle. Receivers should iterate transitions
rather than assuming length == 1.
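On the receiving side, iterating the batch might look like the sketch below. It relies only on the envelope and transition fields documented in this section; the function name `summarize_envelope` and the output format are illustrative choices.

```python
import json


def summarize_envelope(body: str) -> list[str]:
    """Parse one POST body and return a one-line summary per transition."""
    envelope = json.loads(body)
    transitions = envelope["transitions"]
    summary = envelope["summary"]
    # The envelope promises this invariant; checking it catches truncated bodies.
    if len(transitions) != summary["failures"] + summary["recoveries"]:
        raise ValueError("summary counts do not match transitions length")
    lines = []
    for t in transitions:  # never assume a single entry
        lines.append(
            f"{t['kind']}: {t['target_name']}/{t['check_name']} "
            f"{t['previous_status']} -> {t['current_status']}"
        )
    return lines


sample_body = json.dumps({
    "checker_version": "0.8.2",
    "summary": {"failures": 1, "recoveries": 1},
    "transitions": [
        {"kind": "failure", "target_name": "staging", "check_name": "dhis2_ping",
         "previous_status": "ok", "current_status": "fail"},
        {"kind": "recovery", "target_name": "prod", "check_name": "dhis2_chap_route",
         "previous_status": "fail", "current_status": "ok"},
    ],
})
summaries = summarize_envelope(sample_body)
print("\n".join(summaries))
```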
Transition object¶
Every entry under transitions[] looks like this:
| Field | Type | Notes |
|---|---|---|
| kind | "failure" \| "recovery" | failure = OK → non-OK; recovery = non-OK → OK. |
| target_name | string | The [instances.<name>] key from the config. |
| target_url | string | The DHIS2 base URL of the affected instance. |
| check_name | string | Check that flipped, e.g. dhis2_ping, dhis2_chap_route. Run chap-checker checks list for the full set. |
| previous_status | "ok" \| "warn" \| "fail" \| "error" \| "skipped" | Status before the flip. |
| current_status | same enum | Status after the flip. |
| message | string | Operator-facing detail from the check (e.g. "DHIS2 route returned 502 - chap-core did not respond."). |
| duration_ms | number | How long the check took during the run that detected the flip. |
| occurred_at | string (ISO-8601, UTC) | Timestamp the runner detected the flip. |
Example — failure transition¶
A single check on a single target flipped from OK to FAIL:
{
  "checker_version": "0.8.2",
  "summary": { "failures": 1, "recoveries": 0 },
  "transitions": [
    {
      "kind": "failure",
      "target_name": "prod",
      "target_url": "https://dhis2.example.com",
      "check_name": "dhis2_chap_ping",
      "previous_status": "ok",
      "current_status": "fail",
      "message": "DHIS2 route returned 502 - chap-core did not respond.",
      "duration_ms": 123.4,
      "occurred_at": "2026-05-16T11:30:00Z"
    }
  ]
}
Example — recovery transition¶
The same check coming back to OK on the next refresh:
{
  "checker_version": "0.8.2",
  "summary": { "failures": 0, "recoveries": 1 },
  "transitions": [
    {
      "kind": "recovery",
      "target_name": "prod",
      "target_url": "https://dhis2.example.com",
      "check_name": "dhis2_chap_ping",
      "previous_status": "fail",
      "current_status": "ok",
      "message": "chap-core responded (status 200).",
      "duration_ms": 87.1,
      "occurred_at": "2026-05-16T11:35:00Z"
    }
  ]
}
Example — mixed batch (multiple transitions in one POST)¶
Two instances flipping in the same refresh tick share one envelope:
{
  "checker_version": "0.8.2",
  "summary": { "failures": 1, "recoveries": 1 },
  "transitions": [
    {
      "kind": "failure",
      "target_name": "staging",
      "target_url": "https://staging.dhis2.example.com",
      "check_name": "dhis2_ping",
      "previous_status": "ok",
      "current_status": "fail",
      "message": "Authentication rejected (401) on /api/me - credentials no longer valid.",
      "duration_ms": 540.0,
      "occurred_at": "2026-05-16T11:30:02Z"
    },
    {
      "kind": "recovery",
      "target_name": "prod",
      "target_url": "https://dhis2.example.com",
      "check_name": "dhis2_chap_route",
      "previous_status": "fail",
      "current_status": "ok",
      "message": "DHIS2 route 'chap' is enabled.",
      "duration_ms": 92.5,
      "occurred_at": "2026-05-16T11:30:02Z"
    }
  ]
}
Trying it locally¶
Spin up a one-shot receiver and point chap-checker at it. The default
handler behind python3 -m http.server doesn't implement POST (you'd get
back a 501 Unsupported method ('POST')), so use a tiny dedicated handler:
# Terminal A: prints the body to stdout, replies 204. Ctrl+C to stop.
python3 -c "
from http.server import BaseHTTPRequestHandler, HTTPServer
class H(BaseHTTPRequestHandler):
    def do_POST(self):
        n = int(self.headers.get('content-length', 0))
        print(self.path, self.rfile.read(n).decode(), flush=True)
        self.send_response(204); self.end_headers()
HTTPServer(('127.0.0.1', 9999), H).serve_forever()
"
# (or use https://webhook.site for a shareable URL.)
# Terminal B: minimal config
cat > /tmp/wh.toml <<'EOF'
[instances.play]
url = "https://play.im.dhis2.org/dev"
username = "admin"
password = "district"
alerts = ["webhook"]
checks = ["dhis2_ping", "dhis2_system_info"]
[alerts.webhook]
url = "http://127.0.0.1:9999/notify"
notify_on = ["fail", "error", "warn"]
EOF
chmod 600 /tmp/wh.toml
# Fire a synthetic round-trip without waiting for a real outage.
chap-checker alerts test --config /tmp/wh.toml --kind both
Custom payload shape¶
If the receiver wants a different body (Slack Block Kit, Teams Adaptive
Card, PagerDuty Events v2, your own JSON contract), subclass
WebhookAlerter and override _build_payload(transitions) -> dict. The
HTTP transport, timeout, header handling, and >= 400 raise behavior
come from the base — that's exactly how SlackAlerter is implemented.
# my_alerter.py
from chap_checker.alerts.base import Transition, register_alerter
from chap_checker.alerts.webhook import WebhookAlerter
from typing import Any, ClassVar

@register_alerter("pagerduty")
class PagerDutyEventsAlerter(WebhookAlerter):
    name: ClassVar[str] = "pagerduty"
    description: ClassVar[str] = "PagerDuty Events API v2 - one event per transition."

    def _build_payload(self, transitions: list[Transition]) -> dict[str, Any]:
        # PagerDuty wants one POST per event; this example only emits the
        # first transition. Real code would call notify() in a loop, or
        # adapt the base to support multiple POSTs per notify call.
        t = transitions[0]
        return {
            "routing_key": "<your-integration-key>",
            "event_action": "trigger" if t.kind == "failure" else "resolve",
            "dedup_key": f"{t.target_name}:{t.check_name}",
            "payload": {
                "summary": f"{t.target_name} {t.check_name}: {t.message}",
                "source": t.target_url,
                "severity": "error" if t.kind == "failure" else "info",
            },
        }
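Since the example above only handles the first transition, one possible way to cover a whole batch is to build one event per transition and POST each separately. The sketch below shows only the event construction as a pure function. The `Transition` dataclass here is a local stand-in with the fields this page documents, not the real `chap_checker` class, and `pagerduty_events` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Transition:
    """Stand-in for chap-checker's Transition (assumed attribute names)."""
    kind: str
    target_name: str
    target_url: str
    check_name: str
    message: str


def pagerduty_events(transitions: list[Transition], routing_key: str) -> list[dict[str, Any]]:
    """Build one PagerDuty Events v2 body per transition."""
    events = []
    for t in transitions:
        events.append({
            "routing_key": routing_key,
            "event_action": "trigger" if t.kind == "failure" else "resolve",
            # Same dedup_key on trigger and resolve so the recovery closes the incident.
            "dedup_key": f"{t.target_name}:{t.check_name}",
            "payload": {
                "summary": f"{t.target_name} {t.check_name}: {t.message}",
                "source": t.target_url,
                "severity": "error" if t.kind == "failure" else "info",
            },
        })
    return events


sample = [
    Transition("failure", "staging", "https://staging.dhis2.example.com",
               "dhis2_ping", "Authentication rejected (401) on /api/me."),
    Transition("recovery", "prod", "https://dhis2.example.com",
               "dhis2_chap_route", "DHIS2 route 'chap' is enabled."),
]
events = pagerduty_events(sample, routing_key="<your-integration-key>")
print([e["event_action"] for e in events])
```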
Auth and credentials¶
Any literal headers (bearer tokens, basic-auth, API keys) go in the
headers dict. Keep the config chmod 0600 since values are plaintext
for now — the permission-warning loader nags you when they're inline
and the file is group/world-readable. Env-var substitution per header
(e.g. Authorization = "env:WEBHOOK_TOKEN") is a planned follow-up —
until then, either:
- Put the token in the file and rely on filesystem permissions, or
- Subclass WebhookAlerter and read the token from os.environ inside __init__.
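For the subclass route, the environment lookup itself can live in a small helper that a custom `__init__` calls. The constructor signature of WebhookAlerter is not documented here, so this sketch deliberately stops at the helper; `headers_with_env_token` and the `WEBHOOK_TOKEN` variable name are hypothetical.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def headers_with_env_token(headers: Mapping[str, str],
                           env_var: str = "WEBHOOK_TOKEN",
                           env: Mapping[str, str] | None = None) -> dict[str, str]:
    """Return a copy of `headers` with a bearer token read from the environment."""
    env = os.environ if env is None else env
    token = env.get(env_var)
    if not token:
        # Fail at startup rather than sending unauthenticated alerts later.
        raise RuntimeError(f"environment variable {env_var} is not set")
    merged = dict(headers)
    merged["Authorization"] = f"Bearer {token}"
    return merged


merged_headers = headers_with_env_token(
    {"X-Source": "chap-checker"},
    env={"WEBHOOK_TOKEN": "example-token"},
)
print(sorted(merged_headers))
```

This keeps the token out of the TOML file entirely, at the cost of maintaining a small subclass.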
Testing the wiring¶
Verify the dispatch path without waiting for a real outage:
chap-checker alerts test # send through every configured alerter
chap-checker alerts test --name slack # send through one
chap-checker alerts test --kind both # fire both an OK->FAIL and a FAIL->OK
chap-checker --json alerts test # parseable AlertTestReport JSON
The command posts a synthetic transition to each configured alerter
(Slack post, webhook POST, anything custom you've wired up). The default
--kind failure posts an OK→FAIL; --kind recovery posts a FAIL→OK;
--kind both sends both transitions. Exit code reflects per-alerter
delivery success.
Warning
alerts test posts real messages to every configured receiver
on every invocation — Slack channels get a real ping, webhook
endpoints get a real POST. Run it manually after credential
rotation, not on a cron.
Transition semantics¶
State is persisted to ./chap-checker.state.json (override with
--state <path>). Every run computes the diff against the saved state
and decides whether to dispatch.
| Previous | Current | Fires? |
|---|---|---|
| OK | FAIL / ERROR / WARN | yes (failure) |
| FAIL / ERROR / WARN | OK | yes (recovery) |
| FAIL ↔ ERROR ↔ WARN | (intra-failure flip) | no |
| any | SKIPPED | no (SKIPPED never persists) |
| SKIPPED | any | depends on the previous real status |
The "two-sided guard": at least one side of the transition must be OK,
otherwise we're just relabeling an ongoing outage.
The "skip-through-SKIPPED" rule: SKIPPED results are not written to
the state file. If dhis2_chap_route was FAIL, then upstream
dhis2_ping flapped and made it SKIPPED, then everything recovered
and it's OK again, the FAIL→OK transition still fires (the state
remembered FAIL across the SKIPPED window).
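The table and the two rules can be condensed into a few lines. This is a sketch of the decision logic only, with `observe` as a hypothetical name. It updates the in-memory state immediately, so the delivery-failure handling described in the next section is left out, and recording a never-seen key without firing is an assumption.

```python
from __future__ import annotations


def observe(state: dict[str, str], key: str, current: str) -> str | None:
    """Record `current` for `key` and return "failure", "recovery", or None."""
    if current == "skipped":
        return None  # SKIPPED is never persisted, so the last real status survives
    previous = state.get(key)
    state[key] = current
    if previous is None:
        return None  # assumption: first observation is recorded without firing
    # Two-sided guard: exactly one side of the flip must be OK.
    if previous == "ok" and current != "ok":
        return "failure"
    if previous != "ok" and current == "ok":
        return "recovery"
    return None  # ok -> ok, or an intra-failure relabel such as fail -> error


state = {"prod::dhis2_chap_route": "ok"}
key = "prod::dhis2_chap_route"
# The scenario from the text: FAIL, then SKIPPED, then OK again.
results = [observe(state, key, status) for status in ("fail", "error", "skipped", "ok")]
print(results)
```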
Delivery failure retry¶
If an alerter throws (an HTTP response with status >= 400, a transport error, a malformed reply), the dispatcher swallows the exception so it can't change the run's exit code, but it does not save the new state. The next run recomputes the same transition and retries. Operationally, a Slack (or webhook receiver) outage during a real failure produces one alert on the next cron tick after the receiver recovers, not zero alerts.
The retry is per-batch, not per-alerter: if a single instance fans out
to both slack and webhook and Slack errors while the webhook
succeeds, the state isn't saved and both receivers see the
transition again on the next tick. Per-alerter dedupe is a deliberate
non-goal — operators running with two transports usually want
at-least-once on every transport rather than partial silence.
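The per-batch behavior can be sketched as follows: every alerter is attempted, and the state is saved only when all of them succeeded. The function `dispatch` and the callables passed to it are illustrative stand-ins, not chap-checker's real dispatcher API.

```python
from typing import Callable

Alerter = Callable[[list[dict]], None]


def dispatch(transitions: list[dict], alerters: dict[str, Alerter],
             save_state: Callable[[], None]) -> bool:
    """Send the batch to every alerter; persist state only if all succeeded."""
    delivered = True
    for name, send in alerters.items():
        try:
            send(transitions)
        except Exception as exc:  # swallowed so the run's exit code is unaffected
            print(f"alerter {name} failed: {exc}")
            delivered = False
    if delivered:
        save_state()
    return delivered


received = {"slack": 0, "webhook": 0}
flags = {"slack_up": False}
saved: list[str] = []


def slack(batch: list[dict]) -> None:
    if not flags["slack_up"]:
        raise RuntimeError("HTTP 503 from hooks.slack.com")
    received["slack"] += 1


def webhook(batch: list[dict]) -> None:
    received["webhook"] += 1


batch = [{"kind": "failure", "target_name": "prod", "check_name": "dhis2_ping"}]
alerters = {"slack": slack, "webhook": webhook}

first = dispatch(batch, alerters, lambda: saved.append("tick-1"))   # Slack is down
flags["slack_up"] = True
second = dispatch(batch, alerters, lambda: saved.append("tick-2"))  # same batch, recomputed
print(first, second, received, saved)
```

In this run the webhook receives the transition twice, which is the at-least-once trade-off described above.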
State file¶
Schema (you should rarely need to look at this):
{
  "version": 1,
  "states": {
    "prod::dhis2_ping": { "status": "ok", "since": "2026-05-13T15:00:00Z" },
    "prod::dhis2_chap_route": { "status": "fail", "since": "2026-05-13T15:05:12Z" }
  }
}
Corrupt or schema-mismatched files log a warning and are treated as empty — alert bookkeeping must not fail a cron run that is otherwise fine. The file is left on disk for human inspection.
The parent directory is created on save (mkdir -p) so --state
/var/lib/chap-checker/state.json on a fresh host doesn't crash
dispatch. Individual writes use unique tmp files via tempfile.mkstemp
so overlapping runs don't race on os.replace.
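A minimal sketch of that save and load behavior, assuming the schema shown above. `save_state` and `load_state` are hypothetical names, and the real implementation may differ in details such as temp-file naming and logging.

```python
import contextlib
import json
import logging
import os
import tempfile
from pathlib import Path

log = logging.getLogger("state-sketch")


def save_state(path: Path, states: dict) -> None:
    """Write the state atomically: unique temp file, then os.replace."""
    path.parent.mkdir(parents=True, exist_ok=True)  # mkdir -p
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"version": 1, "states": states}, fh)
        os.replace(tmp, path)  # atomic when both paths are on one filesystem
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp)
        raise


def load_state(path: Path) -> dict:
    """Return saved states; corrupt or mismatched files count as empty."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {}
    except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
        log.warning("ignoring unreadable state file %s: %s", path, exc)
        return {}
    if not isinstance(data, dict) or data.get("version") != 1 \
            or not isinstance(data.get("states"), dict):
        log.warning("ignoring state file %s with unexpected schema", path)
        return {}
    return data["states"]  # the file itself is left on disk either way


def demo() -> tuple[dict, dict]:
    with tempfile.TemporaryDirectory() as root:
        path = Path(root) / "var" / "lib" / "state.json"  # parents do not exist yet
        save_state(path, {"prod::dhis2_ping": {"status": "ok", "since": "2026-05-13T15:00:00Z"}})
        good = load_state(path)
        path.write_text("{not json", encoding="utf-8")
        bad = load_state(path)
        return good, bad
```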
Concurrency lock¶
The full load → compute → dispatch → save cycle is serialised across
processes by an fcntl.flock on a sidecar <state>.lock file. Without
the lock, two overlapping runs — an overdue cron tick colliding with a
manual chap-checker verify, or tui (local mode) and serve --alerts
sharing one state file — could each read the same prior state and
re-emit the same transition, duplicating every alert to every receiver.
- Acquisition: each runner opens the lock file O_RDWR | O_CREAT, takes an exclusive non-blocking flock, and polls with asyncio.sleep(0.1) so the daemon's event loop keeps serving HTTP while it waits.
- Timeout: 30 seconds by default. On timeout the dispatcher logs at WARNING and skips that tick's dispatch (state is left as-is so the next tick recomputes the same transitions). The check loop and the run's exit code are unaffected.
- Footprint: a zero-byte <state>.lock file is created next to the state file the first time dispatch runs. It is safe to delete when no chap-checker process is running.
Windows
The lock uses fcntl, which ships only on POSIX. On Windows the
context manager falls through to no-lock semantics — every write is
still atomic via os.replace, but two concurrent dispatch
processes on the same state file can still duplicate alerts. The
supported pattern on Windows is to run a single dispatch process:
either chap-checker serve --alerts or a cron-driven verify,
not both against the same state file.
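On POSIX hosts, the acquisition and timeout behavior described in this section can be approximated with the sketch below. `state_lock` is a hypothetical name, and raising `TimeoutError` is a simplification: the text says the real dispatcher logs a warning and skips the tick instead.

```python
import asyncio
import contextlib
import fcntl  # POSIX only
import os
import tempfile
import time


@contextlib.asynccontextmanager
async def state_lock(lock_path: str, timeout_s: float = 30.0, poll_s: float = 0.1):
    """Hold an exclusive flock on `lock_path`, polling without blocking the loop."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        deadline = time.monotonic() + timeout_s
        while True:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:  # another open file description holds the lock
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"could not lock {lock_path}") from None
                await asyncio.sleep(poll_s)  # other tasks keep running meanwhile
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


async def demo() -> str:
    with tempfile.TemporaryDirectory() as root:
        lock_path = os.path.join(root, "state.json.lock")
        async with state_lock(lock_path):
            try:
                # A second holder is refused while the first is active.
                async with state_lock(lock_path, timeout_s=0.3):
                    return "second holder got in"
            except TimeoutError:
                pass
        async with state_lock(lock_path, timeout_s=0.3):
            return "locked out, then acquired after release"
```

The demo opens the lock file twice in one process; flock treats the two descriptors as separate holders, which mimics two overlapping runs.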