Runbooks
Operational playbooks for production alerts and recurring procedures.
Per Section 17.6 of docs/PRD/17-engineering-standards.md, every alert must have a paired runbook. CI rejects MRs that add a PrometheusRule without a matching, schema-valid runbook in this directory.
Two audiences
A runbook serves both:
| Reader | What they need |
|---|---|
| Human at 3am | Calm, ordered steps. "What's broken, who's affected, what do I run, how do I know it worked." |
| L2 on-call agent (read-only) | Machine-parseable diagnostic queries it can execute, structured impact assessment, clear handoff conditions |
The structured contract below makes runbooks usable by both.
Runbook tiers
The contract scales with severity. Sev 0 / Sev 1 alerts need full runbooks, Sev 2 alerts need standard runbooks, and Sev 3 / Sev 4 can be pointers.
| Tier | Required for | Required sections | Template |
|---|---|---|---|
| Full | Sev 0; Sev 1 | frontmatter, symptoms, impact, diagnose, mitigate, verify, postmortem-trigger, references | _template-full.md |
| Standard | Sev 2 | frontmatter, symptoms, impact, diagnose, mitigate, references | _template-standard.md |
| Pointer | Sev 3; Sev 4 | frontmatter, symptoms, impact, references | _template-pointer.md |
The CI gate enforces tier-appropriate requirements based on the severity_default declared in the frontmatter.
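A minimal sketch of how the tier check could work, assuming the gate already knows which sections a runbook contains. The function and constant names (`tierForSeverity`, `missingSections`, `REQUIRED_SECTIONS`) are hypothetical and are not taken from the actual CI gate; the section lists are copied from the table above.

```typescript
// Hypothetical sketch of the tier check; not the real CI gate.
type Severity = "sev0" | "sev1" | "sev2" | "sev3" | "sev4";
type Tier = "full" | "standard" | "pointer";

// Required sections per tier, copied from the tier table.
const REQUIRED_SECTIONS: Record<Tier, readonly string[]> = {
  full: [
    "frontmatter", "symptoms", "impact", "diagnose",
    "mitigate", "verify", "postmortem-trigger", "references",
  ],
  standard: ["frontmatter", "symptoms", "impact", "diagnose", "mitigate", "references"],
  pointer: ["frontmatter", "symptoms", "impact", "references"],
};

// Map the declared severity_default to its runbook tier.
function tierForSeverity(severity: Severity): Tier {
  switch (severity) {
    case "sev0":
    case "sev1":
      return "full";
    case "sev2":
      return "standard";
    case "sev3":
    case "sev4":
      return "pointer";
  }
}

// Return the required sections that the runbook does not contain.
function missingSections(severity: Severity, present: readonly string[]): string[] {
  const required = REQUIRED_SECTIONS[tierForSeverity(severity)];
  return required.filter((section) => present.indexOf(section) === -1);
}
```

An empty result means the runbook satisfies its tier; anything else is the list of sections to add before CI goes green.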
Frontmatter schema (machine contract)
id: <kebab-case-id-matching-alertname> # required, unique
title: <Human-readable title> # required
domain: <one of 12 domains> # required, enum: identity, onboarding, accounts, billing, payouts, trading, risk-engine, audit, notifications, admin, platform, infra
severity_default: <sev0..sev4> # required
linked_alerts: [<alert-name>, ...] # required (≥1 for full / standard)
linked_slos: [<slo-id>, ...] # optional
owner: <gitlab-handle> # required
last_reviewed: <YYYY-MM-DD> # required
review_cadence_days: 30 | 90 | 180 # required (sev0=30, sev1=90, others=180)
agent_executable: true | false # required
agent_executable: true means the L2 agent may execute the Diagnose queries autonomously (read-only). false means human-only.
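The schema above can be checked with a small validator. This is a sketch only: `validateFrontmatter` and its error strings are hypothetical, it assumes the frontmatter has already been parsed into a plain object, and it does not enforce the per-severity cadence mapping (sev0=30, sev1=90, others=180).

```typescript
// Hypothetical frontmatter validator; the real CI schema check may differ.
const DOMAINS: readonly string[] = [
  "identity", "onboarding", "accounts", "billing", "payouts", "trading",
  "risk-engine", "audit", "notifications", "admin", "platform", "infra",
];
const SEVERITIES: readonly string[] = ["sev0", "sev1", "sev2", "sev3", "sev4"];
const CADENCES: readonly number[] = [30, 90, 180];

function isNonEmptyString(v: unknown): v is string {
  return typeof v === "string" && v.trim().length > 0;
}

// Returns a list of human-readable errors; an empty list means valid.
function validateFrontmatter(fm: Record<string, unknown>): string[] {
  const errors: string[] = [];
  const id = fm["id"];
  const domain = fm["domain"];
  const severity = fm["severity_default"];
  const alerts = fm["linked_alerts"];
  const reviewed = fm["last_reviewed"];
  const cadence = fm["review_cadence_days"];

  if (!isNonEmptyString(id) || !/^[a-z0-9]+(-[a-z0-9]+)*$/.test(id)) {
    errors.push("id: required, kebab-case");
  }
  if (!isNonEmptyString(fm["title"])) errors.push("title: required");
  if (typeof domain !== "string" || DOMAINS.indexOf(domain) === -1) {
    errors.push("domain: must be one of the 12 domains");
  }
  if (typeof severity !== "string" || SEVERITIES.indexOf(severity) === -1) {
    errors.push("severity_default: must be sev0..sev4");
  }
  // Full and standard tiers (sev0..sev2) need at least one linked alert.
  const needsAlert = severity === "sev0" || severity === "sev1" || severity === "sev2";
  if (!Array.isArray(alerts) || !alerts.every(isNonEmptyString)) {
    errors.push("linked_alerts: must be a list of alert names");
  } else if (needsAlert && alerts.length === 0) {
    errors.push("linked_alerts: need at least one alert for full / standard");
  }
  if (!isNonEmptyString(fm["owner"])) errors.push("owner: required");
  if (
    typeof reviewed !== "string" ||
    !/^\d{4}-\d{2}-\d{2}$/.test(reviewed) ||
    isNaN(Date.parse(reviewed))
  ) {
    errors.push("last_reviewed: must be YYYY-MM-DD");
  }
  if (typeof cadence !== "number" || CADENCES.indexOf(cadence) === -1) {
    errors.push("review_cadence_days: must be 30, 90, or 180");
  }
  if (typeof fm["agent_executable"] !== "boolean") {
    errors.push("agent_executable: must be true or false");
  }
  return errors;
}
```

Returning every error at once, rather than failing on the first, lets an author fix the frontmatter in a single push.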
Diagnose step format (machine contract)
The L2 agent extracts Diagnose blocks by parsing markdown headings and fenced code blocks. Authoring rules:
- Each step gets an explicit ID: `D1`, `D2`, `D3`, …
- Each step's heading is `### D<n> — <Short label>`.
- Each step contains exactly one fenced code block immediately under the heading.
- The code block is tagged with the executor: `promql`, `logql`, `bash`, `psql`, or `sql`.
- Each step ends with **What this tells you**: a one-sentence interpretation of the output.
Anti-pattern: prose-embedded queries. The agent cannot extract them.
### D1 — Check current error rate by route
```promql
topk(10,
sum by (route, code) (rate(http_requests_total{job="api",code=~"5.."}[5m]))
)
```
**What this tells you**: which route(s) are failing.
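To show why the authoring rules matter, here is a sketch of an extractor that follows them literally. It is not the L2 agent's actual parser: `extractDiagnoseSteps` and `DiagnoseStep` are hypothetical names, and the sketch assumes the interpretation line is written as `**What this tells you**:` as in the example above. A step that breaks any rule (prose-embedded query, unknown executor, missing interpretation) is simply skipped.

```typescript
// Hypothetical extractor for Diagnose steps; illustrative only.
interface DiagnoseStep {
  id: string;
  label: string;
  executor: string;
  query: string;
  interpretation: string;
}

const EXECUTORS: readonly string[] = ["promql", "logql", "bash", "psql", "sql"];

function extractDiagnoseSteps(markdown: string): DiagnoseStep[] {
  const lines = markdown.split(/\r?\n/);
  const steps: DiagnoseStep[] = [];
  // The heading format, including its dash character, is the one given in the authoring rules.
  const heading = /^### (D\d+) — (.+)$/;
  const fenceOpen = /^```(\w+)\s*$/;
  const interp = /^\*\*What this tells you\*\*:\s*(.+)$/;

  for (let i = 0; i < lines.length; i++) {
    const h = heading.exec(lines[i]);
    if (!h) continue;

    // The fenced block must be the first non-blank content under the heading.
    let j = i + 1;
    while (j < lines.length && lines[j].trim() === "") j++;
    const open = j < lines.length ? fenceOpen.exec(lines[j]) : null;
    if (!open || EXECUTORS.indexOf(open[1]) === -1) continue;

    // Collect the query body up to the closing fence.
    const body: string[] = [];
    j++;
    while (j < lines.length && lines[j].trim() !== "```") {
      body.push(lines[j]);
      j++;
    }
    if (j >= lines.length) continue; // unterminated fence

    // The interpretation must follow the closing fence.
    let k = j + 1;
    while (k < lines.length && lines[k].trim() === "") k++;
    const note = k < lines.length ? interp.exec(lines[k]) : null;
    if (!note) continue;

    steps.push({
      id: h[1],
      label: h[2].trim(),
      executor: open[1],
      query: body.join("\n"),
      interpretation: note[1].trim(),
    });
    i = k; // resume scanning after this step
  }
  return steps;
}
```

Run against the `D1` example above, this yields one step with executor `promql`; the same query written inline in a paragraph yields nothing, which is the anti-pattern in practice.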
Mitigate step format
- Each step gets an explicit ID: `M1`, `M2`, `M3`, …
- Each step's heading is `### M<n> — <First mitigation>`.
- Each step declares **Who can execute** (`anyone on call` or `founder only`) and **Side effects**.
- Each step contains commands in fenced code blocks.
In Phase 1, the L2 agent never auto-executes Mitigate steps. Phase 2 (L3) introduces constrained Mitigate execution; until then, mitigation is human-driven.
Authoring workflow
For a new alert:
- Write the alert rule with a `runbook_url: https://docs.astrixtrading.com/runbooks/<id>` annotation.
- Open the MR. CI fails: "missing runbook".
- Copy the appropriate template (full / standard / pointer) to `docs/runbooks/<id>.md`.
- Fill it in. The L2 agent can pre-fill the Diagnose section based on the alert's PromQL and `relevant_log_filters` annotation; let it.
- Push. CI re-runs and validates frontmatter + structure.
- Founder reviews the Mitigate section if it touches money-flow or destructive ops (gates-as-review per Section 17.3); otherwise auto-merges on green.
Drift prevention
Layer 4 audit (scripts/audit-runbook-freshness.ts) runs daily:
- If `today - last_reviewed > review_cadence_days`: warn (Slack message to `#alerts-monitoring`)
- If `today - last_reviewed > 2 × review_cadence_days`: hard-fail next CI run touching the runbook
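The two thresholds reduce to a small classification function. The sketch below is not the contents of `scripts/audit-runbook-freshness.ts`; `classifyFreshness` and `daysBetween` are hypothetical names, and dates are assumed to be `YYYY-MM-DD` strings compared in UTC.

```typescript
// Hypothetical freshness classifier; illustrative only.
type Freshness = "fresh" | "warn" | "hard-fail";

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days from one YYYY-MM-DD date to another, both taken as UTC midnight.
function daysBetween(fromIso: string, toIso: string): number {
  const from = Date.parse(fromIso + "T00:00:00Z");
  const to = Date.parse(toIso + "T00:00:00Z");
  return Math.floor((to - from) / MS_PER_DAY);
}

function classifyFreshness(
  lastReviewed: string,
  reviewCadenceDays: number,
  today: string,
): Freshness {
  const age = daysBetween(lastReviewed, today);
  // Check the stricter threshold first so it is not shadowed by the warning.
  if (age > 2 * reviewCadenceDays) return "hard-fail";
  if (age > reviewCadenceDays) return "warn";
  return "fresh";
}
```

Both comparisons are strict, matching the rules above: a sev0 runbook reviewed exactly 30 days ago is still fresh, and it starts warning on day 31.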
Cross-runbook patterns
Shared content lives in _partials/ and is referenced via Markdown link + docs-portal include. Keeps maintenance cheap as runbook count grows.
Lifecycle
| Stage | Trigger |
|---|---|
| Draft | Created with a new alert; agent may have pre-filled |
| Active | Merged; passes all gates |
| Stale | last_reviewed exceeds cadence; warning surfaces |
| Archived | Linked alert deleted; runbook moves to archive/ with a header note |
CI prevents deleting a runbook while alerts still reference it.
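One way such a deletion guard could work, assuming alerts reference runbooks through the `runbook_url` annotation described in the authoring workflow. `AlertRule`, `runbookIdFromUrl`, and `blockedDeletions` are hypothetical names; the real CI check may resolve references differently.

```typescript
// Hypothetical deletion guard; illustrative only.
interface AlertRule {
  alert: string;
  runbookUrl: string; // value of the runbook_url annotation
}

// Extract <id> from a URL ending in /runbooks/<id>.
function runbookIdFromUrl(url: string): string | null {
  const match = /\/runbooks\/([^\/?#]+)\/?$/.exec(url);
  return match ? match[1] : null;
}

// For each runbook being deleted, list the alerts that still point at it.
function blockedDeletions(
  deletedIds: readonly string[],
  rules: readonly AlertRule[],
): { id: string; alerts: string[] }[] {
  const blocked: { id: string; alerts: string[] }[] = [];
  for (const id of deletedIds) {
    const alerts = rules
      .filter((rule) => runbookIdFromUrl(rule.runbookUrl) === id)
      .map((rule) => rule.alert);
    if (alerts.length > 0) blocked.push({ id, alerts });
  }
  return blocked;
}
```

A non-empty result would fail the MR and name the alerts that must be deleted or repointed first.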
Index
| Runbook | Trigger | Severity | Tier |
|---|---|---|---|
| api-error-rate-elevated.md | APIErrorRateElevated, APIErrorRateCritical | sev0 / sev1 | Full |
| deploy-silence.md | Operational procedure (not an alert) | n/a | n/a |
More runbooks land here as Section 17.6 alerts are deployed (per the Phase 1 implementation plan in Section 17.6.6).
Related
- `docs/PRD/17-engineering-standards.md` Section 17.6 - alerting & on-call structure
- `docs/postmortems/` - retrospectives that produce runbook updates
- `docs/ADR/0003-observability-stack-architecture.md` - observability stack rationale
- `docs/ADR/0004-agent-assisted-on-call.md` - L1 + L2 agent design and read-only boundaries
- `infra/observability/prometheus/rules/` - alert rule files (every alert must reference a runbook here)