Runbooks

Operational playbooks for production alerts and recurring procedures.

Per Section 17.6 of docs/PRD/17-engineering-standards.md, every alert must have a paired runbook. CI rejects MRs that add a PrometheusRule without a matching, schema-valid runbook in this directory.

Two audiences

A runbook serves both:

| Reader | What they need |
| --- | --- |
| Human at 3am | Calm, ordered steps. "What's broken, who's affected, what do I run, how do I know it worked." |
| L2 on-call agent (read-only) | Machine-parseable diagnostic queries it can execute, structured impact assessment, clear handoff conditions |

The structured contract below makes runbooks usable by both.

Runbook tiers

The contract scales with severity. Sev 0 and Sev 1 alerts need full runbooks, Sev 2 alerts need standard runbooks, and Sev 3 and Sev 4 alerts can be pointers.

| Tier | Required for | Required sections | Template |
| --- | --- | --- | --- |
| Full | Sev 0; Sev 1 | frontmatter, symptoms, impact, diagnose, mitigate, verify, postmortem-trigger, references | `_template-full.md` |
| Standard | Sev 2 | frontmatter, symptoms, impact, diagnose, mitigate, references | `_template-standard.md` |
| Pointer | Sev 3; Sev 4 | frontmatter, symptoms, impact, references | `_template-pointer.md` |

The CI gate enforces tier-appropriate requirements based on the severity_default declared in the frontmatter.
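As a rough illustration of the tier rule, the sketch below maps a severity_default value to its tier and reports which required sections a runbook is missing. The names `tierFor`, `REQUIRED_SECTIONS`, and `missingSections` are hypothetical and not taken from the actual CI gate; only the section lists come from the tier table.

```typescript
// Hypothetical sketch of the tier rule; not the real CI gate.
type Severity = "sev0" | "sev1" | "sev2" | "sev3" | "sev4";
type Tier = "full" | "standard" | "pointer";

// Required sections per tier, copied from the tier table.
const REQUIRED_SECTIONS: Record<Tier, readonly string[]> = {
  full: ["frontmatter", "symptoms", "impact", "diagnose", "mitigate", "verify", "postmortem-trigger", "references"],
  standard: ["frontmatter", "symptoms", "impact", "diagnose", "mitigate", "references"],
  pointer: ["frontmatter", "symptoms", "impact", "references"],
};

function tierFor(severity: Severity): Tier {
  if (severity === "sev0" || severity === "sev1") return "full";
  if (severity === "sev2") return "standard";
  return "pointer";
}

// Returns the required sections that are absent from `present`,
// in the order the tier table lists them.
function missingSections(severity: Severity, present: readonly string[]): string[] {
  const have = new Set(present);
  return REQUIRED_SECTIONS[tierFor(severity)].filter((section) => !have.has(section));
}

// A sev1 runbook written against the standard template lacks two sections.
console.log(
  missingSections("sev1", ["frontmatter", "symptoms", "impact", "diagnose", "mitigate", "references"]),
); // [ 'verify', 'postmortem-trigger' ]
```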

Frontmatter schema (machine contract)

id: <kebab-case-id-matching-alertname>      # required, unique
title: <Human-readable title>                 # required
domain: <one of 12 domains>                   # required, enum: identity, onboarding, accounts, billing, payouts, trading, risk-engine, audit, notifications, admin, platform, infra
severity_default: <sev0..sev4>                # required
linked_alerts: [<alert-name>, ...]            # required (≥1 for full / standard)
linked_slos: [<slo-id>, ...]                  # optional
owner: <gitlab-handle>                        # required
last_reviewed: <YYYY-MM-DD>                   # required
review_cadence_days: 30 | 90 | 180            # required (sev0=30, sev1=90, others=180)
agent_executable: true | false                # required

agent_executable: true means the L2 agent may execute the Diagnose queries autonomously (read-only). false means human-only.
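The schema above can be checked mechanically once the frontmatter has been parsed. The following is a minimal sketch, assuming YAML parsing happens elsewhere and yields a plain object; `validateFrontmatter` is a hypothetical name, the kebab-case and date patterns are assumptions about what the real gate accepts, and treating the severity-to-cadence mapping as a hard rule is one reading of the schema comment. The optional linked_slos field is not checked.

```typescript
// Hypothetical validator for an already-parsed frontmatter object.
const DOMAINS: string[] = ["identity", "onboarding", "accounts", "billing", "payouts", "trading",
  "risk-engine", "audit", "notifications", "admin", "platform", "infra"];
// Review cadence per severity: sev0=30, sev1=90, others=180.
const CADENCE_DAYS = new Map<string, number>([
  ["sev0", 30], ["sev1", 90], ["sev2", 180], ["sev3", 180], ["sev4", 180],
]);
// Assumption: kebab-case means lowercase alphanumerics joined by single hyphens.
const KEBAB = /^[a-z0-9]+(-[a-z0-9]+)*$/;
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

function validateFrontmatter(fm: Record<string, unknown>): string[] {
  const errors: string[] = [];
  const isText = (v: unknown): v is string => typeof v === "string" && v.length > 0;
  const { id, title, domain, owner } = fm;
  const severity = fm["severity_default"];
  const alerts = fm["linked_alerts"];
  const lastReviewed = fm["last_reviewed"];

  if (!(isText(id) && KEBAB.test(id))) errors.push("id must be kebab-case");
  if (!isText(title)) errors.push("title is required");
  if (!(isText(domain) && DOMAINS.includes(domain))) errors.push("domain must be one of the 12 domains");
  if (!isText(owner)) errors.push("owner is required");
  if (!(isText(lastReviewed) && ISO_DATE.test(lastReviewed))) errors.push("last_reviewed must be YYYY-MM-DD");
  if (typeof fm["agent_executable"] !== "boolean") errors.push("agent_executable must be true or false");

  const expectedCadence = isText(severity) ? CADENCE_DAYS.get(severity) : undefined;
  if (expectedCadence === undefined) {
    errors.push("severity_default must be sev0..sev4");
  } else if (fm["review_cadence_days"] !== expectedCadence) {
    errors.push(`review_cadence_days must be ${expectedCadence} for ${String(severity)}`);
  }

  // Full and standard runbooks (sev0..sev2) need at least one linked alert.
  const needsAlert = severity === "sev0" || severity === "sev1" || severity === "sev2";
  if (!Array.isArray(alerts)) errors.push("linked_alerts must be a list");
  else if (needsAlert && alerts.length === 0) errors.push("linked_alerts needs at least one alert");

  return errors;
}
```

An empty result means the frontmatter satisfies this sketch of the contract; each string in a non-empty result names one violated rule.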

Diagnose step format (machine contract)

The L2 agent extracts Diagnose blocks by parsing markdown headings and fenced code blocks. Authoring rules:

Each step gets an explicit ID: D1, D2, D3, …
Each step's heading is ### D<n> — <Short label>.
Each step contains exactly one fenced code block immediately under the heading.
The code block is tagged with the executor: promql, logql, bash, psql, or sql.
Each step ends with a "What this tells you" line: a one-sentence interpretation of the output.

Anti-pattern: queries embedded in prose. The agent cannot extract them. A well-formed step looks like this:

### D1 — Check current error rate by route

```promql
topk(10,
  sum by (route, code) (rate(http_requests_total{job="api",code=~"5.."}[5m]))
)
```

**What this tells you**: which route(s) are failing.
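To make the authoring rules concrete, here is a minimal sketch of how Diagnose steps could be pulled out of a runbook. It is not the L2 agent's actual parser; `extractDiagnoseSteps` and the `DiagnoseStep` shape are hypothetical, and the sketch assumes the interpretation line is written in bold as in the example step.

```typescript
// Hypothetical sketch of Diagnose-step extraction; the real parser may differ.
interface DiagnoseStep {
  id: string;             // e.g. "D1"
  label: string;          // short label from the heading
  executor: string;       // promql, logql, bash, psql, or sql
  query: string;          // body of the fenced code block
  interpretation: string; // text after "What this tells you"
}

const EXECUTORS = new Set(["promql", "logql", "bash", "psql", "sql"]);
const FENCE = "`".repeat(3);
// \u2014 is the dash the heading format places between the ID and the label.
const STEP_HEADING = /^### (D\d+) \u2014 (.+)$/;
const TELLS_YOU = /^\*\*What this tells you\*\*:\s*(.+)$/;

function extractDiagnoseSteps(markdown: string): DiagnoseStep[] {
  const lines = markdown.split("\n");
  const steps: DiagnoseStep[] = [];
  for (let i = 0; i < lines.length; i++) {
    const heading = STEP_HEADING.exec(lines[i]);
    if (heading === null) continue;
    // Only blank lines may separate the heading from its code block.
    let j = i + 1;
    while (j < lines.length && lines[j].trim() === "") j++;
    if (j >= lines.length || !lines[j].startsWith(FENCE)) continue;
    const executor = lines[j].slice(FENCE.length).trim();
    if (!EXECUTORS.has(executor)) continue;
    // Collect the query body up to the closing fence.
    const body: string[] = [];
    j++;
    while (j < lines.length && lines[j].trim() !== FENCE) {
      body.push(lines[j]);
      j++;
    }
    if (j >= lines.length) continue; // unterminated code block
    // Find the interpretation line before the next heading.
    let interpretation = "";
    for (let k = j + 1; k < lines.length && !lines[k].startsWith("#"); k++) {
      const match = TELLS_YOU.exec(lines[k]);
      if (match !== null) {
        interpretation = match[1];
        break;
      }
    }
    if (interpretation === "") continue; // step is incomplete, skip it
    steps.push({ id: heading[1], label: heading[2], executor, query: body.join("\n"), interpretation });
  }
  return steps;
}
```

A step whose query sits in a sentence rather than a fenced block never reaches `steps.push`, which is why prose-embedded queries are invisible to this kind of extraction.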

Mitigate step format

Each step gets an explicit ID: M1, M2, M3, …
Each step's heading is ### M<n> — <Short label>.
Each step declares Who can execute (anyone on call or founder only) and Side effects.
Each step contains commands in fenced code blocks.

In Phase 1, the L2 agent never auto-executes Mitigate steps. Phase 2 (L3) introduces constrained Mitigate execution; until then, mitigation is human-driven.

Authoring workflow

For a new alert:

1. Write the alert rule with a runbook_url: https://docs.astrixtrading.com/runbooks/<id> annotation.
2. Open the MR. CI fails: "missing runbook".
3. Copy the appropriate template (full / standard / pointer) to docs/runbooks/<id>.md.
4. Fill it in. The L2 agent can pre-fill the Diagnose section based on the alert's PromQL and relevant_log_filters annotation; let it.
5. Push. CI re-runs and validates frontmatter and structure.
6. The founder reviews the Mitigate section if it touches money-flow or destructive ops (gates-as-review per Section 17.3); otherwise the MR auto-merges on green.
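The "missing runbook" failure in this workflow comes down to mapping each runbook_url annotation to a file under docs/runbooks/. A minimal sketch of that mapping follows, assuming the URL prefix and path layout shown in the steps above; `runbookPathFromUrl` and `alertsMissingRunbook` are hypothetical helpers, not the real CI code, and the kebab-case pattern is an assumption.

```typescript
// Hypothetical helpers for the "missing runbook" check; not the real CI gate.
const RUNBOOK_URL_PREFIX = "https://docs.astrixtrading.com/runbooks/";

// Maps a runbook_url annotation to the file the runbook should live in,
// or null when the URL does not follow the expected shape.
function runbookPathFromUrl(runbookUrl: string): string | null {
  if (!runbookUrl.startsWith(RUNBOOK_URL_PREFIX)) return null;
  const id = runbookUrl.slice(RUNBOOK_URL_PREFIX.length).replace(/\/$/, "");
  // Assumption: ids are kebab-case, so anything else is rejected.
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(id)) return null;
  return `docs/runbooks/${id}.md`;
}

// Returns the names of alerts whose runbook file is not in existingFiles.
function alertsMissingRunbook(
  alerts: ReadonlyArray<{ name: string; runbookUrl: string }>,
  existingFiles: ReadonlySet<string>,
): string[] {
  return alerts
    .filter((alert) => {
      const path = runbookPathFromUrl(alert.runbookUrl);
      return path === null || !existingFiles.has(path);
    })
    .map((alert) => alert.name);
}

console.log(runbookPathFromUrl("https://docs.astrixtrading.com/runbooks/api-error-rate-elevated"));
// docs/runbooks/api-error-rate-elevated.md
```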

Drift prevention

Layer 4 audit (scripts/audit-runbook-freshness.ts) runs daily:

If today - last_reviewed > review_cadence_days: warn (Slack message to #alerts-monitoring)
If today - last_reviewed > 2 × review_cadence_days: hard-fail the next CI run that touches the runbook
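The two thresholds can be expressed as a small classification function. This is a sketch of the rule as stated here, not the contents of scripts/audit-runbook-freshness.ts; `classifyFreshness` and `daysBetween` are hypothetical names, and dates are assumed to be YYYY-MM-DD strings interpreted as UTC.

```typescript
// Hypothetical sketch of the freshness rule; the real audit script may differ.
type Freshness = "fresh" | "warn" | "hard-fail";

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Both dates are "YYYY-MM-DD" and parsed as UTC midnight,
// so the difference is a whole number of days.
function daysBetween(from: string, to: string): number {
  const start = Date.parse(`${from}T00:00:00Z`);
  const end = Date.parse(`${to}T00:00:00Z`);
  return Math.round((end - start) / MS_PER_DAY);
}

function classifyFreshness(lastReviewed: string, reviewCadenceDays: number, today: string): Freshness {
  const age = daysBetween(lastReviewed, today);
  if (age > 2 * reviewCadenceDays) return "hard-fail"; // more than twice the cadence
  if (age > reviewCadenceDays) return "warn";          // past the cadence
  return "fresh";
}

// A sev0 runbook (30-day cadence) last reviewed on 2025-01-01:
console.log(classifyFreshness("2025-01-01", 30, "2025-01-31")); // fresh (30 days old)
console.log(classifyFreshness("2025-01-01", 30, "2025-02-15")); // warn (45 days old)
console.log(classifyFreshness("2025-01-01", 30, "2025-03-15")); // hard-fail (73 days old)
```

Both comparisons are strict, so a runbook reviewed exactly review_cadence_days ago is still fresh.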

Cross-runbook patterns

Shared content lives in _partials/ and is referenced via a Markdown link plus a docs-portal include. This keeps maintenance cheap as the runbook count grows.

Lifecycle

| Stage | Trigger |
| --- | --- |
| Draft | Created with a new alert; agent may have pre-filled |
| Active | Merged; passes all gates |
| Stale | `last_reviewed` exceeds cadence; warning surfaces |
| Archived | Linked alert deleted; runbook moves to `archive/` with a header note |

CI prevents deleting a runbook while alerts still reference it.
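One way to picture the deletion guard: compare the runbooks an MR deletes against the runbooks that alert rules still point at. The sketch below assumes the alert-to-runbook references have already been collected; `blockedDeletions` and the `AlertRef` shape are hypothetical, not the actual CI check.

```typescript
// Hypothetical sketch of the deletion guard; not the real CI check.
interface AlertRef {
  alert: string;     // alert name from a rule file
  runbookId: string; // runbook id the alert references
}

// Returns one message per alert that still references a deleted runbook.
// An empty result means the deletion is safe under this sketch.
function blockedDeletions(deletedRunbookIds: readonly string[], alertRefs: readonly AlertRef[]): string[] {
  const deleted = new Set(deletedRunbookIds);
  return alertRefs
    .filter((ref) => deleted.has(ref.runbookId))
    .map((ref) => `${ref.runbookId} is still referenced by ${ref.alert}`);
}

console.log(
  blockedDeletions(
    ["api-error-rate-elevated"],
    [
      { alert: "APIErrorRateElevated", runbookId: "api-error-rate-elevated" },
      { alert: "APIErrorRateCritical", runbookId: "api-error-rate-elevated" },
    ],
  ),
);
```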

Index

| Runbook | Trigger | Severity | Tier |
| --- | --- | --- | --- |
| `api-error-rate-elevated.md` | `APIErrorRateElevated`, `APIErrorRateCritical` | sev0 / sev1 | Full |
| `deploy-silence.md` | Operational procedure (not an alert) | n/a | n/a |

More runbooks land here as Section 17.6 alerts are deployed (per the Phase 1 implementation plan in Section 17.6.6).

Related

docs/PRD/17-engineering-standards.md Section 17.6 - alerting & on-call structure
docs/postmortems/ - retrospectives that produce runbook updates
docs/ADR/0003-observability-stack-architecture.md - observability stack rationale
docs/ADR/0004-agent-assisted-on-call.md - L1 + L2 agent design and read-only boundaries
infra/observability/prometheus/rules/ - alert rule files (every alert must reference a runbook here)
