Runbooks

Operational playbooks for production alerts and recurring procedures.

Per Section 17.6 of docs/PRD/17-engineering-standards.md, every alert must have a paired runbook. CI rejects MRs that add a PrometheusRule without a matching, schema-valid runbook in this directory.

Two audiences

A runbook serves both:

| Reader | What they need |
| --- | --- |
| Human at 3am | Calm, ordered steps. "What's broken, who's affected, what do I run, how do I know it worked." |
| L2 on-call agent (read-only) | Machine-parseable diagnostic queries it can execute, structured impact assessment, clear handoff conditions |

The structured contract below makes runbooks usable by both.

Runbook tiers

The contract scales with severity. Sev 0 / Sev 1 alerts need full runbooks; Sev 2 alerts need standard runbooks; Sev 3 / Sev 4 can be pointers.

| Tier | Required for | Required sections | Template |
| --- | --- | --- | --- |
| Full | Sev 0; Sev 1 | frontmatter, symptoms, impact, diagnose, mitigate, verify, postmortem-trigger, references | _template-full.md |
| Standard | Sev 2 | frontmatter, symptoms, impact, diagnose, mitigate, references | _template-standard.md |
| Pointer | Sev 3; Sev 4 | frontmatter, symptoms, impact, references | _template-pointer.md |

The CI gate enforces tier-appropriate requirements based on the severity_default declared in the frontmatter.
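The tier rule above can be sketched as a small lookup. This is an illustration only, not the actual CI gate: the names `Severity`, `Tier`, `tierFor`, and `missingSections` are hypothetical, and the section lists are copied from the tier table.

```typescript
// Sketch only: these type and function names are assumptions, not the real CI gate.
type Severity = "sev0" | "sev1" | "sev2" | "sev3" | "sev4";
type Tier = "full" | "standard" | "pointer";

// Required sections per tier, copied from the tier table.
const REQUIRED_SECTIONS: Record<Tier, readonly string[]> = {
  full: [
    "frontmatter", "symptoms", "impact", "diagnose",
    "mitigate", "verify", "postmortem-trigger", "references",
  ],
  standard: ["frontmatter", "symptoms", "impact", "diagnose", "mitigate", "references"],
  pointer: ["frontmatter", "symptoms", "impact", "references"],
};

// Maps the frontmatter's severity_default to the runbook tier it requires.
function tierFor(severity: Severity): Tier {
  switch (severity) {
    case "sev0":
    case "sev1":
      return "full";
    case "sev2":
      return "standard";
    case "sev3":
    case "sev4":
      return "pointer";
  }
}

// Returns the required sections that are absent from a runbook.
function missingSections(severity: Severity, present: readonly string[]): string[] {
  return REQUIRED_SECTIONS[tierFor(severity)].filter((s) => present.indexOf(s) === -1);
}
```

For example, a Sev 1 runbook that contains only the Standard sections would be reported as missing `verify` and `postmortem-trigger`.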

Frontmatter schema (machine contract)

id: <kebab-case-id-matching-alertname> # required, unique
title: <Human-readable title> # required
domain: <one of 12 domains> # required, enum: identity, onboarding, accounts, billing, payouts, trading, risk-engine, audit, notifications, admin, platform, infra
severity_default: <sev0..sev4> # required
linked_alerts: [<alert-name>, ...] # required (≥1 for full / standard)
linked_slos: [<slo-id>, ...] # optional
owner: <gitlab-handle> # required
last_reviewed: <YYYY-MM-DD> # required
review_cadence_days: 30 | 90 | 180 # required (sev0=30, sev1=90, others=180)
agent_executable: true | false # required

agent_executable: true means the L2 agent may execute the Diagnose queries autonomously (read-only). false means human-only.
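A minimal sketch of how the frontmatter schema might be checked, assuming the YAML has already been parsed into a plain object and that `last_reviewed` arrives as a string. The function name `validateFrontmatter` and the error messages are hypothetical. The sketch checks only the `review_cadence_days` enum, not the per-severity mapping in the schema comment.

```typescript
// Hypothetical validator; the real CI schema check may differ.
const DOMAINS = [
  "identity", "onboarding", "accounts", "billing", "payouts", "trading",
  "risk-engine", "audit", "notifications", "admin", "platform", "infra",
];
const SEVERITIES = ["sev0", "sev1", "sev2", "sev3", "sev4"];
const CADENCES = [30, 90, 180];

const isNonEmptyString = (v: unknown): v is string =>
  typeof v === "string" && v.length > 0;

const isStringArray = (v: unknown): v is string[] =>
  Array.isArray(v) && v.every(isNonEmptyString);

// Returns a list of human-readable problems; an empty list means the frontmatter passed.
function validateFrontmatter(fm: Record<string, unknown>): string[] {
  const errors: string[] = [];

  const id = fm["id"];
  if (!isNonEmptyString(id) || !/^[a-z0-9]+(-[a-z0-9]+)*$/.test(id)) {
    errors.push("id: required, kebab-case");
  }
  if (!isNonEmptyString(fm["title"])) errors.push("title: required");

  const domain = fm["domain"];
  if (!isNonEmptyString(domain) || DOMAINS.indexOf(domain) === -1) {
    errors.push("domain: must be one of the 12 domains");
  }

  const severity = fm["severity_default"];
  if (!isNonEmptyString(severity) || SEVERITIES.indexOf(severity) === -1) {
    errors.push("severity_default: must be sev0..sev4");
  }

  // Full and standard runbooks (sev0..sev2) need at least one linked alert.
  const alerts = fm["linked_alerts"];
  const needsAlert = severity === "sev0" || severity === "sev1" || severity === "sev2";
  if (!isStringArray(alerts) || (needsAlert && alerts.length === 0)) {
    errors.push("linked_alerts: required list (at least one for full / standard)");
  }

  const slos = fm["linked_slos"];
  if (slos !== undefined && !isStringArray(slos)) errors.push("linked_slos: must be a list");

  if (!isNonEmptyString(fm["owner"])) errors.push("owner: required");

  // Format check plus a parse sanity check; not a full calendar validation.
  const reviewed = fm["last_reviewed"];
  if (
    !isNonEmptyString(reviewed) ||
    !/^\d{4}-\d{2}-\d{2}$/.test(reviewed) ||
    isNaN(Date.parse(reviewed))
  ) {
    errors.push("last_reviewed: must be YYYY-MM-DD");
  }

  const cadence = fm["review_cadence_days"];
  if (typeof cadence !== "number" || CADENCES.indexOf(cadence) === -1) {
    errors.push("review_cadence_days: must be 30, 90, or 180");
  }

  if (typeof fm["agent_executable"] !== "boolean") {
    errors.push("agent_executable: must be true or false");
  }
  return errors;
}
```

Returning every problem at once, rather than failing on the first, is a design choice that suits a CI log: the author can fix the whole frontmatter in one push.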

Diagnose step format (machine contract)

The L2 agent extracts Diagnose blocks by parsing markdown headings and fenced code blocks. Authoring rules:

  1. Each step gets an explicit ID: D1, D2, D3, …
  2. Each step's heading is ### D<n> — <Short label>.
  3. Each step contains exactly one fenced code block immediately under the heading.
  4. The code block is tagged with the executor: promql, logql, bash, psql, or sql.
  5. Each step ends with a "What this tells you:" line giving a one-sentence interpretation of the output.

Anti-pattern: prose-embedded queries. The agent cannot extract them. A correctly formatted step looks like this:

### D1 — Check current error rate by route

​```promql
topk(10,
sum by (route, code) (rate(http_requests_total{job="api",code=~"5.."}[5m]))
)
​```

**What this tells you**: which route(s) are failing.
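To show why the authoring rules matter, here is a rough sketch of how a Diagnose extractor could work under those rules. It is not the L2 agent's actual parser: `DiagnoseStep` and `extractDiagnoseSteps` are hypothetical names, and the sketch assumes the heading separator is an em dash (U+2014) and that fences use exactly three backticks.

```typescript
// Hypothetical extractor; assumes the authoring rules above are followed exactly.
interface DiagnoseStep {
  id: string;             // e.g. "D1"
  label: string;          // short label from the heading
  executor: string;       // fence tag: promql, logql, bash, psql, or sql
  query: string;          // body of the fenced code block
  interpretation: string; // text after "What this tells you:"; empty if absent
}

const EXECUTORS = ["promql", "logql", "bash", "psql", "sql"];
const HEADING = /^### (D\d+) \u2014 (.+)$/; // \u2014 is the em dash in the heading
const FENCE_OPEN = /^```(\w+)\s*$/;
const INTERPRETATION = /^\*\*What this tells you\*\*:\s*(.+)$/;

function extractDiagnoseSteps(markdown: string): DiagnoseStep[] {
  const lines = markdown.split(/\r?\n/);
  const line = (k: number): string => lines[k] ?? "";
  const steps: DiagnoseStep[] = [];
  let i = 0;

  while (i < lines.length) {
    const heading = HEADING.exec(line(i));
    i++;
    if (!heading) continue;

    // The fenced block must come immediately under the heading (blank lines allowed).
    while (i < lines.length && line(i).trim() === "") i++;
    const fence = FENCE_OPEN.exec(line(i));
    if (!fence || EXECUTORS.indexOf(fence[1]) === -1) continue; // malformed step: skip it
    i++;

    const body: string[] = [];
    while (i < lines.length && line(i).trim() !== "```") {
      body.push(line(i));
      i++;
    }
    if (i >= lines.length) break; // unterminated fence: stop rather than guess
    i++; // consume the closing fence

    // Scan for the interpretation line, stopping at the next heading.
    let interpretation = "";
    while (i < lines.length && !/^#{1,6} /.test(line(i))) {
      const match = INTERPRETATION.exec(line(i));
      i++;
      if (match) {
        interpretation = match[1];
        break;
      }
    }

    steps.push({
      id: heading[1],
      label: heading[2],
      executor: fence[1],
      query: body.join("\n"),
      interpretation,
    });
  }
  return steps;
}
```

A query written inline in a paragraph never matches `FENCE_OPEN`, which is the mechanical reason prose-embedded queries are invisible to this kind of extractor.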

Mitigate step format

  1. Each step gets an explicit ID: M1, M2, M3, …
  2. Each step's heading is ### M<n> — <Short label>.
  3. Each step declares Who can execute (anyone on call or founder only) and Side effects.
  4. Each step contains commands in fenced code blocks.

The L2 agent at Phase 1 will never auto-execute Mitigate steps. Phase 2 (L3) introduces constrained Mitigate execution; until then, mitigation is human-driven.

Authoring workflow

For a new alert:

  1. Write the alert rule with a runbook_url: https://docs.astrixtrading.com/runbooks/<id> annotation.
  2. Open the MR. CI fails: "missing runbook".
  3. Copy the appropriate template (full / standard / pointer) to docs/runbooks/<id>.md.
  4. Fill it in. The L2 agent can pre-fill the Diagnose section based on the alert's PromQL and relevant_log_filters annotation — let it.
  5. Push. CI re-runs and validates frontmatter + structure.
  6. The founder reviews the Mitigate section if it touches money-flow or destructive ops (gates-as-review per Section 17.3); otherwise the MR auto-merges on green.

Drift prevention

The Layer 4 audit (scripts/audit-runbook-freshness.ts) runs daily:

  • If today - last_reviewed > review_cadence_days: warn (Slack message to #alerts-monitoring)
  • If today - last_reviewed > 2 × review_cadence_days: hard-fail next CI run touching the runbook
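The two thresholds above amount to a three-way classification. The sketch below is an illustration, not the contents of `scripts/audit-runbook-freshness.ts`: `classifyFreshness`, `daysBetween`, and the `"fresh"` label are hypothetical, and dates are assumed to be `YYYY-MM-DD` strings compared in UTC.

```typescript
// Sketch only; the real audit script may compute age differently.
type Freshness = "fresh" | "warn" | "hard-fail";

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days from one YYYY-MM-DD date to another.
// Date.parse treats date-only ISO strings as UTC midnight, so no timezone drift.
function daysBetween(fromIso: string, toIso: string): number {
  return Math.floor((Date.parse(toIso) - Date.parse(fromIso)) / MS_PER_DAY);
}

function classifyFreshness(
  lastReviewed: string,
  reviewCadenceDays: number,
  today: string,
): Freshness {
  const age = daysBetween(lastReviewed, today);
  // Check the stricter threshold first, since it implies the weaker one.
  if (age > 2 * reviewCadenceDays) return "hard-fail";
  if (age > reviewCadenceDays) return "warn";
  return "fresh";
}
```

With a 30-day cadence and `last_reviewed: 2025-01-01`, the runbook is still fresh on 2025-01-31 (age 30), warns from 2025-02-01 (age 31), and hard-fails from 2025-03-03 (age 61). Both comparisons are strict, matching the `>` in the rules above.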

Cross-runbook patterns

Shared content lives in _partials/ and is referenced via a Markdown link plus a docs-portal include. This keeps maintenance cheap as the runbook count grows.

Lifecycle

| Stage | Trigger |
| --- | --- |
| Draft | Created with a new alert; agent may have pre-filled |
| Active | Merged; passes all gates |
| Stale | last_reviewed exceeds cadence; warning surfaces |
| Archived | Linked alert deleted; runbook moves to archive/ with a header note |

CI prevents deleting a runbook while alerts still reference it.
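One way such a deletion guard could work is to compare the runbook IDs removed in an MR against the `runbook_url` annotations of the remaining alert rules. The sketch below is an assumption about the mechanism, not the actual CI job: `AlertRule`, `BlockedDeletion`, `runbookIdFromUrl`, and `findBlockedDeletions` are hypothetical, and it assumes the URL ends in `/runbooks/<id>`.

```typescript
// Hypothetical shapes: the source does not specify how CI represents alert rules.
interface AlertRule {
  alert: string;      // alert name
  runbookUrl: string; // value of the runbook_url annotation
}

interface BlockedDeletion {
  runbookId: string;
  referencedBy: string[]; // alerts that still point at the runbook
}

// Extracts <id> from ".../runbooks/<id>"; returns null when the URL has no such suffix.
function runbookIdFromUrl(url: string): string | null {
  const match = /\/runbooks\/([a-z0-9-]+)\/?$/.exec(url);
  return match ? match[1] : null;
}

// Lists every deleted runbook that at least one alert rule still references.
function findBlockedDeletions(
  deletedIds: readonly string[],
  rules: readonly AlertRule[],
): BlockedDeletion[] {
  const blocked: BlockedDeletion[] = [];
  for (const id of deletedIds) {
    const referencedBy = rules
      .filter((rule) => runbookIdFromUrl(rule.runbookUrl) === id)
      .map((rule) => rule.alert);
    if (referencedBy.length > 0) blocked.push({ runbookId: id, referencedBy });
  }
  return blocked;
}
```

A non-empty result would fail the pipeline and name the alerts to delete or repoint first.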

Index

| Runbook | Trigger | Severity | Tier |
| --- | --- | --- | --- |
| api-error-rate-elevated.md | APIErrorRateElevated, APIErrorRateCritical | sev0 / sev1 | Full |
| deploy-silence.md | Operational procedure (not an alert) | n/a | n/a |

More runbooks land here as Section 17.6 alerts are deployed (per the Phase 1 implementation plan in Section 17.6.6).