Postmortems
Blameless retrospectives for production incidents. The artifact that turns an incident into permanent improvements.
This directory holds postmortems per Section 17.6 of docs/PRD/17-engineering-standards.md. The L2 on-call agent auto-drafts the skeleton for mandatory-trigger incidents within 1h of incident close; the assigned author finishes the prose.
Posture
- Blameless. Focus on systems and processes, not individuals.
- Honest. Don't sanitize for hypothetical future readers. The audience is us, learning.
- Time-boxed. Refined within 5 business days while memory is fresh.
- Action-oriented. Every postmortem closes with owned, dated action items that become GitLab issues.
When to write a postmortem
Hard triggers (mandatory)
| Trigger | Tier |
|---|---|
| Any Sev 0 incident | Full |
| Sev 1 sustained > 15 min | Full |
| Any SLO violation (any of the 7 SLOs in infra/observability/slos/) | Full |
| Money-flow impact reached production | Full |
| Data integrity event (audit gap, hash-chain break) | Full |
| Security event | Full (_template-security.md) |
| Repeat incident (same root cause as one in past 30 days) | Full |
| Migration-induced incident | Full |
| Vault unsealed=0 episode (any duration) | Full |
| Aurora failover (planned or not) | Standard |
| Sev 1 ≤ 15 min | Standard |
Soft triggers (founder discretion)
| Trigger | Tier |
|---|---|
| Sev 2 sustained > 1h | Standard |
| Near-miss (would-have-been Sev 0 but caught in time) | Note |
| Deploy required emergency rollback | Standard |
| Alert fired that revealed a missing capability | Note (gate-escape per Section 17.3.10) |
Not required
- Sev 3 / Sev 4 routine alerts that auto-resolved
- Single flaky test or single transient blip with no user impact
- Vendor-induced outage where we had no agency (track in vendor incident log instead)
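The hard-trigger table above can be read as a small decision function. The sketch below is illustrative only: the Incident shape and its field names are assumptions, not the real frontmatter schema, and soft triggers are left out because they are decided at founder discretion.

```typescript
// Hypothetical incident shape; field names are illustrative assumptions.
interface Incident {
  severity: 0 | 1 | 2 | 3 | 4;
  durationMinutes: number;
  sloViolated?: boolean;
  moneyFlowImpact?: boolean;
  dataIntegrityEvent?: boolean;
  securityImpact?: boolean;
  repeatWithin30Days?: boolean;
  migrationInduced?: boolean;
  vaultUnsealed?: boolean;
  auroraFailover?: boolean;
}

// null means no hard trigger matched; a soft trigger may still apply.
type MandatoryTier = "full" | "standard" | null;

function mandatoryTier(i: Incident): MandatoryTier {
  const full =
    i.severity === 0 ||
    (i.severity === 1 && i.durationMinutes > 15) ||
    i.sloViolated === true ||
    i.moneyFlowImpact === true ||
    i.dataIntegrityEvent === true ||
    i.securityImpact === true ||
    i.repeatWithin30Days === true ||
    i.migrationInduced === true ||
    i.vaultUnsealed === true;
  if (full) return "full";
  // Any remaining Sev 1 lasted 15 minutes or less.
  if (i.auroraFailover === true || i.severity === 1) return "standard";
  return null;
}
```

Full-tier checks run first so that an incident matching both a Full and a Standard trigger (for example, an Aurora failover that also violated an SLO) lands in the stricter tier.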
Templates
| Tier | Use for | Template |
|---|---|---|
| Full | Sev 0; Sev 1 > 15 min; SLO violation; money-flow; regulatory; data integrity | _template-full.md |
| Standard | Sev 1 ≤ 15 min; Aurora failover; soft-trigger items | _template-standard.md |
| Note | Near-misses; vendor-induced; gate-escape captures | _template-note.md |
| Security | Any incident with security_impact: true | _template-security.md |
Naming
YYYY-MM-DD-<short-incident-slug>.md. Examples:
- 2026-08-14-aurora-failover-stuck-replica.md
- 2026-09-03-payout-stripe-webhook-backlog.md
- 2026-10-21-security-vault-token-rotation-miss.md
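A filename check for this convention might look like the sketch below. It assumes the slug is lowercase kebab-case, which matches the examples but is not stated as a rule; isValidPostmortemName is a hypothetical helper, not part of the existing scripts.

```typescript
// Assumption: slug is lowercase letters and digits separated by single hyphens.
const POSTMORTEM_NAME = /^(\d{4})-(\d{2})-(\d{2})-[a-z0-9]+(?:-[a-z0-9]+)*\.md$/;

function isValidPostmortemName(filename: string): boolean {
  const m = POSTMORTEM_NAME.exec(filename);
  if (m === null) return false;
  const year = Number(m[1]);
  const month = Number(m[2]);
  const day = Number(m[3]);
  // Round-trip through Date to reject impossible dates such as 02-30.
  const d = new Date(Date.UTC(year, month - 1, day));
  return (
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day
  );
}
```

Template files such as _template-full.md fail the pattern because they do not start with a date, so a directory scan built on this check would skip them.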
Lifecycle
| Stage | Owner | SLA |
|---|---|---|
| status: draft | Agent (auto, full / standard) or human (note) | 1h after incident close (mandatory triggers) |
| status: refined | Author writes prose, accepts/rejects agent suggestions | 5 business days |
| status: reviewed | Author reviews own draft after 24h cool-off (peer review when team grows) | 7 business days |
| status: closed | All action items resolved or formally re-deferred with new dates | 30 days |
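The stages above can be modeled as an ordered list of statuses. The sketch below assumes a postmortem only ever moves forward one stage at a time, which the table implies but does not state; canTransition, canClose, and the TrackedItem shape are hypothetical names.

```typescript
type Status = "draft" | "refined" | "reviewed" | "closed";

const STATUS_ORDER: readonly Status[] = ["draft", "refined", "reviewed", "closed"];

// Assumption: only single forward steps are allowed (no skipping, no reopening).
function canTransition(from: Status, to: Status): boolean {
  return STATUS_ORDER.indexOf(to) - STATUS_ORDER.indexOf(from) === 1;
}

// Hypothetical view of an action item for the close check.
interface TrackedItem {
  resolved: boolean;
  redeferredTo?: string; // new due date if formally re-deferred
}

// Closing requires every item to be resolved or re-deferred with a new date.
function canClose(items: TrackedItem[]): boolean {
  return items.every(
    (item) => item.resolved || (item.redeferredTo !== undefined && item.redeferredTo !== "")
  );
}
```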
Action items
Every action item:
- Has an owner (specific GitLab handle).
- Has a due date.
- Has a type: corrective (fixes the immediate cause), preventive (prevents the class of cause), or doc (improves a runbook or standard).
- Becomes a tracked GitLab issue with the postmortem-action label.
- References the postmortem ID in the issue description.
Action items without owners and due dates do not count.
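As a rough illustration of the rule above, an action item could be typed and filtered as follows. The ActionItem field names and the ISO YYYY-MM-DD due-date format are assumptions for the sketch, not the actual frontmatter schema.

```typescript
type ActionType = "corrective" | "preventive" | "doc";

// Hypothetical shape; the real schema may name these fields differently.
interface ActionItem {
  title: string;
  owner?: string; // GitLab handle
  due?: string;   // assumed ISO date, YYYY-MM-DD
  type?: ActionType;
}

// An item counts only if it has both a non-empty owner and a due date.
function countsAsActionItem(item: ActionItem): boolean {
  const hasOwner = typeof item.owner === "string" && item.owner.trim() !== "";
  const hasDue = typeof item.due === "string" && /^\d{4}-\d{2}-\d{2}$/.test(item.due);
  return hasOwner && hasDue;
}

function countingItems(items: ActionItem[]): ActionItem[] {
  return items.filter(countsAsActionItem);
}
```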
CI gates
Postmortem MRs are validated by scripts/check-postmortems.ts. Hard gates:
- Schema: frontmatter validates against schema (required fields, valid enums)
- Tier sections: required H2 sections present per tier
- Action items: Full / Standard tiers have ≥ 1 action item with owner + due date
- GitLab issue linkage: Full / Standard tiers reference issues with the postmortem-action label (required at status: reviewed)
Scheduled audits (scripts/audit-postmortems.ts):
- Postmortem written within 5 business days of incident close (warns at day 4; hard fail if status: draft past day 7)
- Action items closed within 30 days (daily report to #alerts-monitoring)
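The draft-age audit could be approximated as below. This is not the contents of scripts/audit-postmortems.ts; it is a sketch that assumes "day 4" and "day 7" are both counted in business days, that weekends are the only non-business days (no holiday calendar), and that dates are compared in UTC. The function names are hypothetical.

```typescript
// Counts Mon-Fri days after `start`, up to and including `end` (UTC dates).
function businessDaysBetween(start: Date, end: Date): number {
  const cursor = new Date(
    Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), start.getUTCDate())
  );
  const last = Date.UTC(end.getUTCFullYear(), end.getUTCMonth(), end.getUTCDate());
  let count = 0;
  while (cursor.getTime() < last) {
    cursor.setUTCDate(cursor.getUTCDate() + 1);
    const dow = cursor.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (dow !== 0 && dow !== 6) count++;
  }
  return count;
}

type AuditResult = "ok" | "warn" | "fail";

// Only drafts are audited here: warn from business day 4, fail past day 7.
function auditDraftAge(status: string, incidentClosed: Date, now: Date): AuditResult {
  if (status !== "draft") return "ok";
  const age = businessDaysBetween(incidentClosed, now);
  if (age > 7) return "fail";
  if (age >= 4) return "warn";
  return "ok";
}
```

For an incident closed on Friday 2026-08-14, this sketch starts warning on Thursday 2026-08-20 and fails from Wednesday 2026-08-26 if the postmortem is still a draft.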
Money-flow / regulatory / security addenda
When money_flow_impact: true, the postmortem must include a ## Money-flow accounting section (dollars at risk, recovered, lost, refunded; customers affected).
When regulatory_impact: true, the postmortem must include a ## Regulatory disclosure assessment section (counterparties, deadlines, owners, status).
When security_impact: true, use _template-security.md which has dedicated sections for breach scope, data exposure, disclosure decisions, and law enforcement contact.
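The addenda rules above amount to a mapping from frontmatter flags to required H2 headings. A minimal sketch follows, assuming the flags are plain booleans in the frontmatter and that headings must match the text above exactly; requiredAddenda, missingAddenda, and usesSecurityTemplate are hypothetical helpers.

```typescript
interface ImpactFlags {
  money_flow_impact?: boolean;
  regulatory_impact?: boolean;
  security_impact?: boolean;
}

// Headings that must appear in the postmortem body for the given flags.
function requiredAddenda(flags: ImpactFlags): string[] {
  const required: string[] = [];
  if (flags.money_flow_impact === true) required.push("## Money-flow accounting");
  if (flags.regulatory_impact === true) required.push("## Regulatory disclosure assessment");
  return required;
}

// Returns the required headings that are absent from the markdown body.
function missingAddenda(body: string, flags: ImpactFlags): string[] {
  const lines = new Set(body.split(/\r?\n/).map((line) => line.trim()));
  return requiredAddenda(flags).filter((heading) => !lines.has(heading));
}

// Security incidents switch templates instead of adding a section.
function usesSecurityTemplate(flags: ImpactFlags): boolean {
  return flags.security_impact === true;
}
```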
Index
(No incidents yet. This index will populate as incidents occur.)
| Date | Incident | Severity | SLO impact | Postmortem |
|---|---|---|---|---|
| - | - | - | - | - |
Related
- docs/PRD/17-engineering-standards.md Section 17.6 - severity definitions, on-call structure, agent role
- docs/runbooks/ - per-alert runbooks (where mitigations live)
- docs/ADR/0004-agent-assisted-on-call.md - rationale for L1 + L2 read-only agent
- infra/observability/slos/ - SLO definitions whose violation triggers postmortems