Postmortems

Blameless retrospectives for production incidents. The artifact that turns an incident into permanent improvements.

This directory holds postmortems per Section 17.6 of `docs/PRD/17-engineering-standards.md`. The L2 on-call agent auto-drafts the skeleton for mandatory-trigger incidents within 1 h of incident close; the assigned author finishes the prose.

Posture

Blameless. Focus on systems and processes, not individuals.
Honest. Don't sanitize for hypothetical future readers. The audience is us, learning.
Time-boxed. Refined within 5 business days while memory is fresh.
Action-oriented. Every postmortem closes with owned, dated action items that become GitLab issues.

When to write a postmortem

Hard triggers (mandatory)

Trigger	Tier
Any Sev 0 incident	Full
Sev 1 sustained > 15 min	Full
Any SLO violation (any of the 7 SLOs in `infra/observability/slos/`)	Full
Money-flow impact reached production	Full
Data integrity event (audit gap, hash-chain break)	Full
Security event	Full (`_template-security.md`)
Repeat incident (same root cause as an incident in the past 30 days)	Full
Migration-induced incident	Full
Vault unsealed=0 episode (any duration)	Full
Aurora failover (planned or not)	Standard
Sev 1 ≤ 15 min	Standard
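
The hard-trigger table above can be read as a small decision function. The following is a minimal sketch, not the logic of the real tooling: the `Incident` shape, its field names, and `mandatoryPostmortem` are all hypothetical and chosen only to mirror the rows of the table.

```typescript
// Hypothetical incident shape: every field name here is an assumption
// made for illustration, not the project's real schema.
type Tier = "Full" | "Standard";

interface Incident {
  severity: 0 | 1 | 2 | 3 | 4;
  durationMinutes: number;
  sloViolation: boolean;
  moneyFlowImpact: boolean;
  dataIntegrityEvent: boolean;
  securityImpact: boolean;
  repeatWithin30Days: boolean;
  migrationInduced: boolean;
  vaultUnsealedZero: boolean;
  auroraFailover: boolean;
}

interface Requirement {
  tier: Tier;
  template: string;
}

// Returns the mandatory tier and template, or null when no hard trigger applies
// (soft triggers are left to founder discretion and are not modeled here).
function mandatoryPostmortem(i: Incident): Requirement | null {
  // Security events are Full tier and use the dedicated security template.
  if (i.securityImpact) {
    return { tier: "Full", template: "_template-security.md" };
  }
  const full =
    i.severity === 0 ||
    (i.severity === 1 && i.durationMinutes > 15) ||
    i.sloViolation ||
    i.moneyFlowImpact ||
    i.dataIntegrityEvent ||
    i.repeatWithin30Days ||
    i.migrationInduced ||
    i.vaultUnsealedZero;
  if (full) {
    return { tier: "Full", template: "_template-full.md" };
  }
  // Any Sev 1 that reaches this point lasted 15 minutes or less.
  if (i.auroraFailover || i.severity === 1) {
    return { tier: "Standard", template: "_template-standard.md" };
  }
  return null;
}
```

Full-tier conditions are checked before Standard ones, so an Aurora failover that also violates an SLO resolves to Full.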

Soft triggers (founder discretion)

Trigger	Tier
Sev 2 sustained > 1h	Standard
Near-miss (would-have-been Sev 0 but caught in time)	Note
Deploy required emergency rollback	Standard
Alert fired that revealed a missing capability	Note (gate-escape per Section 17.3.10)

Not required

Sev 3 / Sev 4 routine alerts that auto-resolved
Single flaky test or single transient blip with no user impact
Vendor-induced outage where we had no agency (track in vendor incident log instead)

Templates

Tier	Use for	Template
Full	Sev 0; Sev 1 > 15 min; SLO violation; money-flow; regulatory; data integrity	`_template-full.md`
Standard	Sev 1 ≤ 15 min; Aurora failover; soft-trigger items	`_template-standard.md`
Note	Near-misses; vendor-induced; gate-escape captures	`_template-note.md`
Security	Any incident with `security_impact: true`	`_template-security.md`

Naming

`YYYY-MM-DD-<short-incident-slug>.md`. Examples:

2026-08-14-aurora-failover-stuck-replica.md
2026-09-03-payout-stripe-webhook-backlog.md
2026-10-21-security-vault-token-rotation-miss.md
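
A filename check for this convention might look like the sketch below. It assumes slugs are lowercase alphanumerics separated by single hyphens, which is inferred from the examples above and not stated as a rule; `parsePostmortemFilename` is a hypothetical helper name.

```typescript
// Assumed slug rule: lowercase letters and digits, hyphen-separated.
const POSTMORTEM_FILENAME = /^(\d{4})-(\d{2})-(\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$/;

interface ParsedName {
  date: string; // YYYY-MM-DD
  slug: string;
}

// Returns the date and slug, or null when the name does not follow the convention.
function parsePostmortemFilename(name: string): ParsedName | null {
  const m = POSTMORTEM_FILENAME.exec(name);
  if (m === null) {
    return null;
  }
  const year = Number(m[1]);
  const month = Number(m[2]);
  const day = Number(m[3]);
  // Round-trip through Date.UTC to reject impossible dates such as 2026-02-30.
  const d = new Date(Date.UTC(year, month - 1, day));
  const valid =
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day;
  if (!valid) {
    return null;
  }
  return { date: `${m[1]}-${m[2]}-${m[3]}`, slug: m[4] };
}
```

Template files such as `_template-full.md` do not match the pattern, so a check like this would skip them without a special case.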

Lifecycle

Stage	Owner	SLA
`status: draft`	Agent (auto, full / standard) or human (note)	1 h after incident close (mandatory triggers)
`status: refined`	Author writes prose, accepts/rejects agent suggestions	5 business days
`status: reviewed`	Author reviews own draft after 24h cool-off (peer review when team grows)	7 business days
`status: closed`	All action items resolved or formally re-deferred with new dates	30 days
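
The SLAs in the table can be turned into concrete deadlines. The sketch below makes several assumptions that the table does not state: every SLA is measured from incident close, business days are Monday to Friday in UTC with no holiday calendar, and the 30 days for `status: closed` are calendar days. The function names are hypothetical.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Adds N business days (Mon-Fri, UTC), keeping the time of day. Holidays are ignored.
function addBusinessDays(start: Date, days: number): Date {
  const d = new Date(start.getTime());
  let remaining = days;
  while (remaining > 0) {
    d.setUTCDate(d.getUTCDate() + 1);
    const dow = d.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (dow !== 0 && dow !== 6) {
      remaining -= 1;
    }
  }
  return d;
}

interface LifecycleDeadlines {
  draft: Date;
  refined: Date;
  reviewed: Date;
  closed: Date;
}

// Assumption: all four SLAs start counting at incident close.
function lifecycleDeadlines(incidentClose: Date): LifecycleDeadlines {
  return {
    draft: new Date(incidentClose.getTime() + HOUR_MS),
    refined: addBusinessDays(incidentClose, 5),
    reviewed: addBusinessDays(incidentClose, 7),
    closed: new Date(incidentClose.getTime() + 30 * DAY_MS),
  };
}
```

For an incident closed on Friday 2026-08-14 at 12:00 UTC, this yields a refined deadline of 2026-08-21 and a reviewed deadline of 2026-08-25, since the weekends in between are skipped.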

Action items

Every action item:

Has an owner (specific GitLab handle).
Has a due date.
Has a type: corrective (fixes the immediate cause), preventive (prevents the class of cause), or doc (improves a runbook or standard).
Becomes a tracked GitLab issue with the `postmortem-action` label.
References the postmortem ID in the issue description.

Action items without owners and due dates do not count.
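
As an illustration of the "do not count" rule, the sketch below filters a list down to the items that have both an owner and a due date. The `ActionItem` shape and the `YYYY-MM-DD` due-date format are assumptions; the real frontmatter schema may differ.

```typescript
type ActionItemType = "corrective" | "preventive" | "doc";

// Hypothetical shape: field names are illustrative, not the real schema.
interface ActionItem {
  title: string;
  type: ActionItemType;
  owner?: string; // GitLab handle
  due?: string;   // assumed to be written as YYYY-MM-DD
}

// An item counts only when it has a non-empty owner and a parseable due date.
function hasOwnerAndDueDate(item: ActionItem): boolean {
  const owner = (item.owner ?? "").trim();
  const due = (item.due ?? "").trim();
  return (
    owner.length > 0 &&
    /^\d{4}-\d{2}-\d{2}$/.test(due) &&
    !isNaN(Date.parse(due))
  );
}

function countableActionItems(items: ActionItem[]): ActionItem[] {
  return items.filter(hasOwnerAndDueDate);
}
```

A Full or Standard postmortem for which `countableActionItems` returns an empty list has, in effect, no action items.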

CI gates

Postmortem MRs are validated by `scripts/check-postmortems.ts`. Hard gates:

Schema: frontmatter validates against the schema (required fields, valid enums)
Tier sections: required H2 sections present per tier
Action items: Full / Standard tiers have ≥ 1 action item with owner + due date
GitLab issue linkage: Full / Standard tiers reference issues with the `postmortem-action` label (required at `status: reviewed`)

Scheduled audits (`scripts/audit-postmortems.ts`):

Postmortem written within 5 business days of incident close (warns at day 4; hard fail if `status: draft` past day 7)
Action items closed within 30 days (daily report to #alerts-monitoring)
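
The draft-age audit could be sketched as follows. This is one reading of the rule, not the behavior of `scripts/audit-postmortems.ts`: it assumes "day N" means N business days (Monday to Friday, UTC, no holidays) after incident close, that the warning starts at day 4, and that the failure applies only once a draft is older than 7 business days. Both function names are hypothetical.

```typescript
type AuditResult = "ok" | "warn" | "fail";

// Counts weekdays after the start date, up to and including the end date (UTC).
function businessDaysBetween(start: Date, end: Date): number {
  const cursor = new Date(
    Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), start.getUTCDate())
  );
  const last = Date.UTC(end.getUTCFullYear(), end.getUTCMonth(), end.getUTCDate());
  let count = 0;
  while (cursor.getTime() < last) {
    cursor.setUTCDate(cursor.getUTCDate() + 1);
    const dow = cursor.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (dow !== 0 && dow !== 6) {
      count += 1;
    }
  }
  return count;
}

// Only postmortems still in draft are flagged; later statuses pass this audit.
function auditDraftAge(status: string, incidentClose: Date, now: Date): AuditResult {
  if (status !== "draft") {
    return "ok";
  }
  const age = businessDaysBetween(incidentClose, now);
  if (age > 7) {
    return "fail";
  }
  if (age >= 4) {
    return "warn";
  }
  return "ok";
}
```

Under these assumptions, a draft for an incident closed on Friday 2026-08-14 starts warning on Thursday 2026-08-20 and fails from Wednesday 2026-08-26.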

Money-flow / regulatory / security addenda

When `money_flow_impact: true`, the postmortem must include a `## Money-flow accounting` section (dollars at risk, recovered, lost, refunded; customers affected).

When `regulatory_impact: true`, the postmortem must include a `## Regulatory disclosure assessment` section (counterparties, deadlines, owners, status).

When `security_impact: true`, use `_template-security.md`, which has dedicated sections for breach scope, data exposure, disclosure decisions, and law enforcement contact.
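
A check for the two flag-driven sections might look like the sketch below. The flag names and heading text come from the rules above; the `ImpactFlags` interface and `missingAddenda` are hypothetical, and the security template's section headings are not checked because their exact titles are not given here.

```typescript
// Frontmatter flags named in this section; the interface itself is an assumption.
interface ImpactFlags {
  money_flow_impact?: boolean;
  regulatory_impact?: boolean;
}

const REQUIRED_ADDENDA: Array<[keyof ImpactFlags, string]> = [
  ["money_flow_impact", "## Money-flow accounting"],
  ["regulatory_impact", "## Regulatory disclosure assessment"],
];

// Returns the H2 headings that the flags require but the body does not contain.
function missingAddenda(flags: ImpactFlags, body: string): string[] {
  const headings = body
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => /^## /.test(line));
  const missing: string[] = [];
  for (const [flag, heading] of REQUIRED_ADDENDA) {
    if (flags[flag] === true && headings.indexOf(heading) === -1) {
      missing.push(heading);
    }
  }
  return missing;
}
```

Headings are matched exactly after trimming whitespace, so a renamed section such as `## Money flow` would be reported as missing.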

Index

(No incidents yet. This index will populate as incidents occur.)

Date	Incident	Severity	SLO impact	Postmortem
-	-	-	-	-

Related

`docs/PRD/17-engineering-standards.md` Section 17.6 - severity definitions, on-call structure, agent role
`docs/runbooks/` - per-alert runbooks (where mitigations live)
`docs/ADR/0004-agent-assisted-on-call.md` - rationale for L1 + L2 read-only agent
`infra/observability/slos/` - SLO definitions whose violation triggers postmortems
