
Postmortems

Blameless retrospectives for production incidents. The artifact that turns an incident into permanent improvements.

This directory holds postmortems per Section 17.6 of docs/PRD/17-engineering-standards.md. The L2 on-call agent auto-drafts the skeleton for mandatory-trigger incidents within 1h of incident close; the assigned author finishes the prose.

Posture

  • Blameless. Focus on systems and processes, not individuals.
  • Honest. Don't sanitize for hypothetical future readers. The audience is us, learning.
  • Time-boxed. Refined within 5 business days while memory is fresh.
  • Action-oriented. Every postmortem closes with owned, dated action items that become GitLab issues.

When to write a postmortem

Hard triggers (mandatory)

| Trigger | Tier |
|---|---|
| Any Sev 0 incident | Full |
| Sev 1 sustained > 15 min | Full |
| Any SLO violation (any of the 7 SLOs in infra/observability/slos/) | Full |
| Money-flow impact reached production | Full |
| Data integrity event (audit gap, hash-chain break) | Full |
| Security event | Full (_template-security.md) |
| Repeat incident (same root cause as one in past 30 days) | Full |
| Migration-induced incident | Full |
| Vault unsealed=0 episode (any duration) | Full |
| Aurora failover (planned or not) | Standard |
| Sev 1 ≤ 15 min | Standard |
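
The hard-trigger rules can be read as a small decision function. The sketch below is illustrative only: the `Incident` shape and all of its field names are assumptions, not the project's actual schema, and it covers only the mandatory triggers.

```typescript
// Hypothetical incident shape; field names are assumptions for illustration.
type Tier = "Full" | "Standard";

interface Incident {
  severity: 0 | 1 | 2 | 3 | 4;
  durationMin: number;
  sloViolation?: boolean;
  moneyFlowImpact?: boolean;
  dataIntegrityEvent?: boolean;
  securityEvent?: boolean; // Full tier, written with _template-security.md
  repeatWithin30Days?: boolean;
  migrationInduced?: boolean;
  vaultUnsealedZero?: boolean;
  auroraFailover?: boolean;
}

// Returns the mandatory tier, or null when no hard trigger applies.
function mandatoryTier(i: Incident): Tier | null {
  const full =
    i.severity === 0 ||
    (i.severity === 1 && i.durationMin > 15) ||
    i.sloViolation ||
    i.moneyFlowImpact ||
    i.dataIntegrityEvent ||
    i.securityEvent ||
    i.repeatWithin30Days ||
    i.migrationInduced ||
    i.vaultUnsealedZero;
  if (full) return "Full";
  // Remaining hard triggers: Aurora failover, or a Sev 1 of 15 min or less.
  if (i.auroraFailover || i.severity === 1) return "Standard";
  return null;
}
```

Full-tier conditions are checked first so that an incident matching several rows always lands on the stricter tier.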

Soft triggers (founder discretion)

| Trigger | Tier |
|---|---|
| Sev 2 sustained > 1h | Standard |
| Near-miss (would-have-been Sev 0 but caught in time) | Note |
| Deploy required emergency rollback | Standard |
| Alert fired that revealed a missing capability | Note (gate-escape per Section 17.3.10) |

Not required

  • Sev 3 / Sev 4 routine alerts that auto-resolved
  • Single flaky test or single transient blip with no user impact
  • Vendor-induced outage where we had no agency (track in vendor incident log instead)

Templates

| Tier | Use for | Template |
|---|---|---|
| Full | Sev 0; Sev 1 > 15 min; SLO violation; money-flow; regulatory; data integrity | _template-full.md |
| Standard | Sev 1 ≤ 15 min; Aurora failover; soft-trigger items | _template-standard.md |
| Note | Near-misses; vendor-induced; gate-escape captures | _template-note.md |
| Security | Any incident with security_impact: true | _template-security.md |

Naming

YYYY-MM-DD-<short-incident-slug>.md. Examples:

  • 2026-08-14-aurora-failover-stuck-replica.md
  • 2026-09-03-payout-stripe-webhook-backlog.md
  • 2026-10-21-security-vault-token-rotation-miss.md
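
A filename check for this convention might look like the sketch below. The lowercase, hyphen-separated slug rule is inferred from the examples above and is an assumption; `isValidPostmortemName` is a hypothetical helper, not part of the existing scripts.

```typescript
// YYYY-MM-DD-<slug>.md, where the slug is assumed to be lowercase
// alphanumeric words joined by single hyphens.
const POSTMORTEM_NAME = /^(\d{4})-(\d{2})-(\d{2})-[a-z0-9]+(?:-[a-z0-9]+)*\.md$/;

function isValidPostmortemName(name: string): boolean {
  const m = POSTMORTEM_NAME.exec(name);
  if (!m) return false;
  const y = Number(m[1]);
  const mo = Number(m[2]);
  const d = Number(m[3]);
  // Round-trip through Date to reject impossible dates such as 2026-02-30.
  const date = new Date(Date.UTC(y, mo - 1, d));
  return (
    date.getUTCFullYear() === y &&
    date.getUTCMonth() === mo - 1 &&
    date.getUTCDate() === d
  );
}
```

Template files such as _template-full.md do not match the pattern, so a check like this would need to skip them explicitly.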

Lifecycle

| Stage | Owner | SLA |
|---|---|---|
| status: draft | Agent (auto, full / standard) or human (note) | 1 h after incident close (mandatory triggers) |
| status: refined | Author writes prose, accepts/rejects agent suggestions | 5 business days |
| status: reviewed | Author reviews own draft after 24h cool-off (peer review when team grows) | 7 business days |
| status: closed | All action items resolved or formally re-deferred with new dates | 30 days |
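
To make the SLAs concrete, the sketch below computes a deadline for each stage. It rests on several assumptions that the table does not state: every SLA is measured from incident close, business days are Monday to Friday in UTC with no holiday calendar, and the 30-day window is calendar days. The function names are hypothetical.

```typescript
type Status = "draft" | "refined" | "reviewed" | "closed";

const HOUR_MS = 60 * 60 * 1000;

// Adds n business days (Mon-Fri, UTC). Holidays are ignored in this sketch.
function addBusinessDays(start: Date, n: number): Date {
  const d = new Date(start.getTime());
  let remaining = n;
  while (remaining > 0) {
    d.setUTCDate(d.getUTCDate() + 1);
    const dow = d.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (dow !== 0 && dow !== 6) remaining--;
  }
  return d;
}

// Deadline for reaching a given status, assuming the clock starts at incident close.
function stageDeadline(status: Status, incidentClose: Date): Date {
  switch (status) {
    case "draft":
      return new Date(incidentClose.getTime() + HOUR_MS);
    case "refined":
      return addBusinessDays(incidentClose, 5);
    case "reviewed":
      return addBusinessDays(incidentClose, 7);
    case "closed":
      return new Date(incidentClose.getTime() + 30 * 24 * HOUR_MS);
  }
}
```

For an incident closed on Friday 2026-08-14 at 12:00 UTC, this yields a refined deadline of Friday 2026-08-21 and a reviewed deadline of Tuesday 2026-08-25.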

Action items

Every action item:

  • Has an owner (specific GitLab handle).
  • Has a due date.
  • Has a type: corrective (fixes the immediate cause), preventive (prevents the class of cause), or doc (improves a runbook or standard).
  • Becomes a tracked GitLab issue with the postmortem-action label.
  • References the postmortem ID in the issue description.

Action items without owners and due dates do not count.
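
As an illustration of that rule, an action item could be modeled as below. The `ActionItem` field names and the ISO date format for `due` are assumptions, not the actual frontmatter schema.

```typescript
type ActionType = "corrective" | "preventive" | "doc";

// Hypothetical shape; the real schema may name these fields differently.
interface ActionItem {
  owner?: string; // specific GitLab handle
  due?: string; // assumed ISO date, YYYY-MM-DD
  type?: ActionType;
  issue?: string; // reference to the GitLab issue carrying the postmortem-action label
}

// An item counts only when it has both a non-empty owner and a well-formed due date.
function countsAsActionItem(item: ActionItem): boolean {
  const hasOwner = typeof item.owner === "string" && item.owner.trim().length > 0;
  const hasDue = typeof item.due === "string" && /^\d{4}-\d{2}-\d{2}$/.test(item.due);
  return hasOwner && hasDue;
}
```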

CI gates

Postmortem MRs are validated by scripts/check-postmortems.ts. Hard gates:

  1. Schema — frontmatter validates against schema (required fields, valid enums)
  2. Tier sections — required H2 sections present per tier
  3. Action items — Full / Standard tiers have ≥ 1 action item with owner + due date
  4. GitLab issue linkage — Full / Standard tiers reference issues with postmortem-action label (required at status: reviewed)
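
Gates 3 and 4 can be sketched as a pure function over parsed frontmatter. This is not the code in scripts/check-postmortems.ts; the `Postmortem` shape and its field names are assumptions, as is the reading that issue linkage applies from `status: reviewed` onward and to every owned action item. Verifying the postmortem-action label would need a GitLab lookup, which is left out here.

```typescript
// Hypothetical parsed-frontmatter shape, for illustration only.
interface Postmortem {
  tier: "Full" | "Standard" | "Note" | "Security";
  status: "draft" | "refined" | "reviewed" | "closed";
  actionItems: { owner?: string; due?: string; issue?: string }[];
}

// Returns one message per failed gate; an empty array means both gates pass.
function gateErrors(pm: Postmortem): string[] {
  const errors: string[] = [];
  // Gates 3 and 4 apply to Full and Standard tiers only.
  if (pm.tier !== "Full" && pm.tier !== "Standard") return errors;

  const owned = pm.actionItems.filter((a) => Boolean(a.owner) && Boolean(a.due));
  if (owned.length === 0) {
    errors.push("action-items: need at least one item with owner and due date");
  }

  // Assumption: linkage is enforced at reviewed and stays enforced at closed.
  const linkageRequired = pm.status === "reviewed" || pm.status === "closed";
  if (linkageRequired && owned.some((a) => !a.issue)) {
    errors.push("issue-linkage: every owned action item must reference a GitLab issue");
  }
  return errors;
}
```

Returning a list of messages instead of throwing on the first failure lets a single CI run report every failed gate.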

Scheduled audits (scripts/audit-postmortems.ts):

  1. Postmortem written within 5 business days of incident close (warns at day 4; hard fail if status: draft past day 7)
  2. Action items closed within 30 days (daily report to #alerts-monitoring)
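
The timeliness audit reduces to a threshold check. The sketch below assumes that the day counts are business days since incident close, and that both the warning and the failure apply only while the postmortem is still `status: draft`; it is not the logic of scripts/audit-postmortems.ts.

```typescript
type AuditResult = "ok" | "warn" | "fail";
type Status = "draft" | "refined" | "reviewed" | "closed";

// businessDaysSinceClose: whole business days elapsed since incident close.
function auditTimeliness(status: Status, businessDaysSinceClose: number): AuditResult {
  if (status !== "draft") return "ok"; // already refined or beyond
  if (businessDaysSinceClose > 7) return "fail"; // still draft past day 7
  if (businessDaysSinceClose >= 4) return "warn"; // day 4 onward
  return "ok";
}
```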

Money-flow / regulatory / security addenda

When money_flow_impact: true, the postmortem must include a ## Money-flow accounting section (dollars at risk, recovered, lost, refunded; customers affected).

When regulatory_impact: true, the postmortem must include a ## Regulatory disclosure assessment section (counterparties, deadlines, owners, status).

When security_impact: true, use _template-security.md which has dedicated sections for breach scope, data exposure, disclosure decisions, and law enforcement contact.
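
These three rules can be checked mechanically. In the sketch below, the flag names and section headings come from the text above; the `template` argument, which stands for however the postmortem records its template, is an assumption, as is the helper name.

```typescript
interface ImpactFlags {
  money_flow_impact?: boolean;
  regulatory_impact?: boolean;
  security_impact?: boolean;
}

// Returns a description of each unmet addendum requirement.
function missingAddenda(flags: ImpactFlags, body: string, template: string): string[] {
  // Collect H2 heading titles from the markdown body.
  const headings = new Set(
    body
      .split("\n")
      .filter((line) => line.startsWith("## "))
      .map((line) => line.slice(3).trim())
  );
  const missing: string[] = [];
  if (flags.money_flow_impact && !headings.has("Money-flow accounting")) {
    missing.push("## Money-flow accounting");
  }
  if (flags.regulatory_impact && !headings.has("Regulatory disclosure assessment")) {
    missing.push("## Regulatory disclosure assessment");
  }
  if (flags.security_impact && template !== "_template-security.md") {
    missing.push("template must be _template-security.md");
  }
  return missing;
}
```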

Index

(No incidents yet. This index will populate as incidents occur.)

| Date | Incident | Severity | SLO impact | Postmortem |
|---|---|---|---|---|
| - | - | - | - | - |