Post-Mortems and Blameless Culture
Post-mortems (also called incident retrospectives) are structured reviews of what went wrong after a significant system failure or service incident. Done well, they are among the most powerful learning mechanisms available to engineering teams. Done poorly, or avoided entirely, they leave teams vulnerable to repeating the same failures.
Blameless Post-Mortems
Blameless post-mortems focus on understanding system failures rather than assigning individual blame. This is not moral permissiveness — it is practical engineering. Individual blame creates incentives to hide problems and avoid risks. System-focused analysis identifies the contributing factors that made individual errors possible and likely — and produces changes that actually prevent recurrence.
Post-Mortem Structure
- Timeline: Factual, chronological account of events — when things happened, who was involved, what actions were taken
- Impact: How many users affected, for how long, what data or transactions were affected
- Root cause analysis: A technique such as the "5 Whys", where you keep asking why until you reach the underlying systemic cause
- Contributing factors: What made the root cause possible? Missing monitoring, unclear runbooks, deployment process gaps?
- Action items: Specific, owned, time-bound changes that will prevent recurrence
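The structure above can be sketched as a small data model, which makes it easy to check that a write-up is complete before it is circulated. This is a minimal illustration under stated assumptions, not a standard schema: the class names (`PostMortem`, `TimelineEvent`, `ActionItem`), the field names, and the `missing_sections` helper are all hypothetical, and "specific, owned, time-bound" is approximated here as "has a description, an owner, and a due date".

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import List, Optional


@dataclass
class TimelineEvent:
    at: datetime   # when it happened
    what: str      # what was observed or done, stated factually


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None   # hypothetical: a team or role name
    due: Optional[date] = None

    def is_actionable(self) -> bool:
        # Approximates "specific, owned, time-bound":
        # non-empty description, an owner, and a due date.
        return bool(self.description.strip()) and bool(self.owner) and self.due is not None


@dataclass
class PostMortem:
    title: str
    timeline: List[TimelineEvent] = field(default_factory=list)
    impact: str = ""
    whys: List[str] = field(default_factory=list)  # answers in the "5 Whys" chain
    contributing_factors: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are empty or incomplete."""
        missing = []
        if not self.timeline:
            missing.append("timeline")
        if not self.impact.strip():
            missing.append("impact")
        if not self.whys:
            missing.append("root cause analysis")
        if not self.contributing_factors:
            missing.append("contributing factors")
        # Action items count as missing if there are none,
        # or if any of them lacks an owner or a due date.
        if not self.action_items or not all(a.is_actionable() for a in self.action_items):
            missing.append("action items")
        return missing


# Hypothetical draft: timeline and impact are filled in, analysis is not,
# and the only action item has no owner or due date yet.
draft = PostMortem(
    title="Example incident (hypothetical)",
    timeline=[TimelineEvent(datetime(2030, 1, 1, 12, 0), "Alert fired")],
    impact="Placeholder impact summary",
    action_items=[ActionItem("Add monitoring for the failed dependency")],
)
print(draft.missing_sections())
```

Note that the model records what happened and what will change, with no field for who was at fault; ownership appears only on action items, where it assigns follow-up work rather than blame.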
Running a Blameless Meeting
Establish psychological safety before the meeting: the purpose is learning, not punishment. Encourage all participants to share their perspective and what they observed. The facilitator should be a neutral party where possible. Focus on "what happened", not "who caused it".