I’ve written a bit about working at Google in the past. Google does a lot of things right, and other companies would benefit by following their example.
At Google, one of the technical practices that I thought was both essential and very well done was the “post-mortem”: whenever they hit a significant problem, after putting out the fires and getting everything running again, they’d get the engineers closest to the problem to spend a day or two investigating the root cause of the issue and writing up their findings for everyone to read. The visibility of post-mortems meant that even a lowly browser engineer could go read in-depth content about how a live service went down for a day (“We didn’t think about what would happen if the data center caught on fire during the migration”), or the comic tale about what happens when a catering order for 1000 donuts is misunderstood as an order for 1000 dozen donuts. Some post-mortems are even made public.
The aim was a “blameless” post-mortem (nobody got in trouble for the results) where the goal was to identify the true root causes (not just the immediately precipitating errors) and file bugs to eradicate those causes and prevent recurrence of not just the same problem, but all similar problems in the future. As a part of the process, they’d calculate exactly how much the problem ended up costing in direct dollars (lost revenue, damage, etc.).
Bugs filed from post-mortems got worked on with priority: there was solid evidence showing the real danger of leaving things unfixed, and no one wanted to get burned by the same root causes twice. Having open, broadly shared post-mortems helps ensure that the same mistakes aren’t repeated, and it helps build a common understanding that fire marshals, who prevent fires, have a greater impact than firefighters, who put them out.
A key technique in the post-mortem was following the “Five Whys” paradigm (famously introduced at Toyota) for finding root causes, in which the participants would start at the immediate issue and then probe further toward the root causes by asking “And why did that happen?” For example: The downtime happened because the database ran out of space and the code didn’t notice. Why? Because there was no test for that case. Why? Because the test environment ran on different hardware with a mock database that couldn’t run out of space. Why? Because it was deemed too difficult to test on production-class hardware. Why? Because we haven’t prioritized building a parallel test environment. Why? Because it’s expensive and we didn’t think it was necessary. Now we know better.
The post-mortems were serious affairs — mandatory, well-funded (engineering time is expensive), and broadly reviewed — all of them published on an intranet portal for anyone in the company to learn from. They were tremendously effective — fixes for the root causes were prioritized based on cost and impact and rapidly addressed. I don’t think Google could have become a trillion-dollar company without them.
Many companies’ engineering cultures have adopted post-mortems in theory, but if your culture isn’t willing to expect, fund, recognize, and respect them, they become yet another source of overhead and another exhausting checkbox to tick.