I think part of what we’re seeing from the “learning from incidents” community is just a shift in thinking in software to say, “OK, they didn’t do something wrong. Something happened that made it make sense for them to do what they did,” and kind of allowing for that conversation to happen.
We need to be asking different questions and we need to give more people seats at the table. I’ve been at way too many organizations where the incident was just the [site reliability engineers] in the room. It should have had marketing in the room, it should have had PR in the room, it should have had customer service in the room, it should have had leadership in the room. But it’s thought of as kind of an SRE issue, like SREs have to prepare for any type of situation that gets thrown their way.
The flip side of #hugops is that I do think the leadership of those companies should be held responsible. We’re empathetic to the engineers who are dealing with the situation they have, but in part that’s because leadership isn’t prioritizing their actions, or resilience and reliability, in the same way that they prioritize some of their product efforts.
As my colleague Dr. Richard Cook has said, we shouldn’t be surprised that these systems go down. We should be more surprised that they stay up as often as they do.