Site Reliability Engineering: How to Manage Incidents
Incident management is one of the core responsibilities of the Site Reliability Engineering (SRE) team. The on-call engineer gets alerted to an incident and starts the investigation. A serious incident often requires bringing in people from other teams, such as developers to help understand a recent change or the network team to look at traffic routing.
That’s when incidents need clear processes for control, coordination, and communication; otherwise, things quickly spiral out of control with lots of people pursuing their own ideas.
SRE defines three roles in incident management:
- The incident commander, IC, who owns the incident,
- The operations lead, OL, who owns the investigation, and
- The communications lead, CL, who owns keeping people informed.
The ops lead and the comms lead may bring in other people to help, in which case the teams report to them and the leads report to the incident commander.
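The reporting structure described above can be sketched as a small data model. This is only an illustration: the names `Role`, `IncidentTeam`, `add_helper`, and `reports_to` are hypothetical and do not come from any particular incident tool.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Role(Enum):
    INCIDENT_COMMANDER = "IC"
    OPERATIONS_LEAD = "OL"
    COMMUNICATIONS_LEAD = "CL"


@dataclass
class IncidentTeam:
    """Hypothetical model of who holds each role and who helps each lead."""
    commander: str
    ops_lead: Optional[str] = None
    comms_lead: Optional[str] = None
    # People brought in by a lead, keyed by that lead's role.
    helpers: dict = field(default_factory=dict)

    def _lead_for(self, role: Role) -> Optional[str]:
        return self.ops_lead if role is Role.OPERATIONS_LEAD else self.comms_lead

    def add_helper(self, lead_role: Role, person: str) -> None:
        # Only the ops lead and the comms lead bring in extra people.
        if lead_role is Role.INCIDENT_COMMANDER:
            raise ValueError("helpers join under the ops lead or the comms lead")
        if self._lead_for(lead_role) is None:
            raise ValueError(f"no {lead_role.value} has been appointed yet")
        self.helpers.setdefault(lead_role, []).append(person)

    def reports_to(self, person: str) -> Optional[str]:
        """Return who `person` reports to, or None for the commander."""
        if person == self.commander:
            return None
        if person in (self.ops_lead, self.comms_lead):
            return self.commander
        for lead_role, people in self.helpers.items():
            if person in people:
                return self._lead_for(lead_role)
        raise KeyError(f"{person} is not part of this incident")


team = IncidentTeam(commander="alice", ops_lead="bob", comms_lead="carol")
team.add_helper(Role.OPERATIONS_LEAD, "dave")
print(team.reports_to("dave"))  # bob
print(team.reports_to("bob"))   # alice
```

Keeping the chain explicit means anyone can see that helpers answer to their lead, and leads answer to the commander.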
It’s important to keep those roles separate because you can’t focus on the investigation if you’ve got people calling you every 5 to 10 minutes for an update or if you’re trying to round up the engineering team.
Incident management is a formal process, and not every alert will trigger it. Some alerts will be clearly understood and could have a playbook, which the on-call SRE just needs to follow until that process gets automated.
But significant issues need to be handled as incidents. The decision can be at the on-call SRE's discretion: an issue that looks bad enough might become an incident immediately. Other triggers are a tricky issue where the SRE hasn't made progress within a reasonable period, say 1 hour, or anything that impacts multiple teams or paying customers.
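The escalation criteria in this paragraph could be written down as a simple check. The sketch below is an assumption-laden illustration: the `Issue` fields, the function name `should_declare_incident`, and the fixed one-hour limit are hypothetical, and a real team would pick its own thresholds.

```python
from dataclasses import dataclass
from datetime import timedelta

# Assumed threshold: the article offers "say 1 hour" only as an example.
NO_PROGRESS_LIMIT = timedelta(hours=1)


@dataclass
class Issue:
    """Hypothetical summary of what the on-call SRE knows about an issue."""
    time_without_progress: timedelta
    teams_affected: int
    impacts_paying_customers: bool
    on_call_judges_severe: bool = False  # the SRE's own discretion


def should_declare_incident(issue: Issue) -> bool:
    """Return True if any one of the escalation criteria is met."""
    return (
        issue.on_call_judges_severe
        or issue.time_without_progress >= NO_PROGRESS_LIMIT
        or issue.teams_affected > 1
        or issue.impacts_paying_customers
    )


# A routine alert that a playbook can handle.
routine = Issue(timedelta(minutes=10), teams_affected=1, impacts_paying_customers=False)
# The same issue after 90 minutes without progress.
stuck = Issue(timedelta(minutes=90), teams_affected=1, impacts_paying_customers=False)

print(should_declare_incident(routine))  # False
print(should_declare_incident(stuck))    # True
```

Any single criterion is enough to escalate, which matches the "or" wording of the text.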
It becomes an incident as soon as an incident commander is appointed, and that could be the SRE who’s on call and is already looking at the issue, or they could ask another SRE to be commander so they can continue the investigation as the ops lead.
The commander can then appoint a comms lead, or they may own comms themselves initially. Communication starts with an incident document, which is a living document. You need a shared collaboration system to host it so that multiple people can edit it simultaneously and you can make it available on the network to whoever needs it.
As the incident progresses, the ops lead may call more people in, and they’ll communicate on a different channel, maybe through Slack or a group messaging system where bots can also post updates, so the incident channel might get updates from the monitoring system.
The comms lead will eavesdrop on that channel and post relevant updates to the incident document, as well as keep the stakeholders directly updated.
The incident commander is ultimately responsible for making the decisions, for example when the ops lead has several alternatives for fixing the issue or the comms lead wants to make a public statement. But nobody is expected to hold one of these roles around the clock until the incident is fixed.
Shifts end and other SREs come on call, so the roles can be transferred. The incident document should have all the information for a smooth handover, and the commander handover needs to be explicit, so the incident always has an active commander.
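The requirement that a handover is explicit, and that the incident always has an active commander, can be illustrated with a short sketch. Everything here is hypothetical: the `Incident` class, its `log` list (standing in for the incident document), and the `acknowledged` flag are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    """Hypothetical incident record; `log` stands in for the incident document."""
    commander: str
    log: list = field(default_factory=list)

    def hand_over_command(self, outgoing: str, incoming: str, acknowledged: bool) -> None:
        # Only the current commander can hand over command.
        if outgoing != self.commander:
            raise PermissionError(f"{outgoing} is not the active commander")
        # The handover is explicit: the incoming commander must accept it.
        if not acknowledged:
            raise ValueError(f"{incoming} has not accepted command")
        # The role is reassigned in one step, so it is never left empty.
        self.commander = incoming
        timestamp = datetime.now(timezone.utc).isoformat()
        self.log.append(f"{timestamp} command handed from {outgoing} to {incoming}")


incident = Incident(commander="alice")
incident.hand_over_command("alice", "bob", acknowledged=True)
print(incident.commander)  # bob
```

Because a failed handover raises before the commander changes, the outgoing commander stays in charge until the transfer is accepted.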
This incident management process ensures everyone knows what they're doing, with a clear chain of command. It gives the investigators space to find a solution, and it keeps stakeholders updated.
When the investigation is done and the problem is fixed or mitigated, then the incident closes down and everyone returns to their normal roles.
The incident document lives on as the basis for the postmortem, which the incident team will produce.