A Five-Stage Standard Model to Work on Incidents Effectively
Every incident is different, so the best way to make sure you’re working effectively is to follow a standard model.
In my last article, How to Manage Incidents, I talked about why not every alert should trigger an incident.
Some issues are simple: the alert comes in with a link to the playbook, you follow the steps in the playbook, and close the issue down. These issues are good candidates for automation projects that replace human intervention, which I have covered in my article, How to effectively Identify and Measure Toil as Site Reliability Engineer.
The other type of incident needs thoughtful investigation, and for some SREs, tracking down a really awkward production issue is the most fun part of the job.
An SRE proposes this workflow:
- Triage,
- Examine,
- Diagnose,
- Test, and
- Cure.
That’s a clear model where each stage has expected outputs, and it really helps to put structure around your investigation.
The input that starts the workflow is your alert, which could be:
- An automated response to a failing SLO with all the details of the metrics that triggered it, or
- It could be a problem report from a human, which ideally states
- the expected behavior,
- the actual behavior, and
- the steps to reproduce.
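A problem report with those three elements can be captured as a simple structure. Here is a minimal sketch in Python; the class and field names are assumptions for illustration, not any standard format:

```python
from dataclasses import dataclass, field

@dataclass
class ProblemReport:
    """A human-filed problem report; field names are illustrative only."""
    expected: str                  # what the reporter expected to happen
    actual: str                    # what actually happened
    steps_to_reproduce: list[str] = field(default_factory=list)

report = ProblemReport(
    expected="Checkout completes in under 2 seconds",
    actual="Checkout times out after 30 seconds",
    steps_to_reproduce=["Add any item to the cart", "Click 'Pay now'"],
)
print(report.actual)
```

Even when the report arrives as free text, restating it in this shape forces the reporter (or you) to separate expectation from observation before the investigation starts.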
The Triage Stage
Triage is the first stage in that investigation, where the aim is to get the system back to a healthy state as quickly as possible.
This is remediation before any active investigation, which you need to stop the problem from escalating, with tactics like:
- Scaling up computing with more servers,
- Rerouting traffic to other data centers, or
- Throttling incoming requests, or
- Switching the whole system to read‑only mode.
Your goal here is not to fix the problem but to restore service as best you can straightaway, before you focus on the real fix.
Adding more servers can bring latency down, and switching to read‑only mode can prevent data corruption. It's better to take steps like these first because they reduce the impact of the incident.
The output of this stage should ideally be a stable system, which gives you the space to investigate the problem more fully.
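One of the triage tactics above, throttling incoming requests, is often implemented as a token bucket. This is a minimal sketch, not a production rate limiter; the class name and the rate and capacity numbers are assumptions:

```python
import time

class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request instead of letting load pile up

bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(20)]  # a burst of 20 requests
```

In this burst, roughly the first ten requests are admitted and the rest are shed, which is exactly the trade you want during triage: some users see errors, but the system stays up.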
The Examination Stage
The examination stage is where you try to understand the nature of the problem and what triggers it.
This is where the effort you put into monitoring should pay off: your dashboards can show the issue trending worse, or a spike when it first happened, which could point to a window for a release or a config change.
This stage is where you’ll dig into system logs, and hopefully, you’ll have structured logs feeding into a central system with a search UI. If you have details of a process that is failing, you might be able to follow it through in the logs.
If your systems use some kind of transaction ID in their log entries, you can see the progress through different components. A service graph is another very useful tool here, which can visualize the flow of traffic in your system and show potential failure points. And distributed tracing, which records the request chain for HTTP calls, can be a great way to narrow down latency issues.
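Following a transaction ID through structured logs amounts to filtering every service's entries on that one field. A sketch, assuming JSON-formatted log lines; the field names (`ts`, `service`, `txn`, `msg`) are illustrative, not a standard:

```python
import json

# Log lines collected from several services, as they might arrive
# in a central log system (fields here are hypothetical).
raw_logs = [
    '{"ts": "12:00:01", "service": "frontend", "txn": "abc123", "msg": "request received"}',
    '{"ts": "12:00:01", "service": "frontend", "txn": "zzz999", "msg": "request received"}',
    '{"ts": "12:00:02", "service": "payments", "txn": "abc123", "msg": "card declined"}',
]

def trace(lines, txn_id):
    """Return every structured log entry for one transaction, in order."""
    entries = (json.loads(line) for line in lines)
    return [e for e in entries if e.get("txn") == txn_id]

for entry in trace(raw_logs, "abc123"):
    print(entry["ts"], entry["service"], entry["msg"])
```

A central log UI does this filtering for you, but the principle is the same: one shared ID turns scattered log lines into a timeline of a single request's journey.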
The output of the examination stage should be a clear understanding of the type of problem and the behavior that triggers it.
The Diagnosis Stage
The diagnosis stage is where you try to find the cause of the problem. A good approach here is to take a vertical slice through the system and look at what components are involved, how they connect, and what the flow of data is.
You may be able to trace the expected stack and see how far the process gets when the issue hits. The four Ws will help here.
- What is the system doing?
- Why is the behavior not as expected?
- Where are the resources going if the CPU has started to spike, or where is the data going if it doesn’t end up where it should? and
- When did this start? Did it coincide with the change to this system or a platform change or a new release of some other system on the shared infrastructure?
The output of this stage is a short list of possible causes.
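Tracing the expected stack through a vertical slice can be sketched as walking the components in data-flow order and stopping at the first one that fails its check. The component names and checks below are hypothetical stand-ins:

```python
def first_failure(slice_checks):
    """Walk (component, check) pairs in data-flow order.

    Returns the first component whose check fails, or None if all pass.
    """
    for component, check in slice_checks:
        if not check():
            return component
    return None

# Hypothetical vertical slice: load balancer -> app server -> database.
slice_checks = [
    ("load_balancer", lambda: True),   # e.g. responds to a health probe
    ("app_server",    lambda: True),   # e.g. serves its health endpoint
    ("database",      lambda: False),  # e.g. connection check fails
]
print(first_failure(slice_checks))  # the component where the process stops
```

In practice each check is a real probe (a health endpoint, a connection attempt, a queue-depth metric), but the ordering logic is the point: it tells you how far the process gets before the issue hits.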
The Testing Stage
Testing is where you narrow down the shortlist to one probable cause. This can be the lengthiest stage because it’s often about manually going through the steps that the system takes, trying to replicate the exact cause from your shortlist. It can involve things like:
- Making HTTP requests on the command line to see the details of the response from a web application or API. Tools like curl can show a verbose output of the whole HTTP exchange, which can help narrow things down, or
- Maybe it’s logging into a database using the credentials of the service to see if there’s a permissions issue, or
- Tracing SQL statements that the app generates, or
- Running a traceroute to see which network paths the traffic takes.
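The first tactic, inspecting the whole HTTP exchange rather than just the body, can also be done from Python's standard library when curl isn't to hand. This sketch starts a throwaway local server so it is self-contained; the port, path, and handler are assumptions for illustration:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A stand-in for the web application under investigation.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the example's output quiet

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/healthz")
resp = conn.getresponse()
# Like `curl -v`, look at status line and headers, not just the body.
print(resp.status, resp.reason)
print(dict(resp.getheaders()))
body = resp.read()
server.shutdown()
```

Seeing the status code and headers alongside the body is often what narrows things down: a 302 to a login page, a missing cache header, or an unexpected content type tells a different story than the body alone.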
You should note down every step you take, especially if you make any changes to the live system, so you know what to revert if they don’t have any impact.
Negative results are fine here.
It may be frustrating when you think you’ve got the answer and it turns out not to be the issue, but it means you can definitively rule something out and move on. The key thing here is to start simple and try to rule out things that are quick to test.
The output of this stage is your shortlist narrowed down to one highly likely cause. Some issues are all but impossible to replicate: race conditions, or failures that only happen with a specific set of data.
So if you can rule everything else out, it might be that the best you can do is say, “I’m pretty sure that this is the problem.”
The Curing Stage
The final stage is the cure, which is where you apply the fix with a good degree of certainty that it will address the issue, and you document the steps.
The output could be a fully working system, if you could identify and fix the issue, or a mitigation that is good enough for now, with enough detail to understand what the long‑term fix needs to be.
In the worst case, it might be a mitigation without a clear understanding of the real fix. In that case, the output should also include must‑have instrumentation requirements: if your investigation wasn’t successful, you can at least say what extra detail you needed, such as more logs or more metrics.
If the problem does persist, the next investigation has a better chance of getting to the root cause. When the investigation is done and the incident is closed, the final output is a postmortem.