How to Set Up Multi-Burn-Rate Window Alerts on Service Level Objectives
The burn rate is a calculation of how fast an issue is burning through the error budget.
The concept of alerting is pretty simple. When an SLI tells you you’re consuming an error budget, you need to get a human involved to protect your SLO. The mechanics are fairly straightforward too.
In my previous articles, I have discussed How to track Service Level Objectives, Site Reliability Engineering: SLI Implementation Example, and Site Reliability Engineering: Which metrics help to measure SLI?
Your monitoring system is constantly measuring SLIs, and you can capture your SLOs in alerting rules. You extend your monitoring to generate alerts when the numbers from the SLIs look like they might threaten the SLO. The alert will be a ticket if the situation needs looking at but isn’t urgent and a page if the situation is urgent.
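As a minimal sketch of that first step, the error-rate SLI can be captured as a Prometheus recording rule. The `http_requests_total` counter, its `code` label, and the rule name below are assumptions for illustration, so swap in whatever your own instrumentation exports:

```yaml
groups:
  - name: slo-sli
    rules:
      # Error-rate SLI: the fraction of requests that failed over the
      # last 10 minutes, per job (metric name and labels are assumed).
      - record: job:slo_errors_per_request:ratio_rate10m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[10m]))
          /
          sum by (job) (rate(http_requests_total[10m]))
```

The alerting rules later in this article are written against a recorded SLI like this one.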
But beyond this, alerting gets complicated pretty quickly because you’re dealing with statistical functions over time windows. How you define the alerting thresholds makes a big difference to:
- Whether you generate timely alerts that allow SREs to diagnose and fix problems before they deteriorate,
- Whether you generate alerts so late that the SLO is already in danger before they fire, or
- Whether you generate tons of noisy alerts, which just get ignored.
SRE talks about four attributes that affect your alerting.
- Precision is about keeping the signal-to-noise ratio high. A precise alert is triggered by a significant event, and 100% precision means every alert was triggered by a significant event.
- Recall is the other side of that, how many significant events actually triggered alerts. 100% recall means every significant event triggered an alert.
- Detection time is how long it takes for a significant event to trigger an alert, and
- Reset time is how long alerts continue to fire after the significant event has died down.
These all have a close relationship. Increase the detection time and you gain precision, because you’re waiting longer to be sure that an event is really significant. But you might lose recall, because events shorter than the detection time never trigger alerts. I’ll run through a couple of examples so you can see how these parameters affect your alerting and why this gets complicated.
So this example is an SLO for a success rate of three nines (99.9%), which means that we need to keep errors below 0.1% over the period.
We have an SLI measuring error rate, so the simplest alerting setup is just to trigger an alert every time the error rate hits the SLO threshold. Rates need to be calculated over a time period, and we could use a short window, say 10 minutes, to make sure we have fast detection.
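Assuming an error-rate SLI recorded like the rule earlier, that naive setup might look something like this:

```yaml
groups:
  - name: slo-alerts-naive
    rules:
      # Fires as soon as the 10-minute error rate reaches the SLO
      # threshold of 0.1%: fast detection, but very poor precision.
      - alert: SLOErrorRateTooHigh
        expr: job:slo_errors_per_request:ratio_rate10m > 0.001
        labels:
          severity: page
```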
If the error rate jumps to 1%, one minute of errors is enough to push the 10-minute average up to the 0.1% threshold, so the alert will trigger in about a minute. We can expect most events to trigger alerts using a short window like this, so recall will be good. But precision is very poor with this setup because it will fire on events that are not significant at all.
Let’s add some numbers to follow that through. We’re using a 28-day period, and we expect to serve 1 million requests in that time, so our three nines give an error budget of 1,000 failed requests.
Now assume we have a completely even distribution of requests. Of course, we won’t, but that’s the simplest way to look at the problem here, and that means about 25 requests per minute.
If the error rate jumps to 1%, the alert fires after about a minute, at which point only around 0.25 errors have occurred on average, just 0.025% of the error budget for the period. The resolution of this alert is all wrong. It will fire after a tiny proportion of the error budget has been used, even though it looks like we put together alerting parameters tied to the SLO.
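For reference, here is the arithmetic behind those figures, using the evenly distributed traffic assumed above:

$$
\text{error budget} = (1 - \text{SLO target}) \times \text{expected requests} = 0.001 \times 1{,}000{,}000 = 1{,}000 \text{ errors}
$$

$$
\text{budget consumed when the alert fires} \approx \frac{25 \times 1\%}{1{,}000} = \frac{0.25}{1{,}000} = 0.025\%
$$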
The SRE Workbook goes through a whole set of examples of badly tuned alerts, all variations of this problem, and then it suggests a preferred solution: alerting based on the burn rate of the error budget.
The burn rate is a calculation of how fast an issue is burning through the error budget. For a 28-day period with a three-nine SLO, the burn rate is 1 if you have a constant error rate of 0.1%. That’s a steady burn of the error budget, which leaves it completely gone at the end of the period.
If the error rate is at a constant 1%, the burn rate is 10, and the error budget will be gone in 2.8 days. At a 100% error rate, the burn rate is 1,000 and the budget is all gone in about 40 minutes. You use the burn rate to calculate your alerting parameters.
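Written out as formulas, with the three-nines target from this example:

$$
\text{burn rate} = \frac{\text{observed error rate}}{1 - \text{SLO target}} = \frac{\text{observed error rate}}{0.1\%}
$$

$$
\text{time to exhaust the budget} = \frac{\text{SLO period}}{\text{burn rate}}, \quad \text{e.g. } \frac{28 \text{ days}}{1000} \approx 40 \text{ minutes at a 100\% error rate}
$$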
The most sophisticated strategy here uses multiple burn rates over different time windows. A high rate of errors that comes in a burst will trigger a page, and a low but sustained error rate will trigger a ticket. That gives you the best compromise of precision and recall.
The high burn-rate alerts over short windows catch fast-burning incidents quickly, and the low burn-rate alerts over long windows make sure that slow, sustained burns still get picked up, so recall stays high without flooding anyone with pages. You need to work out the burn-rate thresholds and windows that you are happy with.
With a setup like that, you’ll be getting tickets for sustained low-level error rates (tens of hours at a rate of 0.1%) and pages for bursts of high-level errors (around 10 minutes at a 10% error rate). The burn rates are the numbers you plug into your Prometheus queries, which trigger the alerts.
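Here is a minimal sketch of what those rules can look like, using the SRE Workbook’s suggested starting points (a 14.4x burn rate over 1 hour for the page, a 1x burn rate over 3 days for the ticket) and assuming the error-rate SLI has been recorded for each of the windows used below, just like the 10-minute rule earlier:

```yaml
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Fast burn: a 14.4x burn rate sustained over the last hour pages someone.
      # 14.4 * 0.001 is the error-rate threshold that corresponds to that burn rate.
      - alert: ErrorBudgetFastBurn
        expr: |
          job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.001)
          and
          job:slo_errors_per_request:ratio_rate5m > (14.4 * 0.001)
        labels:
          severity: page
      # Slow burn: a 1x burn rate sustained over the last 3 days opens a ticket.
      - alert: ErrorBudgetSlowBurn
        expr: |
          job:slo_errors_per_request:ratio_rate3d > (1 * 0.001)
          and
          job:slo_errors_per_request:ratio_rate6h > (1 * 0.001)
        labels:
          severity: ticket
```

Each alert pairs the long window with a much shorter one, so it stops firing soon after the errors stop rather than staying red for the rest of the long window, which keeps the reset time down.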
The downside to this sophisticated level of alerting is the complexity of managing the rules. It really does work in practice, though. A concrete statement like the one above, tickets for slow sustained burns and pages for fast bursts, shows how your alerting gives you a clear response to events that threaten your SLO, and it’s the end result of the hard work of building out and testing the alerting parameters.