How do you keep track of your actual service levels?
Keeping your services running well enough without excessive maintenance effort starts with a few crucial questions:
- What service level are your apps currently running at?
- What service level does the business expect?
- How do you monitor the actual service level?
- And what do you do if it falls below the expected service level?
Answering these questions gives you explicit guidance on how to define target service levels, how to monitor the actual service levels, and what to do when your apps experience a loss of service.
Operations teams have typically used service level agreements, SLAs, to measure system health. They might have a stated target of 99% availability for their service, which sounds good but raises some pretty important questions:
- From the business: why only 99%? If you can do 99%, then surely you can do 100% and keep the system available all the time.
- From IT management: what happens if you breach the SLA? How do we know if it drops to 98%, and how do we get back within tolerance?
SRE does things differently. Service health is defined in terms of multiple service level objectives, SLOs, which are user-focused rather than operations-focused. So a generic 99% availability SLA is replaced with specific targeted SLOs, like:
- 99.9% of requests will be handled successfully (SLO 1)
- 90% of responses will be returned within half a second (SLO 2)
- 99% of responses will be returned within 2 seconds (SLO 3)
These SLOs track the user experience, with the goal of defining a level of reliability that's good enough to keep customers happy.
If the app starts generating errors or takes a long time to respond, then users will log complaints or switch to another product. The SLOs are meant to define the thresholds where you need to act before that happens. That’s always going to be below 100% because 100% isn’t practically possible.
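To make the measurement side concrete, here's a minimal sketch in Python of how you might compute those three service level indicators from a batch of request records. The `Request` record and its field names are illustrative assumptions, not part of any particular monitoring tool; in practice these numbers would come from your metrics or logging platform.

```python
from dataclasses import dataclass

@dataclass
class Request:
    succeeded: bool      # did the request return a successful response?
    latency_secs: float  # how long the response took, in seconds

def slo_report(requests: list[Request]) -> dict[str, float]:
    """Compute the three example SLIs over a batch of request records."""
    total = len(requests)
    return {
        "slo1_success_rate": sum(r.succeeded for r in requests) / total,
        "slo2_under_500ms":  sum(r.latency_secs <= 0.5 for r in requests) / total,
        "slo3_under_2s":     sum(r.latency_secs <= 2.0 for r in requests) / total,
    }

# Example: two fast successes, one slow success, one failure
sample = [
    Request(True, 0.12), Request(True, 0.30),
    Request(True, 1.40), Request(False, 0.05),
]
print(slo_report(sample))
# {'slo1_success_rate': 0.75, 'slo2_under_500ms': 0.75, 'slo3_under_2s': 1.0}
```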
Let's make that practical: take a 28-day measurement window. There are 40,320 minutes in that period where you're monitoring response times. A target of 99% gives you 403 minutes outside of the SLO before you breach it (that's 1% of 40,320). That's around 7 hours each month when your response time service level might be unacceptable to users, which does seem like a lot. But three nines is 99.9%, and that gives you just 40 minutes a month (that's 0.1% of 40,320). Four nines only allows you 4 minutes a month, and five nines is just silly.
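As a quick check of that arithmetic, here's a small snippet that turns an SLO target into an error budget in minutes over a 28-day window:

```python
# Error budget in minutes for a 28-day window at different SLO targets.
WINDOW_MINUTES = 28 * 24 * 60  # 40,320 minutes

for slo in (0.99, 0.999, 0.9999):
    budget_minutes = WINDOW_MINUTES * (1 - slo)
    print(f"SLO {slo:.2%}: {budget_minutes:,.0f} minutes out of tolerance allowed")

# SLO 99.00%: 403 minutes out of tolerance allowed
# SLO 99.90%: 40 minutes out of tolerance allowed
# SLO 99.99%: 4 minutes out of tolerance allowed
```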
Modern distributed systems have multiple moving parts which run across multiple internal and external services. For most applications, four nines is very difficult to achieve and very expensive.
You might be looking at dual-running services across multiple clouds, or hybrid cloud and data center deployments, and keeping live data in sync globally. That's going to hugely complicate your architecture and significantly increase your running costs. And it's probably not necessary to have those stringent service levels.
If a web page takes more than a few seconds to respond, users will just hit F5. The next request will likely hit different servers in different locations and come back just fine. Occasional issues don’t affect the vast majority of the user experience, so they’re not likely to be worth the investment of adding another nine.
Error Budgets
SLOs need to be achievable targets, and they're agreed between stakeholders: the business product owner, the dev team, and the SRE team. What the SLO gives you is a window where it's okay to be out of tolerance, and that's your error budget.
The error budget is simply 100% minus your SLO. So if an SLO target is 99.9%, then you have an error budget of 0.1% over the period, and that’s your 40 minutes over 28 days. You need the error budget so you can comfortably change the system. The vast majority of service issues are caused by an update or a configuration change, so the error budget effectively controls the release velocity of the product. That’s why it needs to be a three-way agreement between the product owner, the dev team, and the SRE team.
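Here's a minimal sketch of tracking that budget as the period goes on, assuming you already have a count of out-of-tolerance minutes from your monitoring (the function name and inputs are illustrative):

```python
def error_budget_remaining(slo: float, window_minutes: int, bad_minutes: float) -> float:
    """Minutes of error budget left in the window (negative means the SLO is breached)."""
    budget = window_minutes * (1 - slo)
    return budget - bad_minutes

# 99.9% SLO over a 28-day window, with 25 bad minutes recorded so far
remaining = error_budget_remaining(slo=0.999, window_minutes=40_320, bad_minutes=25)
print(f"{remaining:.0f} minutes of error budget left")  # ~15 minutes
```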
A new product might have a lower SLO because the business wants to move fast and experiment. That means lots of releases, which come with lots of risks, and they’ll have to accept the lower SLO because that’s what the dev team needs to keep moving.
Long-established products with lots of happy but demanding users might have a much tighter SLO, with the business agreeing to minimize new feature releases in favor of keeping the application stable. Too many issues will exhaust your error budget for the period, and when that happens, there needs to be an agreed policy to get the service back into tolerance.
In SRE, the error budget policy is a formal document that defines what happens when there’s a breach in the SLO. Again, this is agreed between the business and the IT teams, and the policy specifies how to react to try and maintain the SLO.
The policy can define different responses depending on when and how badly the error budget is breached, for example (a simple decision sketch follows the list):
- If there was a minor breach in SLO at the end of a period, it could mean reprioritizing feature development. So the dev team also work on reliability issues in the next period.
- If the SLO gets breached mid-period, it could mean a feature freeze. So dev can only work on reliability issues for the rest of that period.
- And a serious SLO breach might mean a complete change freeze. So there are no new releases for that period except for critical security patches.
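Here's a rough sketch of how those responses might be expressed as a decision function. The serious-breach threshold (twice the budget) and the exact wording of each response are illustrative assumptions, not a standard; a real error budget policy is a written agreement between the teams, not code.

```python
def release_policy(budget_consumed: float, period_elapsed: float) -> str:
    """Illustrative mapping from error-budget state to a release policy.

    budget_consumed: fraction of the period's error budget used (1.0 = fully spent)
    period_elapsed:  fraction of the measurement period that has passed
    """
    if budget_consumed < 1.0:
        return "within budget: normal release cadence"
    if budget_consumed >= 2.0:  # assumed threshold for a 'serious' breach
        return "serious breach: change freeze, critical security patches only"
    if period_elapsed >= 1.0:
        return "minor breach at period end: prioritize reliability work next period"
    return "mid-period breach: feature freeze, reliability work only"

print(release_policy(budget_consumed=1.3, period_elapsed=0.5))
# mid-period breach: feature freeze, reliability work only
```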
Between the SLO and the error budget, SRE gives you a structure to keep users happy and give the product the velocity that the business wants. The error budget policy is the contract that ensures that you can balance those concerns.