Site Reliability Engineering: SLI Implementation Example
The Service Level Indicator is the ongoing measurement of your system that tells you whether you’re meeting your objective.
In my article on how to keep track of your Service Level Objectives, I discussed ways to track the SLO, which is the target level of reliability.
The Service Level Indicator (SLI), by contrast, is the ongoing measurement of your system that tells you whether you’re meeting that objective. Measurement is the key point here: these have to be factors you can collect automatically and quantify objectively. They’re still related to user experience, but they’re measurable.
You might record HTTP status codes and response times because they tell you the latency and error rate your users are experiencing. Different types of services will have different SLIs: a batch processing system might use the number of records processed successfully and the duration of the job, while a request-driven system, like a website or REST API, will use availability and latency at a minimum.
- Availability is the proportion of requests that return a successful response, and
- Latency is the time taken to serve each response.
SLIs are best represented as a percentage so you can standardize the measure across different types of data.
Availability for the web application will be measured as the ratio of successful requests over all requests.
Availability = (Successful Requests / All Requests) * 100
So if you have 10,000 requests over a period and only 9,600 of them return a successful response code, the availability SLI is 96%.
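As a minimal sketch of that calculation, assuming you’ve already pulled the HTTP status codes for the period from your access logs or metrics store (the sample data here just mirrors the 10,000-request example):

```python
# Availability SLI: successful requests over all requests.
# `status_codes` would come from your logs; sample data shown.
status_codes = [200] * 9600 + [500] * 400

# Treat anything below 500 as successful: 4xx usually means client
# error, not service failure. Adjust to your own definition.
successful = sum(1 for code in status_codes if code < 500)
availability = 100 * successful / len(status_codes)

print(f"Availability SLI: {availability:.1f}%")  # -> 96.0%
```

Where you draw the line on “successful” is a design choice; counting only 5xx responses as failures is a common starting point.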
Latency is best represented using percentiles so you can see the distribution of the data. That’s much more useful than a single average.
So if you have 5,000 requests and the average response time is 1.5 seconds, that could mean every request takes exactly 1.5 seconds, or it could mean 4,500 requests took half a second and the other 500 took 10 seconds. Percentiles will show the distribution and let you define a percentage SLI.
Here I can say that:
- 80% of requests had a response within half a second,
- 92% within 2 seconds, and
- 98% within 5 seconds.
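Here’s a sketch of how those figures could be derived, assuming `latencies` is a list of response times in seconds collected over the measurement period (the sample data is shaped to reproduce the distribution above):

```python
# Latency SLI: percentage of requests answered within each threshold.
# `latencies` would come from your monitoring system; sample data shown.
latencies = [0.3] * 4000 + [1.2] * 600 + [3.0] * 300 + [8.0] * 100

def percent_within(times: list[float], threshold: float) -> float:
    """Percentage of requests that completed within `threshold` seconds."""
    return 100 * sum(1 for t in times if t <= threshold) / len(times)

for threshold in (0.5, 2.0, 5.0):
    print(f"{percent_within(latencies, threshold):.0f}% within {threshold}s")
# -> 80% within 0.5s, 92% within 2.0s, 98% within 5.0s
```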
You’ll use SLIs as the ongoing measure of the system, but you can also use them to help define your SLOs.
If you don’t have fixed reliability targets in mind, it can be useful to measure where you currently are and see if that’s a reasonable starting point for your objectives. Using the example above, I can say that the availability SLI is currently a 96% success rate. Maybe that’s okay because we know some of the errors are not user-impacting. Or we don’t have that data, and we need to do some analysis to see whether 99% or 99.9% is achievable.
My latency distribution is not very even, and I want to address that. The SLO might relax the tightest latency threshold while raising the percentage targets: say, 90% of responses within 800 ms and 95% within 2 seconds. That’s also going to need some analysis to see whether it’s achievable before it gets adopted and agreed upon for the service.
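Before agreeing to those targets, you can run the same kind of calculation against recent data to see how far off they are. A quick sketch, using the same hypothetical latency sample as above:

```python
# Check the proposed SLO targets against measured latency data.
# Sample data and targets are illustrative assumptions.
latencies = [0.3] * 4000 + [1.2] * 600 + [3.0] * 300 + [8.0] * 100

proposed_slo = [(0.8, 90.0), (2.0, 95.0)]  # (threshold seconds, target %)

for threshold, target in proposed_slo:
    actual = 100 * sum(1 for t in latencies if t <= threshold) / len(latencies)
    verdict = "meets" if actual >= target else "misses"
    print(f"{actual:.1f}% within {threshold}s ({verdict} the {target}% target)")
```

With this sample distribution, both targets would currently be missed, which is exactly the kind of finding the analysis should surface before the SLO is adopted.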
Each SLO has its own error budget, and a breach of any of them will mean invoking the error budget policy. So these numbers need to be reasonable, but they also need to reflect the happy user experience.
You might decide not to launch your new SLOs if you know they’ll be breached in the first period. Or you might choose to do exactly that to bring the error budget policy into play straightaway and start improving system reliability.
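As a rough illustration of the budget itself, assuming a 99% availability SLO and the request volume from the earlier example:

```python
# Error budget: how many failed requests a 99% availability SLO
# tolerates over the period. All figures are illustrative.
slo_target = 99.0        # availability SLO in percent (assumed)
total_requests = 10_000  # requests expected over the 28-day period

budget = total_requests * (100 - slo_target) / 100
failed_so_far = 400      # failures from the earlier 96% example

print(f"Error budget: {budget:.0f} failed requests")  # -> 100
print(f"Remaining: {budget - failed_so_far:.0f}")     # -> -300, budget blown
```

At the current 96% success rate, a 99% SLO would be breached immediately, which is exactly the trade-off described above.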
The period your SLOs cover is also an important part of the agreement. A short time period will give you a limited error budget. But it can allow you to make multiple, fast corrections, so it might be a good choice for a new product that is experimenting with features.
A longer period won’t give you that fast feedback, but it does give you the scope for more extended pieces of work. So it might be more suitable for established products that need some extensive re-architecture.
The typical period that SRE practice recommends is 28 days, which is what I’m using in my examples. It’s roughly a calendar month, and it provides scope for a reasonable amount of work. It’s also a constant period: four weeks, including four weekends, whereas an actual calendar month is a variable number of days, sometimes with four weekends and sometimes with five.
It’s difficult to get consistent monitoring over a variable period, especially if your services see a boost in traffic over the weekends. So try and stick with 28 days unless you have a good reason to change it.
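To see how the period length changes the budget in time terms, here’s a quick sketch comparing the downtime allowance for a hypothetical 99.9% availability SLO over different windows:

```python
# Downtime allowance for a 99.9% availability SLO (assumed target)
# over different period lengths, in minutes.
slo = 99.9

for days in (7, 28, 90):
    total_minutes = days * 24 * 60
    allowed = total_minutes * (100 - slo) / 100
    print(f"{days:>2}-day period: {allowed:.0f} minutes of downtime allowed")
# -> 7 days: 10 min, 28 days: 40 min, 90 days: 130 min
```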
Hit the Follow button to see more content on Site Reliability Engineering and DevOps.