Site Reliability Engineering: Which metrics help you measure SLIs?
SRE recommends a baseline set of metrics to monitor called the four golden signals.
Monitoring is the automated collection of data from your systems. It feeds your SLIs (Service Level Indicators) and tells you whether your SLOs (Service Level Objectives) are on track. As a starting point, SRE recommends a baseline set of metrics to monitor, called the four golden signals:
- Latency,
- Traffic,
- Errors, and
- Saturation.
These give you an overall gauge of system health, and they’re worth putting in place even if they don’t map directly to your SLOs. You can build out the systems to collect, store, and visualize metrics using just the four golden signals while you design your own SLOs and SLIs in parallel.
Latency
Latency is how long it takes for a job to be processed or for a request to get a response. You need to be explicit here about whether you want to measure all responses or just successful ones. Latency for a website should tell you how quickly users are getting responses back to their browsers.
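To make that concrete, here’s a minimal sketch of recording request latency as a histogram using the Node.js prom-client library; the metric name, route, and buckets are illustrative, and keeping the status code as a label is one way to separate successful responses from failures later:

```typescript
import { Histogram } from 'prom-client';

// Request latency in seconds, with buckets around the response times we
// care about (the name and buckets are just examples).
const requestLatency = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Time taken to generate and return a response',
  labelNames: ['route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

// Inside a request handler: time the work and record the outcome. Here we
// record all responses and keep the status code as a label, so successful
// and failed requests can be separated at query time.
async function handleHomePage(): Promise<void> {
  const stopTimer = requestLatency.startTimer({ route: '/' });
  try {
    // ... generate the response ...
    stopTimer({ status: '200' });
  } catch (err) {
    stopTimer({ status: '500' });
    throw err;
  }
}
```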
Traffic
Traffic is the amount of load your system is responding to. That can be the size of a message queue for a job processing system or it could be the number of HTTP requests per second for a website or a REST API. Traffic tells you how hard your system is working.
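As a rough sketch, queue-style traffic fits a gauge that you sample on a schedule, while request-style traffic is usually a counter turned into a per-second rate at query time; the metric name and the getQueueDepth helper below are hypothetical:

```typescript
import { Gauge } from 'prom-client';

// Traffic for a job-processing system: how deep is the work queue?
const queueDepth = new Gauge({
  name: 'job_queue_depth',
  help: 'Number of jobs currently waiting to be processed',
});

// Hypothetical helper - in practice this would ask your message broker.
async function getQueueDepth(): Promise<number> {
  return 0;
}

// Sample the queue size periodically; the monitoring system reads the
// latest value each time it collects metrics.
setInterval(async () => {
  queueDepth.set(await getQueueDepth());
}, 15_000);

// For a website or REST API, traffic is usually a counter of requests that
// you turn into requests per second with a rate-style query in your
// monitoring system.
```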
Errors
Errors are responses that either fail or are incorrect. For web applications, the simplest way to measure failures might be to scrape the web server logs you already collect: a simple count of HTTP status codes will give you an indicator of failures. Correctness is harder to measure and will need custom metrics in your apps or specific test cases. Errors tell you whether your system is behaving correctly.
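Here’s a minimal sketch of that log-scraping approach, assuming a common/combined access log format where the status code follows the quoted request line; the log path is just an example:

```typescript
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

// Matches the status code that follows the quoted request, e.g.:
//   1.2.3.4 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 500 1234
const statusPattern = /" (\d{3}) /;

async function countStatuses(logPath: string): Promise<void> {
  const counts = new Map<string, number>();
  const lines = createInterface({ input: createReadStream(logPath) });

  for await (const line of lines) {
    const match = statusPattern.exec(line);
    if (!match) continue;
    counts.set(match[1], (counts.get(match[1]) ?? 0) + 1);
  }

  // A simple failure indicator: the ratio of 5xx responses to all responses.
  const total = [...counts.values()].reduce((a, b) => a + b, 0);
  const errors = [...counts.entries()]
    .filter(([status]) => status.startsWith('5'))
    .reduce((sum, [, count]) => sum + count, 0);
  console.log(`error ratio: ${total ? (errors / total).toFixed(4) : 'n/a'}`);
}

countStatuses('/var/log/nginx/access.log').catch(console.error);
```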
Saturation
Saturation is a measure of how much capacity the system is using, and it will cover different metrics like memory allocation, CPU usage, and network load. Saturation tells you how close your system is to being full.
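A simple sketch of what that can look like, using Node’s os module and illustrative metric names (a Prometheus client library can also export a set of default process metrics for you):

```typescript
import * as os from 'node:os';
import { Gauge } from 'prom-client';

// How much of the machine's capacity is in use right now?
const cpuLoad = new Gauge({
  name: 'node_load_average_1m',
  help: '1-minute load average',
});
const memoryUsedRatio = new Gauge({
  name: 'node_memory_used_ratio',
  help: 'Fraction of system memory in use',
});

// Sample on an interval; the monitoring system picks up the latest values.
setInterval(() => {
  cpuLoad.set(os.loadavg()[0]);
  memoryUsedRatio.set((os.totalmem() - os.freemem()) / os.totalmem());
}, 15_000);
```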
Those four signals will give you a good overall view of your system health. And if you can also track them for any services your app depends on, then you’ve got a good starting point for monitoring. But your own SLIs are the real measures you need to capture because they’ll drive alerts when you’re failing to meet SLOs. It’s likely to take engineering effort from SREs to create those SLIs, and the more useful the data is, the more effort it’s going to take.
Let’s look at the example of latency for a web application. You might have a user-focused SLO that says the home page should return in under half a second for 95% of requests, and there are lots of different ways you could build an SLI to capture the time taken for the server to generate and return the response. But that’s not the same time the user experiences, because the response has to pass through load balancers, CDNs, and the whole internet before it reaches them.
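Whichever way you collect the server-side timings, if they end up in a histogram in a monitoring system like Prometheus (more on that below), you can check the 95th percentile against the half-second target. This sketch queries the Prometheus HTTP API; the Prometheus URL and metric name are assumptions:

```typescript
// Check a p95 latency SLI against the 0.5s target via the Prometheus HTTP API.
const PROMETHEUS = 'http://prometheus.example.internal:9090';
const QUERY =
  'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{route="/"}[5m])) by (le))';

async function checkLatencySlo(): Promise<void> {
  const url = `${PROMETHEUS}/api/v1/query?query=${encodeURIComponent(QUERY)}`;
  const response = await fetch(url);
  const body = (await response.json()) as any;

  // An instant query returns a vector; take the first sample's value.
  const sample = body?.data?.result?.[0];
  if (!sample) {
    console.log('no data for the latency SLI');
    return;
  }
  const p95Seconds = parseFloat(sample.value[1]);
  console.log(
    p95Seconds <= 0.5
      ? `p95 is ${p95Seconds}s - within the 0.5s target`
      : `p95 is ${p95Seconds}s - breaching the 0.5s target`,
  );
}

checkLatencySlo().catch(console.error);
```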
You could use synthetic metrics instead with an external system that sends in fake user requests from outside your infrastructure and records the end-to-end response times. Services like Pingdom and StatusCake do this very well. They can be configured for different regions, and they show you a much bigger picture, which includes network latency. But the synthetic response doesn’t include rendering time in the browser, which can be significant if your home page loads data asynchronously from other services with JavaScript.
In that case, the only way to get a genuine picture of end-user latency is to include monitoring in your JavaScript that records response and rendering times. This might need custom engineering effort, or you might get what you need from a service like Google Analytics, which records page latency together with traffic.
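A browser-side sketch might use the Performance API and report the timings to a collection endpoint; the /rum endpoint here is hypothetical and stands in for whatever feeds your monitoring system:

```typescript
// Capture response and rendering times in the browser and report them.
window.addEventListener('load', () => {
  const [nav] = performance.getEntriesByType(
    'navigation',
  ) as PerformanceNavigationTiming[];
  if (!nav) return;

  const timings = {
    page: window.location.pathname,
    ttfbMs: nav.responseStart - nav.requestStart,    // time to first byte
    responseMs: nav.responseEnd - nav.requestStart,  // full response download
    renderMs: nav.domComplete - nav.responseEnd,     // parsing and rendering
    totalMs: nav.loadEventStart - nav.startTime,     // what the user actually waited
  };

  // sendBeacon doesn't block the page and survives navigation away.
  navigator.sendBeacon('/rum', JSON.stringify(timings));
});
```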
Your choice of SLI implementation is going to be a balance between how good the data is and how expensive it is to collect that data. If you currently have nothing, the best guidance is to start simple and improve over time.
Latency from server logs will give you a reasonable measure, and you can look to build out browser-level monitoring later on. Where you put all that data is usually a much easier decision.
Modern monitoring systems tend to use a pull model to collect metrics from your application components and use a time series database to store the metrics. Every metric is stored with the timestamp when it was collected, and that lets you query for specific points in time or for trends over a period.
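In practice the pull model means each component exposes its current metric values over an HTTP endpoint for the monitoring system to scrape on a schedule; here’s a minimal sketch using the Node.js prom-client library (the port is arbitrary):

```typescript
import { createServer } from 'node:http';
import { register, collectDefaultMetrics } from 'prom-client';

// Export a standard set of process metrics (CPU, memory, event loop lag).
collectDefaultMetrics();

// Expose everything on /metrics; the monitoring system scrapes this URL on
// a schedule and stores each sample with a timestamp.
createServer(async (req, res) => {
  if (req.url === '/metrics') {
    res.setHeader('Content-Type', register.contentType);
    res.end(await register.metrics());
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(9400);
```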
Standards are emerging for how metrics should be represented, but the most common format comes from the open-source project Prometheus, which handles both collection and storage. The Google SRE books use Prometheus for their examples, and it’s a large, mature project. It’s part of the Cloud Native Computing Foundation, the same foundation that stewards Kubernetes and containerd, the container runtime donated by Docker.
Prometheus records metrics with labels, which is a way of representing the same data at different levels of detail. You can have a metric to record the count of HTTP requests and include labels for different response status codes. Prometheus then lets you query with or without label values, so you can get a count of all responses for a period and a count of all error responses from the same metric. That makes it easy to record and analyze your metrics without amassing hundreds of different data points.
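For example, a single counter with a status label can answer both questions; the metric name is illustrative and the PromQL queries are shown as comments:

```typescript
import { Counter } from 'prom-client';

// One metric, with a label for the response status code.
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Count of HTTP responses, labelled by status code',
  labelNames: ['status'],
});

// In the request handling path:
httpRequests.labels('200').inc();
httpRequests.labels('500').inc();

// At query time you can use the same metric with or without the label:
//   sum(rate(http_requests_total[5m]))                 -> all responses
//   sum(rate(http_requests_total{status=~"5.."}[5m]))  -> error responses only
```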
The other big advantage is that Prometheus has native support for percentiles, so it’s very simple to record response times or job durations and then query them using functions in the Prometheus Query Language, PromQL, to get whichever percentile values you want to see. This data is all recorded as a time series too, so you can query the rate of change over time.
You’re going to use your monitoring system to drive two things: alerts and dashboards. You should build dashboards as visualizations of your SLIs, together with the four golden signals and some textual information about your software.
That text data should capture release versions and configuration timestamps so you can compare the metrics before and after a change. The four golden signals will tell you the basic system health, and the SLIs will tell you if you’re breaching or about to breach your targets.
The goal is to tell at a glance if something is wrong and hopefully get a head start on the analysis. So if your latency SLO is about to be breached and you can see that memory use has spiked since a recent release, that tells you maybe the new release has a memory leak that is slowing down response times enough to impact the user experience.
Dashboards are very useful for seeing the current status and recent trends, but the team won’t be sitting watching the dashboard all day long, so alerting is the other function of monitoring. SRE is quite clear on what type of alerts should come from the monitoring system: a page if a human needs to act now, or a ticket if a human needs to act within a few days.