Site Reliability Engineering: Setting up the right Monitoring System
You need to know if something is going on with your application that affects the end‑user experience as soon as possible.
Monitoring is a foundational capability of a Site Reliability Engineering (SRE) team. You need to know as soon as possible when something in your application is affecting the end-user experience, and your monitoring should also help you identify the root cause quickly. Here are the primary functions of monitoring.
· It provides visibility into service health.
· It lets you create alerts based on a custom threshold.
· You can use monitoring for trend analysis and capacity planning.
· It provides detailed insights into various subsystems that make up your application or service.
· It provides code-level metrics to understand behavior. Not all monitoring systems can provide this, but it remains one of the critical functions that a monitoring system should provide.
· It provides visualization and reports, which make the analysis of monitoring data effective and help you share what you found with others.
The question is… where do we tap to get the necessary data for an effective monitoring system?
Raw logs are the logs coming out of your application, a server, or your network devices. They are generally unstructured. Structured event logs are another source; a classic example would be Windows Event Viewer logs.
Metrics are the gold standard for monitoring. A metric is a numerical measurement of a component.
When you’re choosing a monitoring system for your SRE team, there are four questions you can ask to determine the right tool.
1. Speed: speed refers to both how fast the data gets into the monitoring system and how fast you can retrieve the data from the monitoring system.
2. Resolution: this refers to the granularity of the data.
3. Alerting: the alerting capabilities that the monitoring tool can provide.
4. Interface: how rich and versatile the user interface is.
Now let’s take a detailed look at each of these questions.
Speed: how fresh should the data be? The fresher the data, the better. You don’t want to be looking at data that is 2 hours old; you want the data to be as close to real time as possible. Ingesting data and alerting on it in real time, however, can be expensive. You may have to invest in platforms like Sumologic, Splunk, or InfluxDB to implement this. Consider your Service Level Objectives (SLOs) to determine how fast the monitoring system should be.
For instance, if your SLO is 2 hours, you do not have to invest in systems that process machine data in real time. Retrieval speed matters too, because querying vast amounts of data can be slow. Again, you may have to invest in platforms like Sumologic, Splunk, InfluxDB, or Elasticsearch if you need very fast retrieval of data.
Resolution: Resolution refers to the granularity of your monitoring data. Ask yourself: do you really need to record data every second? My recommendation is to use aggregation wherever possible. You can also use sampling if it makes sense for your data. Metrics are better suited to high-resolution monitoring than raw log files.
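As a rough illustration of aggregation and sampling, the sketch below collapses per-second readings into one summary per minute and, separately, keeps only every n-th reading. The function names and the (timestamp, value) input shape are assumptions made for this example, not part of any particular monitoring product.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_minute(samples):
    """Collapse (unix_timestamp, value) samples into one summary per minute."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Round the timestamp down to the start of its minute.
        minute_start = int(timestamp) // 60 * 60
        buckets[minute_start].append(value)
    return {
        minute: {"avg": mean(values), "max": max(values), "count": len(values)}
        for minute, values in sorted(buckets.items())
    }


def sample_every(samples, n):
    """Keep every n-th sample (simple systematic sampling)."""
    return samples[::n]
```

Keeping the maximum alongside the average is a deliberate choice here: an average alone can hide a short spike that a per-second series would have shown.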
Alerting: When it comes to alerting, ensure that the monitoring tool can integrate with many third‑party tools. For instance, can your monitoring system page someone using integrations such as PagerDuty or VictorOps? As another example, can your monitoring system integrate with ServiceNow to open a ticket? Your monitoring system should have the capability to classify the alerts based on their severity.
For instance, you may want to choose a severity level of 3 for a slow application versus a severity level of 1 for an application that is not available. When it comes to alerting, a flood of emails or pages can be very destructive to the on-call experience. You must have a way of suppressing alerts.
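To make severity classification and suppression concrete, here is a minimal sketch. The latency threshold, the five-minute window, and the names classify and AlertSuppressor are all assumptions for illustration; a real monitoring tool would expose its own settings for both.

```python
def classify(available, latency_ms, slow_threshold_ms=2000):
    """Return a severity level, or None when no alert is needed."""
    if not available:
        return 1  # application is down
    if latency_ms > slow_threshold_ms:
        return 3  # application is up but slow
    return None


class AlertSuppressor:
    """Drop repeats of the same alert inside a fixed time window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._last_sent = {}  # alert key -> time of last notification

    def should_notify(self, alert_key, now):
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.window_seconds:
            return False  # still inside the suppression window
        self._last_sent[alert_key] = now
        return True
```

Passing the current time in as an argument, rather than reading the clock inside the class, keeps the suppression logic easy to test.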
Interface: Does your monitoring tool provide rich visualization tools? Can it show time-series data as well as custom charts? Can the results be shared? This is important because you may want to share what you found with business executives, leadership, and so on. As a bonus, can it be managed using code? This is a big one. You don’t want to be a full-time monitoring administrator, so you need to be able to manage your monitoring system via code.
Monitoring the hard resources
When it comes to monitoring the actual hard resources, here are my recommendations. You must monitor the CPU. In some cases, CPU utilization can indicate an underlying problem.
Memory needs to be monitored. Here, memory refers to both application memory and system memory. Application memory, for instance, could be the Java heap size in a Java application.
Disk I/O. Many applications are heavily I/O dependent, so it’s important to monitor the disk I/O.
Disk volume. I have encountered many outages because the file system became full. You need to monitor the sizes of all your file systems.
Network bandwidth. This is something few organizations monitor, at least from the SRE perspective, but in my experience, this can be a game-changer. You have to monitor the network bandwidth utilized from your application’s perspective.
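As one small example from the list above, a file-system usage check can be sketched with Python's standard library. The 90 percent threshold and the function names are assumptions; in practice a monitoring agent would normally collect this for you.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    return 100.0 * used_bytes / total_bytes if total_bytes else 0.0


def full_filesystems(mount_points, threshold_percent=90.0):
    """Return (path, percent_used) for every path at or over the threshold."""
    over = []
    for path in mount_points:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        pct = percent_used(usage.total, usage.used)
        if pct >= threshold_percent:
            over.append((path, round(pct, 1)))
    return over
```

Calling full_filesystems(["/", "/var"]) on a Linux host, for example, would return only the mount points that have crossed the threshold.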
Effective Monitoring best practices
These are the best practices that you can adopt for an effective monitoring system in your SRE team.
Configuration as code. You want to be able to manage your monitoring system via code. One of the biggest advantages of this approach is that when you need to extend your monitoring to a new system or a new set of servers, for instance, it can be done really quickly.
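A minimal sketch of the idea, assuming a hypothetical rule format: alert definitions are generated from a list of hosts, so covering a new set of servers means adding names to the list. The field names, metric names, and thresholds below are invented for illustration and do not match any specific tool's schema.

```python
import json


def build_alert_rules(hosts, cpu_threshold=90, disk_threshold=90):
    """Generate the same set of alert rules for every host in the list."""
    rules = []
    for host in hosts:
        rules.append({
            "name": f"{host}-cpu-high",
            "host": host,
            "metric": "cpu.utilization",  # hypothetical metric name
            "threshold_percent": cpu_threshold,
            "severity": 3,
        })
        rules.append({
            "name": f"{host}-disk-full",
            "host": host,
            "metric": "disk.used",  # hypothetical metric name
            "threshold_percent": disk_threshold,
            "severity": 1,
        })
    return rules


if __name__ == "__main__":
    # The JSON output could be committed to version control or pushed
    # to the monitoring tool through whatever API it offers.
    print(json.dumps(build_alert_rules(["web-01", "web-02"]), indent=2))
```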
Unified dashboards. In many organizations I have worked with, dashboards are out of control, and this does not help. You want to converge on a unified dashboard pattern that enables you to reuse dashboards.
Finally, consistency. Whatever monitoring tool you use, the components that you create within the monitoring tool should follow a consistent naming pattern.
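One way to enforce a naming pattern is a small check that runs before changes are applied. The team-service-signal convention below is only an assumed example; the point is that whichever pattern you choose can be verified automatically.

```python
import re

# Assumed convention: <team>-<service>-<signal>, lowercase letters and digits.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+){2}$")


def nonconforming_names(names):
    """Return the names that do not follow the agreed pattern."""
    return [name for name in names if not NAME_PATTERN.match(name)]
```

For instance, nonconforming_names(["payments-checkout-latency", "Checkout Latency"]) would flag only the second name.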