Site Reliability Engineering: What Is Toil?


You know that feeling when a constant stream of work keeps coming in and you keep powering through it, but it feels like you never get anything done? That type of work is toil, and SRE has explicit practices and guidelines to help keep it to a minimum.

Toil has a negative impact on people and systems, and SRE gives you tools to address it. We call it all work, but, of course, there are different types of work. Some tasks are necessary admin functions, like:

· Going to meetings,

· Filling in time sheets,

· Writing down what you’ve done this week.

That’s all overhead work. It’s not specific to your role, but it’s needed for the smooth running of the team. Hopefully, you’ve optimized that time to the point where it’s a fairly static and known amount.

There’s reactive work, where you have to do something in response to some trigger, like:

· Answering emails or instant messages,

· Working on tickets for non‑urgent incidents, or

· Dealing with pages for urgent issues.

Often, this reactive work is a tactical quick fix or a workaround for a problem, and it tends to be repetitive. Being reactive and repetitive are clear indicators of toil.

There’s strategic work, where you have time to invest in improvements. Maybe that’s architecture, or using a new Infrastructure as Code tool to deploy resources, or replacing your alert management system with something that groups alerts intelligently. That’s more likely to be engineering work, which makes for a long‑term improvement of the system.

Reactive Toil and Strategic Engineering

What we’re interested in is the split between reactive toil and strategic engineering. The problem with a traditional ops team is that toil tends to expand to take up pretty much all of their time, so there’s nothing left for strategic investment. Everyone is constantly reacting, which means the systems don’t improve, which means there’s even more toil coming down the line.

SRE explicitly addresses this with a guarantee that engineers will spend no more than 50% of their time on toil. That’s part of the job description. It’s the fundamental difference between SRE and traditional ops. The toil won’t expand, and you’ll always have time for high‑value strategic work.
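As a rough illustration of how that 50% limit could be checked, the sketch below totals a week of logged hours by category and reports the share spent on toil. The `WorkEntry` type, the category labels and the sample hours are hypothetical, and treating all reactive work as toil is a simplification. Only the 50% figure comes from the text above.

```python
from dataclasses import dataclass

TOIL_CAP = 0.50  # the 50% limit described above


@dataclass
class WorkEntry:
    category: str  # hypothetical labels: "overhead", "reactive", "strategic"
    hours: float


def toil_fraction(entries, toil_categories=("reactive",)):
    """Return the share of all logged hours that fall in the toil categories."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.category in toil_categories)
    return toil / total


def over_toil_cap(entries, cap=TOIL_CAP):
    """True when the toil share exceeds the cap."""
    return toil_fraction(entries) > cap


# Made-up sample week: 22 of 40 logged hours are reactive work.
week = [
    WorkEntry("overhead", 4),
    WorkEntry("reactive", 22),
    WorkEntry("strategic", 14),
]

if __name__ == "__main__":
    share = toil_fraction(week)
    print(f"toil share: {share:.0%}, over cap: {over_toil_cap(week)}")
```

In this made-up week the engineer is over the limit, which is the signal to push some of the reactive work back or to invest in removing it.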

We’ll look at how to make that happen in my next article, but first, let’s be clear about what toil actually is. We’re talking about labor‑intensive repetitive tasks. These sorts of things can be automated because they follow a strict pattern or because they don’t require human experience or understanding.

So, imagine a recurring production problem. Maybe you manage an internal system that has a memory leak, and periodically it stops responding. When that happens, users raise a ticket, and an ops person picks it up, logs in to the server, and restarts the server process. That’s manual, repetitive, and automatable, so it’s toil. And this example has the other traits of toil too:

· It’s reactive,

· It doesn’t improve the system,

· It’s low value, and

· It scales with the system.

So, as you add more servers, the frequency of logging in to reset processes also increases.
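To make the scaling point concrete, here is a back-of-the-envelope sketch. The hang rate and the minutes per manual restart are made-up numbers chosen only for illustration. The point is that the weekly cost grows in direct proportion to the server count.

```python
def weekly_restart_minutes(servers, hangs_per_server_per_week=2, minutes_per_restart=15):
    """Estimated minutes per week spent on manual restarts (illustrative numbers)."""
    return servers * hangs_per_server_per_week * minutes_per_restart


if __name__ == "__main__":
    for servers in (5, 50, 500):
        hours = weekly_restart_minutes(servers) / 60
        print(f"{servers} servers -> {hours:.1f} hours of restarts per week")
```

With these assumed rates, ten times the servers means ten times the hours spent on restarts, and none of that time improves the system.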

Classifying toil can be subjective because people might see value differently or they might have very different ideas of what is automatable. But these are the common indicators that you’re dealing with toil. A good rule of thumb: if you’re sending in a human to do the work of a computer, it’s toil.

Now, let’s compare toil to the engineering work that is part of the SRE role. This strategic, planned work aims at improving the system overall, and it requires human judgment. It covers tasks like designing for scale or building tools, which are outside the capabilities of computers at the moment.

You can break that down further and look at two types of engineering that are part of SRE:

· Systems Engineering, which includes configuration and tuning for operating systems and application platforms.

· Software Engineering, which includes working on the application code: not product features, but things like reliability and scalability features.

In the example with a memory leak, an SRE could have a few options to remove the toil of resetting that server process.

They could choose to run the app in a container and add a health check, so the container platform (Docker or Kubernetes) would restart the app automatically when it stopped responding. That’s a systems engineering task, which configures how the app runs in production. It doesn’t fix the problem, but it removes the toil around it, and it could provide other improvements around scalability and portability.
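The idea of restarting on a failed health check can be sketched in a few lines of Python. This is not how Docker or Kubernetes implement it, just a toy model of the behaviour: probe the app, count consecutive failures, and restart once a threshold is crossed. The `supervise` function, the `failure_threshold` parameter and the `LeakyApp` stand-in are all hypothetical names introduced for illustration.

```python
def supervise(probe, restart, checks, failure_threshold=3):
    """Run `checks` health probes; restart after `failure_threshold` consecutive failures.

    probe   -- callable returning True when the app responds (hypothetical)
    restart -- callable that restarts the app (hypothetical)
    Returns the number of restarts performed.
    """
    consecutive_failures = 0
    restarts = 0
    for _ in range(checks):
        if probe():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                restart()
                restarts += 1
                consecutive_failures = 0
    return restarts


class LeakyApp:
    """Toy stand-in for the leaking app: it stops responding after a fixed number of probes."""

    def __init__(self, probes_until_hang=5):
        self.probes_until_hang = probes_until_hang
        self.probes_since_restart = 0

    def is_healthy(self):
        self.probes_since_restart += 1
        return self.probes_since_restart <= self.probes_until_hang

    def restart(self):
        # Restarting clears the leaked memory, so the app responds again for a while.
        self.probes_since_restart = 0


if __name__ == "__main__":
    app = LeakyApp()
    count = supervise(app.is_healthy, app.restart, checks=20)
    print(f"restarts without a human logging in: {count}")
```

A real platform runs its probes on a timer rather than a fixed count, but the shape is the same: the leak is still there, and the restart simply no longer needs a ticket or a person.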

Less toil means more engineering time, so the SRE could then get to work on the app itself and track down all the memory leaks to permanently fix the issue. That’s a software engineering task, and although it might be the right approach, it could be a much larger task than just moving to Docker and adding the health check. It could also mean fundamental changes to the application, and that comes with a lot of risk. But either of the engineering options improves the system overall, and it’s high‑value work.

Once it’s done, a piece of toil goes away, freeing up more time for engineering. This is one of the major benefits of SRE.


