How to Effectively Identify and Measure Toil as a Site Reliability Engineer
The outcome must be worth the investment
Identifying and measuring toil is the first step to addressing it effectively. Data is your friend here because you need to put together a case for removing the toil.
“We should automate this because it’s boring” isn’t really a good justification, especially if the boring task takes 5 minutes once a week or once a month and the automation would take dozens of hours of engineering.
What you’re aiming for is a backlog of toil‑fixing projects, which are estimated and prioritized in the same way that the product development team has their prioritized backlog for features.
SREs can pick up a small project when they’re not on call, or larger projects might be important enough that the whole SRE team works on them for an extended period. That prioritized list needs to be data-driven to give you a realistic cost‑benefit analysis.
How much time are we currently spending on the toil versus how much time do we need to invest to remove the toil?
You can use your existing tools to help get that data. You don’t want to be manually recording how much time you spend on toil because that’s going to add to the toil.
Reporting on the time between opening and closing a ticket is a simple and reasonable place to start. It will tell you how much effort is involved for tasks that are done from start to finish in one go. It will also give you the elapsed time for tasks that are stop‑start because they involve multiple steps or communication back to the user or to other teams.
Effort is a more useful metric, but elapsed time is a reasonable proxy if you can’t get more detail. It gives you an idea of the complexity of the task, and tickets that are open for a long time mean more work in progress and more context switching.
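As a minimal sketch of that kind of report, assume your ticketing system can export tickets with a category plus opened and closed timestamps. The field names and sample records below are hypothetical and not taken from any particular tool. Totalling elapsed time per category might look like this:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical export from a ticketing system; field names are assumptions.
tickets = [
    {"category": "onboarding", "opened": "2024-03-01T09:00", "closed": "2024-03-01T17:00"},
    {"category": "onboarding", "opened": "2024-03-04T09:00", "closed": "2024-03-05T09:00"},
    {"category": "cert-renewal", "opened": "2024-03-02T10:00", "closed": "2024-03-02T10:30"},
]

def elapsed_hours_by_category(tickets):
    """Sum elapsed hours (closed minus opened) and count tickets per category."""
    totals = defaultdict(lambda: {"count": 0, "hours": 0.0})
    for t in tickets:
        opened = datetime.fromisoformat(t["opened"])
        closed = datetime.fromisoformat(t["closed"])
        entry = totals[t["category"]]
        entry["count"] += 1
        entry["hours"] += (closed - opened).total_seconds() / 3600
    return dict(totals)

report = elapsed_hours_by_category(tickets)
# Print the categories with the most elapsed time first.
for category, stats in sorted(report.items(), key=lambda kv: kv[1]["hours"], reverse=True):
    print(f"{category}: {stats['count']} tickets, {stats['hours']:.1f} elapsed hours")
```

Keep in mind that this measures wall-clock time, including any waiting on other people, so it is a proxy for effort rather than a measurement of it.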
Let’s look at an example. Onboarding is a good one. When a new developer joins the product team, they need various accounts and permissions set up, and that might be an SRE job if it includes access to systems in production. Before optimization, that’s going to mean:
· Logging into various systems,
· Setting up authorization rules, and
· Assigning users to roles.
It may also mean going back and forth with the dev team to ask what permissions are needed, sending out status updates, or replying to chase emails.
Fairly obviously toil, but is it worth trying to fully automate it?
That’s going to be a project to build a workflow that can call into the APIs for all these systems to set up access. And that workflow needs to run in a platform that can interface with the ticketing system and the email system for notifications.
Potentially, that’s weeks of engineering effort writing hundreds of lines of code, and the outcome is two things:
1. An automated onboarding process with no toil, and
2. A new software project because automation engineering is still engineering.
Now you have a custom onboarding workflow with hundreds of lines of code, which needs to be supported, maintained, and updated any time the systems or the access rules change.
So, is it worth building that workflow?
Well, that depends on how much time gets spent on the toil in the first place. If you can identify that onboarding a user manually costs 1 day of SRE time and you have a project team with 12 new hires a year (i.e., about 1 new hire per month), it may not be worth taking on a 30‑day project to automate that job. At 12 days of toil a year, it’s going to take you about two and a half years just to break even, and that’s assuming you don’t need to spend more time tweaking the workflow.
But if your product dev team is a place where interns start and you have 10 new hires a month, you’re spending 10 days of SRE time on toil every month, and the automated workflow is going to be a good investment. Even if you only have 10 hires a year, the automation is still justified if you can build a generic workflow that also fits 5 other SRE teams who each have 10 hires a year.
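The arithmetic behind those three scenarios can be sketched as a simple payback calculation. The function name and parameters are illustrative. It deliberately ignores the ongoing cost of maintaining the automation, so the result is a lower bound on the real payback time.

```python
def payback_years(project_days, toil_days_per_task, tasks_per_year, teams=1):
    """Years until the toil time saved equals the time invested in automation.

    Ignores maintenance of the automation itself, so treat it as a lower bound.
    """
    toil_days_per_year = toil_days_per_task * tasks_per_year * teams
    return project_days / toil_days_per_year

# Figures from the onboarding example: 1 day of toil per hire, 30-day project.
print(payback_years(30, 1, 12))           # 12 hires a year
print(payback_years(30, 1, 120))          # 10 hires a month
print(payback_years(30, 1, 10, teams=6))  # 10 hires a year across 6 teams
```

With these inputs, the three cases come out at 2.5 years, 0.25 years (three months), and 0.5 years (six months).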
Now, days aren’t the only cost metric. If you have toil that involves repairing systems to bring them back online, you may be able to put a dollar figure on the outage. The same goes for toil that forces you to overprovision capacity to allow for maintenance. Whatever data is easiest for you to collect and represent as the cost of toil helps you build that cost‑benefit analysis.
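If you do want a single dollar figure, one possible approach is to add the different cost sources together. This is only a sketch: the function, its parameters, and every number in the example call are made-up placeholders, not real data.

```python
def annual_toil_cost(engineer_days=0.0, day_rate=0.0,
                     outage_hours=0.0, outage_cost_per_hour=0.0,
                     spare_capacity_cost=0.0):
    """Combine engineer time, outage impact and overprovisioning into one yearly dollar figure."""
    return (engineer_days * day_rate
            + outage_hours * outage_cost_per_hour
            + spare_capacity_cost)

# All inputs below are made-up placeholders; substitute your own data.
cost = annual_toil_cost(engineer_days=20, day_rate=800,
                        outage_hours=3, outage_cost_per_hour=5000,
                        spare_capacity_cost=12000)
print(f"Estimated annual cost of toil: ${cost:,.0f}")
```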
When you have that understanding of what your toil costs you, you can make an informed decision about how to address it.