The No-Nonsense Guide to Runbook Best Practices
Introduction
Runbooks are a key part of incident management and preserve institutional knowledge. They can be used both for incident response and for routine tasks such as database maintenance or generating a complex report. This guide focuses mostly on incident response runbooks.
- Introduction
- Best Practices
- Conclusion
- FAQs
Best Practices
1. Runbook Structure
- Establish a standard format that will be used across your organization. This will ensure consistency and help on-call folks to quickly figure out the steps even for runbooks they may not have seen before. It will also help in editing and maintaining the runbooks.
- Get buy-in from your team on the chosen format. Without buy-in, people might not want to maintain or use the runbooks.
- Create the runbooks as decision trees. You don't need a visual guide here but include it if it's easy to create. Don't have too many branches in the tree - that will cause confusion. If you find yourself adding too many branches, think about splitting up the runbook into more than one.
- Each runbook should have a single purpose.
A simple, well-understood, and agreed-upon format will greatly increase the effectiveness and adoption of runbooks across your teams.
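One way to keep a decision-tree runbook in check is to store the tree as data and test it for too many branches. The sketch below is only an illustration: the `Step` class, the `MAX_BRANCHES` threshold of 3, and the high-CPU example are assumptions made for this example, not part of any standard runbook format.

```python
from dataclasses import dataclass, field

MAX_BRANCHES = 3  # assumed threshold; choose whatever suits your team


@dataclass
class Step:
    """One node in a runbook decision tree."""
    instruction: str
    # Maps an observed outcome (e.g. "yes" / "no") to the next step.
    branches: dict[str, "Step"] = field(default_factory=dict)


def overloaded_steps(step: Step, limit: int = MAX_BRANCHES) -> list[str]:
    """Return the instructions of all steps with more than `limit` branches."""
    found = [step.instruction] if len(step.branches) > limit else []
    for child in step.branches.values():
        found.extend(overloaded_steps(child, limit))
    return found


def walk(step: Step, answers: list[str]) -> list[str]:
    """Follow the tree along a sequence of outcomes; return the instructions visited."""
    path = [step.instruction]
    for answer in answers:
        step = step.branches[answer]
        path.append(step.instruction)
    return path


# Hypothetical single-purpose runbook for a "high CPU" alert.
high_cpu = Step(
    "Is CPU above 90% on more than one host?",
    {
        "yes": Step("Check the latest deploy; roll back if it coincides with the spike."),
        "no": Step(
            "Is a single process using most of the CPU?",
            {
                "yes": Step("Capture a profile, then restart the process."),
                "no": Step("Escalate to the service owner."),
            },
        ),
    },
)
```

Calling `walk(high_cpu, ["no", "yes"])` returns the three instructions an engineer would see on that path, and a non-empty result from `overloaded_steps(high_cpu)` is a hint that the runbook may need to be split.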
2. Runbook Content
- Runbooks should be actionable. There should be clear-cut instructions on what to do and how to interpret the results.
- Runbooks should be clear and concise. Trim any verbiage and unnecessary explanations from your runbooks. An incident is a high-pressure situation and you want only those details in the runbook that can help to diagnose and mitigate the problem.
- You can add an architecture diagram if it will help but keep it as simple as possible. Note that when the runbook is updated it's easy to miss updating images.
- Link to appropriate dashboards.
- Watch out for the curse of knowledge while writing a runbook. Users of your runbook may not be aware of details and assumptions that you make in the runbook. A good test of this is to have a new team member use the runbook in a mock incident response exercise.
- It's OK for a runbook to have manual steps, but automate as much as you reasonably can.
- Add links to automation artifacts - running a script, clicking a button, running a command.
- Are there too many commands to be run in sequence without a gap? You can put them in a script, commit it to a repository, and link to the file in the repository instead of putting the commands in the runbook.
- Explicitly call out commands that can have side effects and change the system.
Writing a good runbook is a skill that comes with practice and experience. Test your runbooks as often as possible and you will end up improving them.
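As a rough illustration of moving a sequence of commands into a script while calling out the ones that change the system, here is a minimal sketch. The `Command` class, the `changes_system` flag, the `run_steps` helper, and the example commands (`df -h`, `systemctl restart example-service`) are all hypothetical names chosen for this example.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable


@dataclass
class Command:
    argv: list[str]
    changes_system: bool = False  # True for anything that is not read-only


def confirm_on_terminal(step: Command) -> bool:
    """Ask the operator before running a command that has side effects."""
    reply = input(f"About to run '{' '.join(step.argv)}', which changes the system. Continue? [y/N] ")
    return reply.strip().lower() == "y"


def run_steps(steps: list[Command], confirm: Callable[[Command], bool]) -> list[str]:
    """Run steps in order, asking `confirm` before any step that changes the system.

    Returns one status line per step and stops at the first declined or failed step.
    """
    log = []
    for step in steps:
        label = " ".join(step.argv)
        if step.changes_system and not confirm(step):
            log.append(f"SKIPPED (not confirmed): {label}")
            break
        result = subprocess.run(step.argv, capture_output=True, text=True)
        if result.returncode != 0:
            log.append(f"FAILED ({result.returncode}): {label}")
            break
        log.append(f"OK: {label}")
    return log


# Hypothetical steps: one read-only check, then one command with side effects.
EXAMPLE_STEPS = [
    Command(["df", "-h"]),
    Command(["systemctl", "restart", "example-service"], changes_system=True),
]
```

A runbook would then link to this file in the repository and tell the engineer to run `run_steps(EXAMPLE_STEPS, confirm_on_terminal)`. Stopping at the first failure is a design choice made here so that later steps never run against a system in an unknown state.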
3. Updating and Maintaining Runbooks
- As part of your post-incident activities, go through the emails, chat logs, and tickets, and update your runbooks as needed. Nothing is worse than an outdated runbook that does not fulfill its purpose when the next similar incident occurs.
- A post-incident service fix might also lead to an update in a runbook. Coordinate with your team(s) to ensure this happens.
- During an incident, any inaccuracies in the runbook should be noted and then corrected in the post-incident phase.
Updating runbooks to keep them accurate needs continuous effort. Make it part of your incident management process so that it happens automatically.
4. Testing Your Runbooks
- Test your runbooks at least once before you declare them ready for use. Do this from a "clean" machine - one that does not have anything installed. This tests any assumptions about access control - e.g. a link to a dashboard that works from your machine only because you have access to it and are already logged in. You will also find out which CLI or other tools needed for the runbook steps are not installed. You can simplify this step significantly if you have a tools setup process for new members' laptops, but don't skip testing.
- Have newly onboarded folks try out the runbooks. Any missing context will surface.
- Carry out regular mock incident exercises.
Test your runbooks regularly so that you can confidently use them during a real incident.
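Part of the "clean machine" test can be approximated with a small preflight check that reports which tools a runbook needs but the current machine lacks. This is a minimal sketch: the `missing_tools` helper and the tool names in the usage note are assumptions for illustration, and it does not replace testing dashboard links and access by hand.

```python
import shutil


def missing_tools(required: list[str]) -> list[str]:
    """Return the required commands that cannot be found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]
```

For example, a runbook that relies on `kubectl` and `jq` could ask the engineer to run `missing_tools(["kubectl", "jq"])` first; an empty list means every listed tool was found.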
5. Locating Runbooks
- Runbooks should live in a central place accessible to everyone who is on call. It can be an internal wiki or your knowledge base software.
- Alerts should link directly to runbooks. For example, in Prometheus alerts you can add the runbook link as part of the description, so that on-call engineers can go to the runbook directly when they are paged.
- Improve findability via search:
  - Put the alert name at the top of the runbook. State clearly that the runbook is intended to be used when this alert fires.
  - Give the runbook itself a descriptive name, e.g. "runbook-cpu-usage-critical-alert".
  - Put keywords, e.g. the service name, in the runbook's title or description.
Your team should be able to find runbooks easily when needed. In addition to following links from alerts, on-call engineers might need to locate other runbooks to look at related issues. As with everything else, test this too.
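To check that every alert actually links to a runbook, you could lint your alert definitions. The sketch below assumes the alerting rules have already been parsed into Python dictionaries shaped like a Prometheus rule file (`groups`, `rules`, `alert`, `annotations`). The `runbook_url` annotation key is a common convention rather than something Prometheus requires, and the alert names and wiki URL are hypothetical.

```python
def alerts_without_runbook(rule_file: dict, key: str = "runbook_url") -> list[str]:
    """Return the names of alerting rules that have no runbook link annotation."""
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules have no alert name; skip them
            if not rule.get("annotations", {}).get(key):
                missing.append(rule["alert"])
    return missing


# Hypothetical rule file, as it might look after loading the YAML.
example_rules = {
    "groups": [
        {
            "name": "node",
            "rules": [
                {
                    "alert": "CpuUsageCritical",
                    "expr": "cpu_usage_percent > 90",
                    "annotations": {
                        "description": "CPU usage is critical.",
                        "runbook_url": "https://wiki.example.com/runbook-cpu-usage-critical-alert",
                    },
                },
                {
                    "alert": "DiskAlmostFull",
                    "expr": "disk_used_percent > 95",
                    "annotations": {"description": "Disk is almost full."},
                },
            ],
        }
    ]
}
```

Here `alerts_without_runbook(example_rules)` reports `DiskAlmostFull`, the one alert an on-call engineer could not follow to a runbook.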
6. Runbook Ownership
- If you follow the "You build it, you run it" philosophy, the runbooks for a service are owned by the team that owns the service.
- For infrastructure or common components, it's usually the SRE/Ops team.
Practice collective ownership of your runbooks. To avoid the problem of nobody taking ownership:
- Post-incident updates should be done by the on-call engineer and reviewed by others.
- Updates as part of mock incident exercises can be done in a rotating manner so that everyone can do it at least once.
Apart from these, there will be other updates as new information comes in or as new infrastructure or services are deployed.
7. What Not To Do
- Don't make your runbooks too generic.
- If there are too many steps in your runbook, split it into more than one runbook. If you cannot split it, this might indicate other issues - e.g. lack of observability into some part of your system.
- Do not have more than one runbook dealing with the same alert.
- Never store credentials in a runbook.
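Two of the points above lend themselves to automated checks: finding alerts that are covered by more than one runbook, and flagging lines that look like credentials. The sketch below is illustrative only. The regular expression is a rough heuristic and an assumption of this example, not an exhaustive secret scanner, and the runbook and alert names are hypothetical.

```python
import re
from collections import defaultdict

# Heuristic only: matches lines such as "password: ..." or "api_key=...".
SECRET_PATTERN = re.compile(
    r"(?i)\b(password|passwd|secret|api[_-]?key|token)\b\s*[:=]\s*\S+"
)


def lines_with_possible_secrets(text: str) -> list[int]:
    """Return 1-based line numbers that look like they contain a credential."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if SECRET_PATTERN.search(line)
    ]


def alerts_with_multiple_runbooks(runbook_alerts: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each alert that is claimed by more than one runbook to those runbooks."""
    owners = defaultdict(list)
    for runbook, alerts in runbook_alerts.items():
        for alert in alerts:
            owners[alert].append(runbook)
    return {alert: names for alert, names in owners.items() if len(names) > 1}


# Hypothetical inputs.
runbook_text = """Alert: CpuUsageCritical
Log in to the bastion host.
password: hunter2
Check the token expiry dashboard.
"""

runbook_alerts = {
    "runbook-cpu-usage-critical-alert": ["CpuUsageCritical"],
    "runbook-node-overload": ["CpuUsageCritical", "LoadHigh"],
}
```

With these inputs, the first check flags line 3 of the runbook text and the second reports that `CpuUsageCritical` is handled by both runbooks.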
8. Dealing With the Unexpected
1. What to do when there is no runbook for an incident or a situation?
If you are not the service owner, find somebody who is and involve them in the process. If it's a service you manage, use the tools at your disposal to diagnose the problem as best you can. Communicate this to your team. Note down whatever you did once you have diagnosed or mitigated the problem, and then create a runbook out of it.
Be careful about the impact of any command you run, as your system might have complex interactions with other systems.
2. The runbook steps are wrong, or they don't work, or I don't have access, or the link is broken
- If you detect that the steps are wrong, it's best not to run anything from the runbook unless you know the system inside out. This avoids causing further damage.
- If you are familiar with the system, go ahead and diagnose, and then update the runbook later.
- Otherwise, pull in somebody who knows the system and have them guide you, and update the runbook later.
Conclusion
Runbooks are not a replacement for a real human being investigating an incident. They provide a set of guidelines to diagnose, and if possible, mitigate the problem in a temporary manner. They can give you a clear path amidst confusion. In a high-stress situation like responding to an incident, runbooks can be a key part of your toolkit if they are written, maintained, and tested well.
FAQs
What is a runbook?
A documented set of procedures used for incident response and routine tasks to preserve institutional knowledge.
Should runbooks be automated?
Manual steps are OK, but automate where possible. For multiple sequential commands, put them in a script and link to it instead of listing the commands directly.
What should a good runbook include?
- Standardized format
- Clear decision-tree structure
- Single purpose
- Actionable steps
- Relevant links
- Simple architecture diagrams (if needed)
What should NOT be included in a runbook?
- Credentials
- Unnecessary or verbose explanations
- Too many decision branches
- Generic instructions
How often should runbooks be updated?
After incidents, service fixes, when inaccuracies are found, or when systems change.
How should runbooks be tested?
- On a clean machine
- By new team members as part of onboarding
- During mock incident exercises
- Verifying all links and permissions
- Verifying all tools required
Who should own the runbooks?
Service teams own service-specific runbooks. SRE/Ops teams own infrastructure and common runbooks.
Where should runbooks be stored?
Runbooks should be stored in a central location (wiki/knowledge base) accessible to all on-call staff and linked from relevant alerts.
How do I improve runbook findability?
Include alert names, use descriptive titles, add keywords, and ensure proper linking from alerts.
What if there's no runbook for an incident?
Contact the service owner, use available tools to diagnose, document your actions, and create a new runbook afterward.
What if the runbook steps are incorrect?
Stop if unsure, contact system experts, note the inaccurate information, and update the runbook after resolving the incident.
Photo credits: Photo by Glenn Carstens-Peters on Unsplash