The 2024 List of Incident Management Resources
Introduction
This article is an attempt to list the best incident management material and guides available for free on the internet. If I've missed something you think should be here, do let me know and I'll be happy to add it.
- Introduction
- Reading Material (Articles, Guides, Slides, Books)
- Templates and Real-World Examples
- Conference Talks
- Podcast Episodes
Reading Material (Articles, Guides, Slides, Books)
- PagerDuty's incident response documentation This is a publicly available version of PagerDuty's internal training for their own employees. It's presented in a slide format with text alongside. PagerDuty has other useful resources at https://response.pagerduty.com/. Although it's aimed at training their own employees, the principles are useful and can be applied to your own team. It's open source.
- Incident Management at Atlassian
- The Google SRE book chapters on being on-call and responding to incidents
- The Google SRE workbook chapters on on-call and incident response
- The Negotiability of "Severity" Levels. Also see the talk "SREcon24 Americas - What Is Incident Severity, but a Lie Agreed Upon?"
- Lessons learned in incident management (Dropbox)
- Incident Review and Postmortem Best Practices An article based on a survey of incident handling practices across different companies.
- Markers of Progress in Incident Analysis How to measure whether your org is learning from incident analyses.
- Incident Management: The Complete Guide (Splunk) An overview.
- The Incident Lifecycle: How a Culture of Resilience Can Help You Accomplish Your Goals "How to apply resilience throughout the incident lifecycle in order to turn incidents into opportunities".
- GitLab Incident Management
- AWS Security Incident Response Guide
Templates and Real-World Examples
- Post-mortem template
- AWS Incident Response Runbook Samples
- GitLab On-Call Runbooks
- Post-mortem Templates List
Conference Talks
- SEV0 2024 | Maintaining blameless incident culture when everyone knows whodunnit - Focusing on improvements rather than blame.
- SEV0 2024 | Stop, Drop, and SEV4: Why small incidents are a big deal - On how investigating low-severity incidents can improve overall incident response.
- SREcon21 - Evolution of Incident Management at Slack - An account of how Slack developed its incident management program.
- LISA18 - Incident Management at Netflix Velocity - Netflix's strategy for managing incidents.
- SREcon17 Asia/Australia: Measuring the Success of Incident Management at Atlassian - How Atlassian developed its incident management process and the challenges faced.
- CLL 2019 - Eliza Binette and Beth Adele Long: How to be a great Incident Commander - On the traits that make a great Incident Commander.
- SREcon23 Americas - An Organizational Response to Incidents: Designing for Smooth Coordination - On moving to a more collaborative approach in incident response teams.
- Monitorama PDX 2024 - Incident Management: Lessons from Emergency Services - A study of how lessons from incident management in emergency services can be applied to software incident management.
- Monitorama PDX 2023 - Stress, OnCall, and You - The consequences of stress that come from on-call situations and how to mitigate them.
- Monitorama PDX 2022 - Meaningful Measurements: Lessons from Outside of Tech - Distinguishes between monitoring and measurement with examples from aviation disasters.
- Monitorama PDX 2015 - Incident Management and the Incident Complexity Framework - An introduction to a complexity-based framework to replace traditional models that depend on the service's state and metrics.
- How HashiCorp SREs Built HCP's Incident Management Program - Evolution of HashiCorp's incident management.
- Incident Analysis: How Learning is Different Than Fixing - John Allspaw - Focusing on learning from incidents rather than just fixing.
Podcast Episodes
- Incident.io's podcast - Dispelling the myths around incident response with Colette Alexander, Director of Engineering - Common misconceptions around incident management.
- Slight Reliability Episode 81 - Incident Management in Non-Prod Environments - The importance of treating incidents in non-prod environments with the same urgency as prod incidents.
- Finding a Common Language for Incidents with John Allspaw - Getting everyone - tech and non-tech folks - on the same page during incidents.
- Slight Reliability Episode 89 - Blameless Post-mortems with Karanveer Anand - On the importance and practice of conducting effective blameless post-mortems.
- Slight Reliability Episode 72 - Rapid Incident Response with Valeska Victoria - On rapidly responding to incidents in a high-stress environment.
- Google SRE Prodcast - Incident Management with Adrienne Walcer - Managing incident response efforts as a continuous process.
- Google SRE Prodcast - Incident Response with Sarah Butt and Vrai Stacey - Incident response tactics focusing on tooling and software.
- Changelog - Learning from incidents (Interview) - Using incident analysis as a learning tool.
Disclaimer: I am not affiliated with any of the organizations or people mentioned in this article in any way.
Social share photo credits: Federico Beccari on Unsplash