3 posts tagged with "incident-management"

IncidentHub posts related to incident management

The Rising Role of Slack in Incident Management

· 4 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Why is Slack becoming so popular in incident management?

Slack is one of the most popular communication tools used in companies. If you're part of a remote team, your team is probably on Slack or something similar, such as MS Teams. Although IM tools lack the communication nuances that are taken for granted in face-to-face interactions, they provide many other advantages:

  • Access to historical data
  • Asynchronous communication
  • The ability to share links and documents easily
  • Adding anybody in the organization to a conversation

Slack in Incident Management

One of the trends I've noticed in incident management is the growing role of Slack in incident response and management tools. I think this is tied to the increase in remote work after COVID-19.

The COVID-19 pandemic brought a tremendous increase in the usage of Zoom, Slack, Google Meet, and similar tools. Remote work kept growing afterwards, and the tools evolved to support it. A natural consequence of a bigger remote workforce was more workflows moving to remote communication tools. The tools themselves evolved into platforms, and other tools were built on top of them. Incident management is one such workflow that has benefited.

Benefits of Using Slack in Incident Management

  • Incident lifecycle events are easier to share and analyze on such a platform. Sharing a dashboard URL, PagerDuty event link, Git commit link, link to a log file from your observability stack - these are all easy to paste in your collaboration tool.
  • Communication is more streamlined. You can create dedicated incident channels and use threads to organize discussions.
  • Integration with incident management tools - Many popular incident management platforms integrate with Slack, as do ticketing systems like Zendesk.
  • Improved visibility - It's easier to post status updates, share the results of Root Cause Analyses (RCA), share debug logs and screenshots - because everyone in your org is on Slack (or whichever tool it is). Anyone can check progress without having to be rebriefed.
  • Faster response times from on-call folks.

This explains why so many incident response and management tools are either being built using Slack as a foundation, or have tight integration with Slack.
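To make the "dedicated incident channel" idea concrete, here is a minimal sketch of how a tool might derive a channel name and the arguments for Slack's `conversations.create` Web API method. The `inc-<id>-<date>-<summary>` naming scheme and both function names are assumptions made for illustration, not how any of the tools in this post actually work. The sketch keeps to lowercase ASCII letters, digits, and hyphens, which stays within what Slack accepts for channel names, and it does not call the Slack API itself.

```python
import re
from datetime import date

MAX_CHANNEL_NAME_LENGTH = 80  # Slack limits channel names to 80 characters


def incident_channel_name(incident_id: int, summary: str, day: date) -> str:
    """Build a channel name using a hypothetical inc-<id>-<date>-<summary> scheme."""
    # Collapse every run of characters outside a-z and 0-9 into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    name = f"inc-{incident_id}-{day.isoformat()}-{slug}"
    # Truncate to the limit and avoid ending on a dangling hyphen.
    return name[:MAX_CHANNEL_NAME_LENGTH].rstrip("-")


def create_channel_payload(name: str, private: bool = False) -> dict:
    """Arguments to pass to Slack's conversations.create method."""
    return {"name": name, "is_private": private}


if __name__ == "__main__":
    channel = incident_channel_name(42, "Checkout errors (5xx)!", date(2024, 5, 1))
    print(channel)  # inc-42-2024-05-01-checkout-errors-5xx
    print(create_channel_payload(channel))
```

A predictable name like this makes incident channels easy to find later, which ties back to the "access to historical data" advantage.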

Tools That Use Slack for Incident Management

A non-exhaustive list of such tools and their features:

  • Pagerly - Manage on-call, ticket creation, incident lifecycle all within Slack
  • Incident - Lets you set up a dedicated channel per incident using a single command and manage it from there
  • FireHydrant - Integrates with Slack and lets you manage incidents from there
  • OpsLane - Operates directly in Slack channels, provides additional info and debugging resources
  • PagerDuty - You can trigger/ack/resolve incidents directly in Slack and create on-demand Slack channels for incidents
  • OpsGenie - Bidirectional integration allowing you to manage incidents
  • BetterStack - The entire incident lifecycle management in a dedicated Slack channel
  • Rootly - Create dedicated Slack channels, manage incident lifecycle events and on-call schedules
  • Zenduty - Fetch alerts, create and assign incidents based on on-call schedules

Features of Such Tooling

These tools are not limited to lifecycle management. They also add context to the incident:

  • Links to runbooks
  • Pulling in data from your infra and service catalogs
  • Relevant log entries from your observability systems
  • Key metrics that might be related to the incident
  • On-call information including schedules and escalation policies
  • Related incidents in other services
  • Related incidents in third-party services

This is an evolution of the ChatOps model, which introduced chat tooling for DevOps collaboration and automation tasks.
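As a rough illustration of what "adding context" can look like, the sketch below assembles a Slack Block Kit message with a runbook link, the current on-call engineer, and related incidents. The function name, its parameters, and the example values are all hypothetical. Only the block structure (`header` and `section` blocks with `plain_text` and `mrkdwn` text objects) follows Slack's Block Kit format. A real tool would pull these values from its runbook store, on-call schedule, and incident history.

```python
def incident_context_blocks(title, runbook_url, on_call, related_incidents):
    """Return a list of Block Kit blocks that summarize the context of an incident."""
    related = "\n".join(f"• {item}" for item in related_incidents) or "None found"
    return [
        {"type": "header", "text": {"type": "plain_text", "text": title}},
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                # <url|label> is Slack's mrkdwn syntax for a labeled link.
                "text": f"*Runbook:* <{runbook_url}|Open runbook>\n*On-call:* {on_call}",
            },
        },
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Related incidents:*\n{related}"},
        },
    ]


if __name__ == "__main__":
    blocks = incident_context_blocks(
        title="Checkout API latency",
        runbook_url="https://example.com/runbooks/checkout",  # placeholder URL
        on_call="alice",
        related_incidents=["Payment provider degraded"],
    )
    for block in blocks:
        print(block)
```

The resulting list could be passed as the `blocks` argument when posting a message to the incident channel.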

The next level of incident management is AI-driven incident response agents, where the AI agent takes the first shot at figuring out the root cause of the incident, and proposes mitigation steps. There are already a few of those - but it will be a while before they are mature enough.

Conclusion

Slack and other IM-based communication tools are playing an increasingly important role in the incident management process. Popular incident management platforms have tightly integrated with Slack, and some of them are built entirely using Slack as a base. The future will likely see the maturing of AI-powered incident response tools.

Photo credits: Stephen Phillips - Hostreviews.co.uk on Unsplash

A Step by Step Guide to Checking if a SaaS is Down

· 6 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Modern businesses depend heavily on Software as a Service (SaaS). Almost all aspects of business operations - accounting, HR, payroll, marketing, IT, sales, support - depend on one or more SaaS applications. SaaS usage is not limited to software development teams. Given this dependency, the uptime of these applications becomes tightly tied to a business's own uptime. Any SaaS downtime can affect both a business's daily operations and the user experience.

How do you check if a SaaS is experiencing downtime? Follow the steps below:

  1. Visit the SaaS Provider's Status Page
  2. Use External Monitoring Services
  3. Check Social Media
  4. Run Manual Tests
  5. Incident Communication
  6. Conclusion
  7. FAQ
  8. Popular SaaS Service Statuses

Visit the SaaS Provider's Status Page

The SaaS provider's status page will have first-hand information about ongoing issues.
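Many status pages also expose a machine-readable summary. As a sketch, and assuming the provider hosts its page on Atlassian Statuspage (which serves a summary at `/api/v2/status.json`), the response can be interpreted as shown here. Not every provider uses this format, so treat the field names as an assumption to verify per provider. The function name and the sample payload are made up for illustration, and the code parses a string instead of making a network call.

```python
import json


def summarize_status(payload: str) -> tuple[bool, str]:
    """Return (all_operational, description) from a Statuspage-style status.json body."""
    data = json.loads(payload)
    status = data.get("status", {})
    # Statuspage reports an indicator of "none" when there are no ongoing issues.
    indicator = status.get("indicator", "unknown")
    description = status.get("description", "No description provided")
    return indicator == "none", description


# Hand-written sample in the Statuspage format, not a real provider response.
SAMPLE = (
    '{"page": {"name": "Example SaaS"}, '
    '"status": {"indicator": "minor", "description": "Partially Degraded Service"}}'
)

if __name__ == "__main__":
    ok, description = summarize_status(SAMPLE)
    print(ok, description)  # False Partially Degraded Service
```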

The Benefits of a Single Incident Management System

· 2 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

How many monitoring tools do you have?

Chances are at least 2-3. One tool usually does not cover all cases, and it’s usually a combination of self-managed and managed tools. Self-managed tools give you more control over custom configurations and cost. Managed ones take away the headache of running them yourself.

Prometheus is the de facto standard for monitoring these days if you have a modern application stack and you want to manage your own monitoring. It is metrics-based, i.e., it uses metrics as the source of data from all the monitored systems. There are ready-made exporters for almost all popular infrastructure components. You can also send your application and business metrics to Prometheus with OpenTelemetry exporters.

This model does not work for all aspects of your service. For example, if you want to monitor external properties like your website, or use synthetic monitoring to check your customer-facing APIs from global locations, you could use something like Pingdom or UptimeRobot. This becomes another source of data about your service's uptime.
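To show what a synthetic check does at its simplest, here is a minimal single-location probe written with Python's standard library. It is only a sketch: services like Pingdom or UptimeRobot run checks from many locations and add scheduling and alerting on top. The `ProbeResult` type, the `probe` function, and the rule that a 2xx or 3xx status counts as "up" are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    url: str
    up: bool
    status_code: Optional[int]  # None when no HTTP response was received
    latency_ms: float


def probe(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> ProbeResult:
    """Request the URL once and record whether it answered and how long it took."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            code = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        code = err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar network errors.
        code = None
    latency_ms = (time.monotonic() - start) * 1000
    up = code is not None and 200 <= code < 400
    return ProbeResult(url, up, code, latency_ms)


if __name__ == "__main__":
    print(probe("https://example.com"))
```

The `opener` parameter exists so the check can be exercised in tests without touching the network.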