
A Beginner's Guide To Service Discovery in Prometheus

· 11 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

This article is part of a series on setting up an end-to-end monitoring and alerting stack using Prometheus.

Service discovery (SD) is a mechanism by which Prometheus can discover monitorable targets automatically. Instead of listing every target to be scraped in the Prometheus configuration, you point Prometheus at a service discovery source that it queries for targets at runtime.

Service discovery becomes crucial when there are dynamically changing hosts, especially in microservices architectures and environments like Kubernetes. In Prometheus parlance, service discovery is a way of discovering "scrape targets".

For example, pods are created and destroyed dynamically in Kubernetes as services are deployed and undeployed, as autoscaling events occur, and as errors cause pods to crash. If you are using Prometheus to scrape pods in such an environment, Prometheus has to know which pods are running and scrapable at any given point in time. The Kubernetes service discovery plugin enables this. Similarly, there are SD plugins for other common environments.

You can use service discovery in Prometheus with the predefined plugins, or write your own custom integration using file-based or HTTP-based service discovery, depending on the situation.
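To make the file-based option concrete, here is a minimal sketch of a script that produces a targets file in the JSON shape that file-based service discovery reads: a list of groups, each with a `targets` list and an optional `labels` map. The function names, the shape of the input records, and the sample addresses are assumptions made for illustration; they are not part of Prometheus.

```python
import json
import os
import tempfile


def build_target_groups(instances):
    """Group host:port addresses by their label set.

    `instances` is a hypothetical inventory format: a list of dicts,
    each with an "address" string and a "labels" dict of strings.
    """
    groups = {}
    for inst in instances:
        key = tuple(sorted(inst["labels"].items()))
        groups.setdefault(key, []).append(inst["address"])
    # One entry per distinct label set, in a stable order.
    return [
        {"targets": sorted(targets), "labels": dict(key)}
        for key, targets in sorted(groups.items())
    ]


def write_targets_file(path, target_groups):
    """Write to a temp file, then rename, so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(target_groups, handle, indent=2)
    os.replace(tmp_path, path)


if __name__ == "__main__":
    inventory = [
        {"address": "10.0.0.2:9100", "labels": {"env": "prod", "job": "node"}},
        {"address": "10.0.0.1:9100", "labels": {"env": "prod", "job": "node"}},
        {"address": "10.0.1.5:8080", "labels": {"env": "staging", "job": "api"}},
    ]
    write_targets_file("targets.json", build_target_groups(inventory))
```

On the Prometheus side, a scrape job would reference the generated file through the `files` list of a `file_sd_configs` entry. Running a script like this on a schedule, or whenever your inventory changes, keeps the target list current without editing the main configuration.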


The No-Nonsense Guide to Runbook Best Practices

· 9 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Runbooks are a key part of incident management and preserve institutional knowledge. They can be used both for incident response and for routine tasks like database maintenance or generating a complex report. We are mostly focused on incident response runbooks here.


Best Practices

1. Runbook Structure

  • Establish a standard format that will be used across your organization. This will ensure consistency and help on-call folks to quickly figure out the steps even for runbooks they may not have seen before. It will also help in editing and maintaining the runbooks.
  • Get buy-in from your team on the chosen format. Without buy-in, people might not want to maintain or use the runbooks.
  • Create the runbooks as decision trees. You don't need a visual guide here but include it if it's easy to create. Don't have too many branches in the tree - that will cause confusion. If you find yourself adding

The Ultimate List of Incident Management Tools in 2024

· 7 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Incident management tools are important for organizations to effectively handle service outages. With so many incident management tools around with different feature sets, it's often difficult to find the one that is right for your needs. In this article, we attempt to make a list of incident management software available in 2024 with their features to help you arrive at the right one.

We have focused mostly on tools that offer incident management capabilities - which include at least incident lifecycle management, on-call scheduling, and third-party integrations.

There are many good tools which are focused only on incident response, or on monitoring and generating alerts, or on the ticketing aspect of incidents. We have not included those to avoid cluttering this article.


Benefits of Using an Incident Management Tool

  • An incident management tool streamlines the incident management process by helping to define and automate workflows. It can help you create runbooks, alerting and escalation policies, and define and manage on-call schedules.
  • Incident management software often comes with integrations for your observability stack, which is a key source of incidents. These tools can also integrate with your existing communication and

The Rising Role of Slack in Incident Management

· 4 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Why is Slack becoming so popular in incident management?

Slack is one of the most popular communication tools used in companies. If you're part of a remote team, your team is probably on Slack or something similar like MS Teams. Although IM tools lack the communication nuances that are taken for granted in face-to-face interactions, they provide many other advantages:

  • Access to historical data
  • Asynchronous communication
  • The ability to share links and documents easily
  • Adding anybody in the organization to a conversation

Slack in Incident Management

One of the trends I've noticed in incident management is the growing role of Slack in incident response and management tools. I think this is tied to the increase in remote work after COVID-19.

The 2024 Guide to Open Source Status Page Providers

· 7 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Maintaining transparent communication about service availability is crucial for businesses of all sizes. Status pages are an important part of your communication strategy during times of outages and maintenance events.

You can choose to go with a fully managed status page provider, or host an open-source one yourself.

Open source status page providers offer a cost-effective and customizable solution. However, they can come with their own drawbacks. This guide explores open source status page providers in 2024 to help you choose the right tool for your needs.

List of Open Source Status Page Providers

1. Cachet

Cachet is a popular open source status page system built with PHP and Laravel. It offers a clean, minimalist design and a robust feature set.

Best Practices for Choosing a Status Page Provider

· 6 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Downtime is inevitable, but what sets successful businesses apart is how they handle it. A key part of incident management is incident communication with both internal and external stakeholders. A status page is a crucial tool for maintaining clear communication with users during outages or service interruptions. There are numerous status page providers available with different features. This article will guide you through best practices for selecting a provider that suits your needs.


The Importance of a Status Page

An internal status page gives your colleagues and stakeholders in your organization a snapshot of the current status. It can help reduce unnecessary back and forth between teams, and help people prioritize their work better. It also creates internal transparency and trust between teams.

An external status page is crucial if you say you are committed to open communication with your end users or customers. Whether you are B2B or B2C, a public status page is the first thing people will check if they face issues. Being open about incidents and your efforts to mitigate them builds user trust. It can also decrease support ticket volume during incidents.

You can choose an open source status page provider, or one that is managed. This guide focuses on the factors to look at while evaluating managed providers.

Key Factors to Consider When Choosing a Status Page Provider

1. Reliability

Your status page needs to stay accessible, especially when your main services are down. Look for a provider that can offer:

  • Uptime SLA
  • Globally distributed infrastructure for high availability
  • Redundant systems to ensure failover and availability
  • Scalability to handle increased traffic during major incidents
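When comparing uptime SLAs, it helps to translate the percentage into the downtime it actually permits. The sketch below does that arithmetic; the function name and the 30-day default period are assumptions chosen for illustration, and real SLAs may define the measurement window differently.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Return the downtime, in minutes, that an uptime SLA permits over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% uptime over 30 days allows about {minutes:.2f} minutes of downtime")
```

For a 30-day window, 99.9% uptime works out to roughly 43 minutes of permitted downtime, while 99.99% permits a little over 4 minutes. The gap between two SLAs that look similar on paper can be large in practice.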

Integrate Incident Alerts Into Your Slack Workspace

· 3 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Updated Mar 26, 2025

Staying on top of your third-party Cloud and SaaS service outages is crucial to maintaining the reliability of your own applications. Like many modern teams, Slack might be your communication tool of choice. You can keep up with such incidents by pushing these events to a Slack channel.

IncidentHub has its own Slack app which can be used to push incident lifecycle events to the Slack channel of your choice. It can be used to send incident trigger, update, and resolve events.

Installing IncidentHub's Slack App

You must have the correct permissions on your Slack workspace to be able to do this.

Follow these steps to configure the Slack app in your Slack workspace.

How To Monitor Public Status Pages of Cloud Providers - a Step-by-Step Approach

· 8 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Incident updates on the public status pages of your cloud providers are often the first indication that they might have an outage. Providers also post updates about upcoming and ongoing maintenance on their status pages. Thus, monitoring your cloud status pages becomes crucial to your business operations. This article will guide you through the process of effectively monitoring such status pages.

Identify Your Cloud Providers

Work with your Dev/Ops/SRE and IT teams to come up with a comprehensive list of your cloud providers. Any service that is not managed by your teams is by definition a cloud service. Although we focus on Cloud providers - i.e. providers that let you deploy your services on their infrastructure - this article is equally applicable to any of your external SaaS vendors.

Integrate Incident Alerts With Discord Using Webhooks

· 4 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Staying on top of your third-party Cloud and SaaS service outages is crucial to maintaining the reliability of your own applications. If Discord is your communication tool of choice, you can keep up with such incidents by pushing these events to a Discord channel.

Discord webhooks allow external applications to send messages to specific channels within a Discord server. This article describes how to integrate Discord as a channel in your IncidentHub account using webhooks.

Note that IncidentHub also has an option to integrate with custom webhooks, which is different from Discord's webhooks. If you are using Discord, choose the Discord option. For a custom webhook server, choose the Webhook option.
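For readers curious about what a Discord webhook call looks like on the wire, here is a minimal sketch using only the Python standard library. A Discord webhook accepts an HTTP POST with a JSON body whose `content` field holds the message text. The function names, the message layout, the User-Agent string, and the placeholder URL are assumptions for illustration; this is not how IncidentHub itself sends messages.

```python
import json
import urllib.request

# Discord limits the "content" field of a message to 2000 characters.
MAX_CONTENT_LENGTH = 2000


def build_discord_payload(service, status, details_url):
    """Build the JSON body for a simple text message (hypothetical layout)."""
    content = f"[{status.upper()}] {service}: {details_url}"
    return {"content": content[:MAX_CONTENT_LENGTH]}


def build_request(webhook_url, payload):
    """Prepare the POST request without sending it."""
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Set an explicit User-Agent as a precaution; the value is arbitrary.
            "User-Agent": "incident-notifier-sketch/0.1",
        },
        method="POST",
    )


def send(webhook_url, payload):
    """Send the message and return the HTTP status code."""
    with urllib.request.urlopen(build_request(webhook_url, payload), timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    payload = build_discord_payload(
        "Example Cloud", "triggered", "https://status.example.com/incidents/123"
    )
    # Replace the placeholder with the webhook URL copied from your Discord server.
    print(json.dumps(payload))
```

Treat the webhook URL as a secret: anyone who has it can post to the channel.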

Discord Server Webhook Configuration

You must have the correct permissions on your Discord server to be able to do this.

A Step by Step Guide to Checking if a SaaS is Down

· 6 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Modern businesses depend heavily on Software as a Service (SaaS). SaaS is not limited to being used by software development teams. Almost all aspects of business operations - accounting, HR, payroll, marketing, IT, sales, support - depend on one or more SaaS applications. Given this dependency on SaaS applications, their uptime becomes tightly tied to a business's uptime. Any SaaS downtime can affect both a business's daily operations and the user experience.

How do you check if a SaaS is experiencing downtime? Follow the steps below:

Visit the SaaS Provider's Status Page

The SaaS provider's status page will have first-hand information about ongoing issues.
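Some status pages can also be read programmatically. As a rough sketch, the code below assumes the provider's page is hosted on Atlassian Statuspage and exposes a summary at `/api/v2/status.json` containing a `status` object with `indicator` and `description` fields. Other providers use different formats, so check the provider's own documentation first. The function name and the sample payload are illustrative.

```python
import json

# Indicator values treated as "something is wrong" in this sketch.
DEGRADED_INDICATORS = {"minor", "major", "critical"}


def summarize_status(raw_json):
    """Extract the overall indicator from a Statuspage-style status document."""
    doc = json.loads(raw_json)
    status = doc.get("status", {})
    indicator = status.get("indicator", "unknown")
    return {
        "indicator": indicator,
        "description": status.get("description", ""),
        "has_issue": indicator in DEGRADED_INDICATORS,
    }


if __name__ == "__main__":
    # Illustrative payload, not captured from a real provider.
    sample = json.dumps(
        {
            "page": {"name": "Example SaaS"},
            "status": {"indicator": "major", "description": "Partial System Outage"},
        }
    )
    print(summarize_status(sample))
```

In practice you would fetch the document over HTTPS at a modest polling interval and alert only when the indicator changes.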