Skip to main content

How To Decide Between Hosting Your Own Status Page Versus Using a Managed One

· 6 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

A status page forms a key part of your incident communication strategy. When it comes to setting up a status page, you have two options:

  • Host your own - using either an open source project or a custom solution.
  • Use a managed status page provider.

We will examine the pros and cons of each option along these dimensions:

  1. Feature Set
  2. Service Related

For 1, if you choose a self-managed, open-source or custom solution, it's in your control. For a managed solution, you are limited by the provider's feature set.

For 2, if you choose a self-managed solution, your team is responsible for the quality of the service. For a managed solution, you are dependent on the provider's service quality.

In most cases, you are better off using a managed solution from a reputed provider, unless you have:

  • Specific requirements that are not met by the vendor.
  • Budget constraints.

Monitoring Security Vulnerabilities in Your Cloud Vendors

· 7 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

If you manage applications running on cloud platforms, you likely depend on multiple cloud vendors and services. These could be infrastructure providers like AWS, GCP or Azure. A vulnerability in any of these services could potentially impact your applications and your users.

A cloud platform has many moving parts, many of which are dependent on other third-party providers. For example:

  • Operating system images for VMs which are maintained by third-party vendors.
  • Container images which are hosted on external repositories.
  • Software stacks which are maintained by other vendors but available for deployment on the cloud provider.
  • Libraries used by the cloud provider's internal software which are maintained by other developers or organizations.
  • Control plane software like Kubernetes.
  • Hardware, like processors, which are provided by the manufacturer.
  • Hypervisors which are developed and maintained by third-party vendors.
  • Networking hardware manufactured by other vendors.

Summarizing SRE/Ops Podcasts Using an LLM

· 2 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

There are plenty of good SRE/Ops related podcasts out there. I follow a few of them and listen to episodes whose titles sound interesting. The problem with podcasts is that some episodes focus on one topic, and other episodes deal with a host of topics. In between there is filler and things that are not relevant to the topic but are necessary to carry on a conversation. Spending 30-60 minutes listening to podcasts is not always a great use of time.

A while ago I decided to create a tool that summarizes podcasts for me using an LLM. If I find the summary interesting enough or there is something that I want to learn about, I go and listen to the entire episode. Such a tool might be useful to others also, so I made a website for it - https://www.srenews.info.

I encourage you to listen to the complete episodes if you find a summary interesting - they are linked from each summary page.

Sending Alerts Using Prometheus and Alertmanager

· 9 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

This article is part of a series on setting up an end-to-end monitoring and alerting stack using Prometheus.

Continuing our series on setting up Prometheus in a container, this article provides a step-by-step guide for how to configure alerts in Prometheus. We will add alerting rules and deploy Prometheus Alertmanager with Slack integration.

If you follow the steps in this article, you will end up with a containerized setup for:

  1. A Prometheus instance with alerting rules.
  2. An Alertmanager instance which can send alerts originating from those rules to a Slack channel.

Let's get started.

Prometheus alerts

Deploying Prometheus With Docker

· 4 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

This article is part of a series on setting up an end-to-end monitoring and alerting stack using Prometheus.

There are different ways you can use to deploy the Prometheus monitoring tool in your environment. One of the fastest ways to get started is to deploy it as a Docker container. This guide shows you how to quickly set up a minimal Prometheus on your laptop. You can then extend that setup to add a monitoring dashboard, alerting, and authentication.

Deploying Prometheus in a Docker Container

Basic Setup

Running Prometheus in Docker is very simple using this command

docker run -p 9000:9090 prometheus prom/prometheus:v3.0.0

This will pull and run the latest version (as of this writing) of Prometheus. You can access the Prometheus UI at localhost:9000. Note that the container port 9090 is forwarded to the localhost port 9000.

The 2024 List of Incident Management Resources

· 5 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

This article is an attempt to list the best incident management material and guides available for free on the internet. If I've missed something you think should be here, do let me know and I'll be happy to add it.

How to Configure a Remote Data Store for Prometheus

· 9 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

This article is part of a series on setting up an end-to-end monitoring and alerting stack using Prometheus.

The Prometheus monitoring tool can store its metrics either locally or remotely. You can configure a remote data store using the remote_write configuration. This article describes the various data store options available as well as how to set up a remote store.

Overview of Remote Storage

By default, Prometheus stores data locally wherever it is installed. The data directory can be configured by using the --storage.tsdb.path command line option when starting Prometheus. In practice you can use a separate disk for higher performance attached to the machine where Prometheus is running.

However, this may not be possible or optimal in all situations as you might want a data store that is more suited for time series data, and has larger storage capabilities for higher data retention. Prometheus would usually run in a standalone VM or a Kubernetes pod or a Docker container, and it would not have access to such data stores by default.

A remote store can add these capabilities to Prometheus. The remote storage option can be set by using the remote_write key in the Prometheus configuration YAML file.

A Beginner's Guide To Service Discovery in Prometheus

· 11 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

This article is part of a series on setting up an end-to-end monitoring and alerting stack using Prometheus.

Service discovery (SD) is a mechanism by which the Prometheus monitoring tool can discover monitorable targets automatically. Instead of listing down each and every target to be scraped in the Prometheus configuration, service discovery acts as a source of targets that Prometheus can query at runtime.

Service discovery becomes crucial when there are dynamically changing hosts, especially in microservices architectures and environments like Kubernetes. In Prometheus parlance, service discovery is a way of discovering "scrape targets".

For example, pods are created dynamically in Kubernetes as a result of new services being deployed and undeployed, autoscaling events, and errors causing pods to crash and go away. If you are using Prometheus for scraping pods in such an environment, Prometheus has to know which pods are running and scrapable at any given point in time. The Kubernetes service discovery pluging enables this. Similarly, there are SD plugins for other common environments.

You can use service discovery in Prometheus with the predefined plugins, or write your own custom ones using file or HTTP, depending on the situation.

Prometheus logo

The No-Nonsense Guide to Runbook Best Practices

· 9 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Runbooks are a key part of incident management and preserve institutional knowledge. They can be used for both incident response as well as routine tasks like db maintenance and generating a complex report. We are mostly focused on incident response runbooks here.

Runbooks are a checklist

Best Practices

1. Runbook Structure

  • Establish a standard format that will be used across your organization. This will ensure consistency and help on-call folks to quickly figure out the steps even for runbooks they may not have seen before. It will also help in editing and maintaining the runbooks.
  • Get buy-in from your team on the decided format. If you don't have buy-in people might not want to maintain or use them.
  • Create the runbooks as decision trees. You don't need a visual guide here but include it if it's easy to create. Don't have too many branches in the tree - that will cause confusion. If you find yourself adding

The Ultimate List of Incident Management Tools in 2024

· 7 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Incident management tools are important for organizations to effectively handle service outages. With so many incident management tools around with different feature sets, it's often difficult to find the one that is right for your needs. In this article, we attempt to make a list of incident management software available in 2024 with their features to help you arrive at the right one.

We have focused mostly on tools that offer incident management capabilities - which include at least incident lifecycle management, on-call scheduling, and third-party integrations.

There are many good tools which are focused only on incident response, or on monitoring and generating alerts, or on the ticketing aspect of incidents. We have not included those to avoid cluttering this article.

Incident Management Tools

Benefits of Using an Incident Management Tool

  • An incident management tool streamlines the incident management process by helping to define and automate workflows. It can help you create runbooks, alerting and escalation policies, and define and manage on-call schedules.
  • Incident Management software often come with integrations with your observability stack. Your observability stack is a key source of incidents. They can also integrate with your existing communication and