
The Benefits of a Single Incident Management System

· 2 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

How many monitoring tools do you have?

Chances are you have at least two or three. One tool usually does not cover all cases, so most teams end up with a combination of self-managed and managed tools. Self-managed tools give you more control over custom configurations and cost. Managed ones take away the headache of running them yourself.

Prometheus is the de facto standard for monitoring these days if you have a modern application stack and want to manage your own monitoring. It is metrics-based, i.e., it uses metrics collected from the monitored systems as its source of data. There are ready-made exporters for almost all popular infrastructure components. You can also send your application and business metrics to Prometheus with OpenTelemetry exporters.

This model does not work for all aspects of your service. For example, if you want to monitor external properties like your website, or use synthetic monitoring to check your customer-facing APIs from global locations, you could use something like Pingdom or UptimeRobot. This becomes another source of data about your service's uptime.

Many Monitors, One Incident Management System

A downside of running more than one monitoring system, however necessary, is that you have multiple sources of data, and you have to consult each of them to know the overall status. Alerts are different: it is important that they all arrive in a single incident and on-call management system, so that your on-call teams are paged from one place.

So ensuring that all your monitoring tools can integrate with your on-call system is crucial.

A typical Prometheus setup might look like:

[Figure: Monitoring setup]

If you have other monitoring systems, you should be able to route those alerts into your on-call/incident response system. Most tools support this:

[Figure: Monitoring setup]
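To make the routing idea concrete, here is a minimal sketch in Python of how alerts from two monitors could be normalized into one shape before they reach the paging system. The Alertmanager fields used (`alerts`, `status`, `labels`, `annotations`) follow its webhook payload; the uptime-check payload, the `Incident` class, and `page_on_call` are hypothetical names made up for illustration, and a real setup would normally rely on each tool's built-in integration instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Common shape the on-call system receives, whatever the source."""
    source: str
    summary: str
    severity: str
    firing: bool


def from_alertmanager(payload: dict) -> list[Incident]:
    """Convert a Prometheus Alertmanager webhook payload into Incidents."""
    incidents = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        incidents.append(Incident(
            source="prometheus",
            summary=annotations.get("summary", labels.get("alertname", "unknown alert")),
            severity=labels.get("severity", "unknown"),
            firing=alert.get("status") == "firing",
        ))
    return incidents


def from_uptime_check(payload: dict) -> list[Incident]:
    """Convert an uptime-check payload (field names are assumed, not a real API)."""
    is_down = payload["state"] == "down"
    return [Incident(
        source="uptime-check",
        summary=f"{payload['url']} is {payload['state']}",
        severity="critical" if is_down else "info",
        firing=is_down,
    )]


def page_on_call(incidents: list[Incident]) -> list[str]:
    """Stand-in for the single paging system: returns the pages it would send."""
    return [f"[{i.severity}] {i.source}: {i.summary}" for i in incidents if i.firing]


# Sample payloads, invented for this example.
alertmanager_payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "HighErrorRate", "severity": "critical"},
         "annotations": {"summary": "5xx rate above 5%"}},
        {"status": "resolved",
         "labels": {"alertname": "DiskFull", "severity": "warning"},
         "annotations": {"summary": "Disk usage back to normal"}},
    ],
}
uptime_payload = {"url": "https://example.com", "state": "down"}

pages = page_on_call(
    from_alertmanager(alertmanager_payload) + from_uptime_check(uptime_payload)
)
for page in pages:
    print(page)
```

Because every source is converted to the same `Incident` shape, adding another monitor means writing one more small adapter, while the paging logic stays in one place.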

IncidentHub monitors your external SaaS and cloud providers and notifies you when they have incidents. It can easily integrate into your existing incident management system.

[Figure: Monitoring setup]

If you’re using PagerDuty, just add a PagerDuty channel and you’re good to go. Check out the documentation for more.

Monitoring Your Third-Party Cloud and SaaS Services is Critical

· 3 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

If you have a software-based business, you are using at least a few cloud-based tools. It does not matter if you are a solo developer, or part of a 50-member team in a large organization. Take this random list and chances are you are using at least half of them:

  • Zoom
  • Google Workspace
  • Slack
  • Public cloud/PaaS - GCP/AWS/Azure/Render/Heroku/Railway/DigitalOcean/Hetzner
  • PagerDuty/Opsgenie
  • cdnjs
  • Docker Hub
  • GitLab/GitHub
  • Travis CI/CircleCI/Semaphore
  • Let’s Encrypt

Your entire business - irrespective of org or market size - including your development tools, collaboration/communication tools, infrastructure and hosting, monitoring, even email - is dependent on services that you don’t control. They are provided by other vendors.

Of course, you pay for some of them, and they all have SLAs. But having an SLA does not translate to 100% uptime. An SLA promises a percentage of uptime (usually 99.xx%), and companies will try their best to meet it. Your providers will still have incidents at some point, and the effect will cascade to the service that you provide to your customers. This means that your own product’s SLA can be breached due to causes outside your control.
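A quick back-of-the-envelope calculation shows how much room a "99.xx" figure leaves. The sketch below is illustrative only: the 99.9% values are example numbers rather than any vendor's actual SLA, and the combined figure assumes your service needs every dependency to be up and that their outages are independent, which is a simplification.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime an uptime percentage still permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


def combined_uptime_percent(dependency_uptimes: list[float]) -> float:
    """Best-case uptime of a service that needs all of its dependencies.

    Assumes independent failures, so the availabilities simply multiply.
    """
    combined = 1.0
    for uptime in dependency_uptimes:
        combined *= uptime / 100
    return combined * 100


# One provider at 99.9% over a 30-day month.
print(f"{allowed_downtime_minutes(99.9):.1f}")  # 43.2

# Three providers, each at 99.9%, all required by your product.
combined = combined_uptime_percent([99.9, 99.9, 99.9])
print(f"{combined:.2f}")  # 99.70
print(f"{allowed_downtime_minutes(combined):.1f}")  # 129.5
```

Even when every provider meets its SLA, the downtime they are collectively allowed adds up, and your own product inherits it.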

Can you not ask the service provider to notify you directly when this happens? Unlikely, unless you are a really big enterprise. However, most of them have public status pages where you can sign up to receive these alerts over SMS, email, Slack, etc.

The downside is - if you have 50 such services, you have to sign up on 50 pages, one by one. If you want to change your notification channel (another Slack channel, or SMS instead of Slack), you have to edit it on each of those 50 pages.
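The alternative to signing up page by page is to keep the list of services and the notification channel in one place. The sketch below assumes the providers expose a Statuspage-style `status.json` document with a `status.indicator` field (many status pages do, but not all); the service names and URLs are placeholders, and `degraded_services` and `notify` are hypothetical helpers, not part of any product.

```python
import json
from typing import Callable
from urllib.request import urlopen

# Placeholder service list: these names and URLs are invented for illustration.
SERVICES = {
    "git-hosting": "https://status.git-hosting.example/api/v2/status.json",
    "paging": "https://status.paging.example/api/v2/status.json",
}


def fetch_status(url: str, timeout: float = 10.0) -> dict:
    """Download one status document (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def degraded_services(documents: dict[str, dict]) -> list[str]:
    """Return one alert line per service whose indicator is not 'none'."""
    alerts = []
    for name, document in sorted(documents.items()):
        status = document.get("status", {})
        indicator = status.get("indicator", "unknown")
        if indicator != "none":
            description = status.get("description", "no description")
            alerts.append(f"{name}: {indicator} ({description})")
    return alerts


def notify(alerts: list[str], send: Callable[[str], None] = print) -> None:
    """Deliver every alert through one channel; swap `send` to change it."""
    for alert in alerts:
        send(alert)


# Sample documents in the assumed format, so the example runs offline.
sample_documents = {
    "git-hosting": {"status": {"indicator": "major",
                               "description": "Partial System Outage"}},
    "paging": {"status": {"indicator": "none",
                          "description": "All Systems Operational"}},
}
notify(degraded_services(sample_documents))
```

In a live run you would build the documents with `{name: fetch_status(url) for name, url in SERVICES.items()}`. Changing the notification channel then means changing the single `send` argument rather than editing 50 subscriptions.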

How does knowing about such issues help you? A few examples (true stories) will illustrate this:

  • Public cloud outages that are yet to impact your applications. If you get to know beforehand that your cloud vendor has an ongoing incident in your region, you can take preventive steps so that your applications are not affected.
  • Paging service outages. Your on-call teams can miss alerts because your paging service is unable to send alerts.
  • Delayed/missing messages in your communication tool. Your remote teams are not in sync because the tool is dropping some, but not all, messages.
  • Your hosted Git repo is throwing errors, while your customer waits for a critical bug fix.

Knowing that there is something wrong with the SaaS/cloud provider gives you an opportunity to do something about it, proactively.

There is no single place, and no easy way, to:

  • Choose services to monitor
  • Choose a channel to receive alerts

This is why we built IncidentHub - based on years of real-world experience. The UI is very simple so that receiving your first alert does not involve more than 2 steps. Check out the demo video below, and try it out yourself at https://incidenthub.cloud/

Originally published at https://www.linkedin.com/feed/update/urn:li:activity:7196385217270415361/