
9 posts tagged with "monitoring"

IncidentHub posts related to monitoring


How To Monitor Public Status Pages of Cloud Providers - a Step-by-Step Approach

· 8 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Incident updates on the public status pages of your cloud providers are often the first indication that they might have an outage. Providers also post updates about upcoming and ongoing maintenance on their status pages. Thus, monitoring your cloud status pages becomes crucial to your business operations. This article will guide you through the process of effectively monitoring such status pages.

  1. Identify Your Cloud Providers
  2. Locate Their Public Status Pages
  3. Understand the Status Page Structure
  4. Configure Notifications
  5. Best Practices
  6. Include in Your Incident Response Plan
  7. Use a Monitoring Tool
  8. Conclusion
  9. FAQ

Identify Your Cloud Providers

Work with your Dev/Ops/SRE and IT teams to come up with a comprehensive list of your cloud providers. Any service that is not managed by your teams is by definition a cloud service. Although we focus on cloud providers - i.e. providers that let you deploy your services on their infrastructure - this article is equally applicable to any of your external SaaS vendors.

Locate Their Public Status Pages

Every cloud provider has a public status page. You can find the link either on their company website, or by doing a web search. The status page software is either managed by your cloud provider, or outsourced to another service like Atlassian Statuspage or Instatus. Many observability and incident management providers like Incident.io and BetterStack also offer public status pages.

Understand the Status Page Structure

There is no official standard for status page formats but most of them use a similar visual layout. The common terms used to describe incident states are "Major/Minor outage", "Maintenance", "Informational", "Monitoring", and "Resolved".

Status pages will have any ongoing incidents at the top, followed by a list of components or services, followed by past incidents. Clicking on the ongoing incident link will take you to a detailed description of the incident.

An example from the Twilio status page:

Twilio status
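For pages hosted on Atlassian Statuspage, the same structure is usually also available as JSON. The sketch below is only an illustration: it assumes a Statuspage-hosted page that exposes `/api/v2/summary.json`, and it assumes the field names (`status.indicator`, `components[].status`, `incidents[].status`) follow Statuspage's public API. `summarizeStatusPage` is a hypothetical helper and the sample object is made up.

```javascript
// Summarize a parsed Statuspage-style summary.json document.
// Field names are assumed from Statuspage's public API, not taken from this article.
function summarizeStatusPage(summary) {
  const components = summary.components || [];
  const incidents = summary.incidents || [];
  return {
    // Overall indicator, e.g. "none", "minor", "major", "critical"
    overall: summary.status ? summary.status.indicator : "unknown",
    // Components that are not fully operational
    affectedComponents: components
      .filter((c) => c.status !== "operational")
      .map((c) => `${c.name}: ${c.status}`),
    // Incidents that are still open
    openIncidents: incidents
      .filter((i) => i.status !== "resolved" && i.status !== "postmortem")
      .map((i) => `${i.name} (${i.status})`),
  };
}

// Made-up sample data for illustration only.
const sample = {
  status: { indicator: "minor", description: "Partially Degraded Service" },
  components: [
    { name: "API", status: "degraded_performance" },
    { name: "Console", status: "operational" },
  ],
  incidents: [{ name: "Elevated API errors", status: "investigating" }],
};

console.log(summarizeStatusPage(sample));
```

Providers with their own status page software will need their own parsing logic.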

Configure Notifications

Instead of periodically visiting status pages you can choose to sign up to receive notifications when there is an incident created, updated or resolved. Depending on your provider, status pages offer different modes of notification.

  • SMS
  • Slack
  • Email
  • RSS feed
  • Google Chat
  • Discord
  • Webhooks

Some status pages offer only one or two options, or sometimes no options at all. If the status page is managed by someone other than your cloud provider, your cloud provider can choose to enable/disable some of the available notification options. For example, both DigitalOcean and Mailgun use Atlassian Statuspage. DigitalOcean allows you to subscribe using many channels:

DigitalOcean status

whereas Mailgun has disabled all options:

Mailgun status

This is as of this writing. Providers can modify these options over time depending on their business requirements.

Notification Challenges

Your notifications should be delivered in a way that ensures the right team in your organization receives the alerts. If the team uses Slack that is where you want the notifications. If it's Discord, the notifications should go to a Discord channel.

The status pages used by your providers can have different notification options, which can pose a challenge. They might not offer the option you want. Some providers may have your chosen option, some might not. See the section on Use a Monitoring Tool on how to mitigate this.
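If you do end up relaying incidents yourself, one way to handle mismatched notification options is to build the message once and adapt it per channel. The sketch below assumes Slack incoming webhooks (which accept a JSON body with a `text` field) and Discord webhooks (which accept a `content` field). The `incident` shape and the `buildWebhookBody` helper are hypothetical.

```javascript
// Build the JSON body for a chat webhook from one normalized incident.
// The incident fields (provider, title, status, url) are an assumed shape.
function buildWebhookBody(channel, incident) {
  const line = `[${incident.provider}] ${incident.title} - ${incident.status} ${incident.url}`;
  switch (channel) {
    case "slack":
      return { text: line }; // Slack incoming webhooks read "text"
    case "discord":
      return { content: line }; // Discord webhooks read "content"
    default:
      throw new Error(`Unsupported channel: ${channel}`);
  }
}

// Made-up incident for illustration only.
const incident = {
  provider: "ExampleCloud",
  title: "Elevated API errors",
  status: "investigating",
  url: "https://status.example.com/incidents/abc123",
};

console.log(buildWebhookBody("slack", incident));
console.log(buildWebhookBody("discord", incident));
```

You would then POST the returned object as JSON to the webhook URL of the channel your team uses.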

Best Practices

Filtering Your Monitors

Cloud providers have many, sometimes hundreds of, services in different locations across the globe. A cloud provider's status page shows incidents across all of them. Your team should receive notifications only for the services they use, and in the regions they use them in. Most status pages have an option to choose the services and the regions. Use this feature so that your team is not flooded with unnecessary notifications.

E.g. the Fastmail status page, which is hosted by Instatus, has options to sign up for notifications for specific components: Fastmail status notifications

In some large cloud providers like Google Cloud, it can become difficult to sign up for specific components and regions. Let's say you use Google Kubernetes Engine in us-central1. Currently the Google Cloud status page offers no way to receive notifications for only GKE in us-central1.
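When the status page itself cannot filter, the filtering has to happen on your side. A minimal sketch, assuming you already have incidents in a normalized shape with `components` and `regions` arrays (a hypothetical shape, not any provider's actual format):

```javascript
// Decide whether an incident matters to a team, given what the team uses.
// An incident with an empty region list is treated as global.
function isRelevant(incident, watch) {
  const componentHit = incident.components.some((c) => watch.components.includes(c));
  const regionHit =
    incident.regions.length === 0 ||
    incident.regions.some((r) => watch.regions.includes(r));
  return componentHit && regionHit;
}

// The team from the example above: GKE in us-central1 only.
const watch = {
  components: ["Google Kubernetes Engine"],
  regions: ["us-central1"],
};

// Made-up incidents for illustration only.
const incidents = [
  { components: ["Google Kubernetes Engine"], regions: ["us-central1"] },
  { components: ["Google Kubernetes Engine"], regions: ["europe-west1"] },
  { components: ["Cloud Storage"], regions: ["us-central1"] },
  { components: ["Google Kubernetes Engine"], regions: [] },
];

console.log(incidents.map((i) => isRelevant(i, watch)));
```

Treating an incident without regions as global is a design choice: it errs on the side of notifying rather than missing an outage.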

Do Periodic Reviews

Status pages keep changing. Your cloud provider may choose to add/remove services, switch to a different status page provider, or add/remove notification modes.

Have a Single View Across All Providers

To check whether any of your cloud providers has an outage, you need a single view where all your providers show up. In the absence of a dedicated tool that monitors your cloud provider status pages, your notification channel is a poor substitute. If it's Slack, you can configure the notifications to go into a specific Slack channel. However, it can be difficult to search for past incidents, or to get an overview of ongoing incidents, in Slack.
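As a rough illustration of what a single view involves, the sketch below sorts providers so that the worst status shows up first. It assumes each provider's status has already been reduced to one indicator (`none`, `minor`, `major`, `critical`, borrowed from Statuspage's vocabulary). The input shape and `buildOverview` are hypothetical.

```javascript
// Rank indicators so that worse states sort first.
const RANK = { none: 0, minor: 1, major: 2, critical: 3 };

// Produce one line per provider, worst status first, ties broken by name.
function buildOverview(pages) {
  return pages
    .map((p) => ({ ...p, rank: RANK[p.indicator] ?? 0 }))
    .sort((a, b) => b.rank - a.rank || a.provider.localeCompare(b.provider))
    .map((p) => `${p.provider}: ${p.indicator}`);
}

// Made-up provider names and states for illustration only.
const pages = [
  { provider: "ProviderA", indicator: "none" },
  { provider: "ProviderB", indicator: "major" },
  { provider: "ProviderC", indicator: "minor" },
];

console.log(buildOverview(pages).join("\n"));
```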

Include in Your Incident Response Plan

Irrespective of your chosen notification mode, ensure that your incident response plan includes cloud provider alerts. Determine the right priority for such alerts so that your team can respond effectively. Including them in the plan also lets teams correlate alerts from other parts of your systems with cloud provider alerts and get to the root cause faster.
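One simple form of correlation is by time: did an internal alert fire while a provider incident was open, or shortly before or after? The sketch below assumes hypothetical shapes for both alerts and provider incidents, and a 30-minute tolerance window chosen purely for illustration.

```javascript
// For each internal alert, list provider incidents that overlap in time.
// windowMinutes widens each incident's time range on both sides.
function correlate(alerts, providerIncidents, windowMinutes = 30) {
  const windowMs = windowMinutes * 60 * 1000;
  return alerts.map((alert) => {
    const firedAt = Date.parse(alert.firedAt);
    const related = providerIncidents.filter((inc) => {
      const start = Date.parse(inc.startedAt) - windowMs;
      // An unresolved incident is treated as still ongoing.
      const end = (inc.resolvedAt ? Date.parse(inc.resolvedAt) : Date.now()) + windowMs;
      return firedAt >= start && firedAt <= end;
    });
    return { alert: alert.name, related: related.map((inc) => inc.title) };
  });
}

// Made-up data for illustration only.
const alerts = [
  { name: "checkout-latency-high", firedAt: "2024-07-17T08:50:00Z" },
  { name: "disk-usage-high", firedAt: "2024-07-17T12:00:00Z" },
];
const providerIncidents = [
  {
    title: "ExampleCloud: network issues in us-east-1",
    startedAt: "2024-07-17T08:42:00Z",
    resolvedAt: "2024-07-17T09:30:00Z",
  },
];

console.log(JSON.stringify(correlate(alerts, providerIncidents), null, 2));
```

A time overlap is only a hint, not proof of cause. It tells the on-call engineer where to look first.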

Use a Monitoring Tool

As noted in the previous sections, there are various challenges to monitoring cloud providers' status pages by yourself, unless you have only one or two such providers. There are various tools which aim to solve these pain points. IncidentHub is a SaaS tool created specifically to solve these challenges faced by Dev/Ops/SRE and IT Teams. You can create a free account which comes with 20 status page monitors and try it out.

IncidentHub monitors hundreds of cloud provider status pages periodically. It can send you notifications over the medium you choose - Email, Slack, PagerDuty, Discord, MS Teams, etc. IncidentHub also gives you a single dashboard where you can view ongoing and past incidents with your cloud providers: Availability page

The Benefits of Using a Monitoring Tool

The benefits of using a dedicated tool which monitors cloud status pages:

  • Offers a single normalized view across cloud providers' status pages
  • Hides the complexity of different status page formats
  • Detects and handles changing status page formats over time
  • Lets you choose the notification mode you want for alerts
  • Offers notification modes not available in the status page

Conclusion

Monitoring public status pages of cloud providers should form a key part of your monitoring strategy to maintain operational effectiveness and customer trust. Your team can stay informed and responsive during cloud service disruptions. There are various challenges in doing this by yourself - heterogeneous status page formats, non-overlapping notification modes, non-standard incident updates, and changing status page structures. A status page monitoring tool like IncidentHub can mitigate all these issues.

FAQ

Why should I monitor my cloud provider status pages?

Your cloud providers publish information about ongoing incidents and maintenance on their public status pages. Such disruptions can affect your business operations.

What if I am not able to locate a cloud provider's status page?

Cloud providers have a link to their status page on their website or you can find it using web search. If you are unable to locate it please get in touch with us at support@incidenthub.cloud and we will try our best to help you.

What is the best way to receive notifications?

The best way to receive notifications about cloud provider incidents is specific to your team. Discuss with your team what would make it most effective.

Is there a standard status page format?

There is no standard for a status page format. However, many cloud providers use one of the popular status page services like Atlassian Statuspage or Instatus. Providers using the same status page service will have a similar format. Some providers have their own format - like Google Cloud and Amazon Web Services.

What are the benefits of using a dedicated status page monitoring tool?

A dedicated status page monitoring tool smooths out the differences between cloud providers' status pages and gives you the option to receive notifications in your chosen way.

A Step by Step Guide to Checking if a SaaS is Down

· 6 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

Modern businesses depend heavily on Software as a Service (SaaS). Almost all aspects of business operations - accounting, HR, payroll, marketing, IT, sales, support - depend on one or more SaaS applications. SaaS is not limited to being used by software development teams. Given this dependency on SaaS applications, their uptime becomes tightly tied to a business's uptime. Any SaaS downtime can affect both a business's daily operations as well as the user experience.

How do you check if a SaaS is experiencing downtime? Follow the steps below:

  1. Visit the SaaS Provider's Status Page
  2. Use External Monitoring Services
  3. Check Social Media
  4. Run Manual Tests
  5. Incident Communication
  6. Conclusion
  7. FAQ
  8. Popular SaaS Service Statuses

Visit the SaaS Provider's Status Page

The SaaS provider's status page will have first-hand information about ongoing issues.

Locate the SaaS provider's Status Page

You can find this by doing a web search like "Zoom status page" or "OpenAI status page". You can also visit the SaaS provider's website and look for the status page link - it is usually in the footer. Another option is to check their documentation. If it's not available, ask on their social media handles.

Understanding Status Pages

A SaaS provider's status page will indicate if the service is experiencing any downtime. Common status indicators are:

  • Degraded performance
  • Service disruption
  • Partial outage

For example, take a look at the OpenAI status page:

OpenAI status

Status pages also show you past incidents:

OpenAI past incidents

You can find more information about the outage by clicking on the downtime link on the status page. It will have details about which components or services are affected by the outage. If your SaaS has many independent locations, like a cloud provider, look for region/zone information as well. It's possible that the outage is limited to some components or locations. Check if any of the components or services you use are in the list. If it's a cloud provider or a similar service, check if the affected locations are among the ones that you use.

E.g. this Google Cloud outage affected Google Compute Engine in the asia-northeast1 region.

Google Cloud incident

Use External Monitoring Services

There are many monitoring tools that can track SaaS uptime. They are designed to continuously check the availability of SaaS services. These tools take away the hassle of checking uptime manually, especially if you have many SaaS applications. Checking the status page of each SaaS application is cumbersome. A status page monitoring tool like IncidentHub can make this very easy by showing you the overall status of all your SaaS providers in one place.

Setting Up a Monitoring Tool

IncidentHub is a monitoring tool that checks official status pages of hundreds of SaaS applications. It notifies you in real-time if there is an outage or downtime. Setting up IncidentHub takes just a few steps.

Check Social Media

Twitter and Reddit are popular platforms to find SaaS outage information. Users post on these platforms to find more information and to check if others are also experiencing similar downtimes with the service. Such platforms can often provide real-time updates from other users. A caveat here is that if the outage is localized to some components or regions, you may not always find information about it on social media.

If your SaaS has a subreddit, check the latest postings there for information.

Other community forums where users of the SaaS hang out can also provide important outage information.

Run Manual Tests

Running manual tests is another way to check if your SaaS is experiencing downtime. Check common functionality issues such as login failures, API errors, resource creation issues, and other specific functionalities. Correlate these with the official status page data, and what other users are reporting on social media. This is more of an ad-hoc method but it can add valuable information.
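When you run such tests against an HTTP API, the response status code gives a first hint about where the problem lies. The mapping below is a rough sketch based on standard HTTP status code classes. The labels and the `classifyResponse` helper are hypothetical, and a real check would also look at latency and response bodies.

```javascript
// Give a first-pass reading of an HTTP status code from a manual test.
function classifyResponse(statusCode) {
  if (statusCode >= 200 && statusCode < 400) return "ok";
  // Authentication problems are often on your side: check credentials first.
  if (statusCode === 401 || statusCode === 403) return "auth failure";
  if (statusCode === 429) return "rate limited";
  if (statusCode >= 400 && statusCode < 500) return "client-side error";
  // 5xx responses point at the provider rather than your request.
  if (statusCode >= 500 && statusCode < 600) return "provider-side error";
  return "unknown";
}

for (const code of [200, 401, 429, 404, 503]) {
  console.log(code, classifyResponse(code));
}
```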

Incident Communication

It's very important to communicate with your team and your stakeholders about ongoing SaaS incidents. Your users and other business stakeholders should be notified as soon as you know there is an outage. This helps them to plan their work accordingly, and also decreases the number of user requests and helpdesk tickets you might get.

Incident communication is effective when you continuously share updates as they occur. It builds trust with your users. It's even better if users can check the status of their SaaS applications themselves on a status page or a dashboard.

Incident dashboard

Create alerts in your monitoring tool to inform your team about the status of services. Monitoring tools can integrate with most communication tools like email, Slack, Discord, etc.

Conclusion

In summary, you can check if your SaaS applications are down by checking the official status pages, using a monitoring tool, checking social media, and running manual tests. Keep communicating with your users about the current status.

This guide offers a clear method for users to quickly determine if their SaaS applications are down.

FAQ

How can you locate a SaaS provider's status page?

Check the SaaS provider's website, or run a web search.

Why is an external monitoring service important to track SaaS outages?

External monitoring tools continuously check SaaS status pages and other sources for incidents. They also check multiple SaaS providers at the same time. Doing this yourself is impractical and time-consuming.

How can you use social media to find out about SaaS downtime?

Popular social media channels like Twitter and Reddit often have real-time updates about SaaS outages from users who are experiencing downtime. SaaS-specific subreddits can be a good source of such information.

Popular SaaS Service Statuses

Airtable status
Akamai status
Azure status
Cloudflare status
Coinbase status
Discord status
Dropbox status
Fortnite status
GitHub status
Google Cloud status
Hetzner status
npm status
OpenAI status
PayPal status
Railway status
Reddit status
Rollbar status
SendGrid status
Twilio status
Vercel status
Zapier status

When Alerts Don't Mean Downtime - Preventing SRE Fatigue

· 3 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

A recent question in an SRE forum triggered this train of thought.

How do I deal with alerts that are triggered by internal patching/release activities but don't actually cause a downtime? If we react to these alerts we might not have time to react to actual alerts that are affecting customers.

I've paraphrased the question to reflect its essence. There is plenty to unravel here.

My first reaction to this question was that the SRE who posted this is in a difficult place with systemic issues.

Systemic Issues

Without knowing more about the org and their alerting policies, let's look at what we can dig out based on this question alone:

  • Patches/deployments trigger alerts
  • The team does not react to such alerts to avoid spending valuable time that can be directed towards solving downtime that is affecting customers
  • There is cognitive overhead of selectively reacting to some alerts, and ignoring others
  • The knowledge of which alerts to react to is something only the SRE team knows
  • Any MTTx data from such a setup are useless

The eventual impact is sub-optimal incident management, which affects SLAs and leads to burnout in on-call folks.

Improving the SRE Experience

How would you approach fixing something like this?

Some thoughts, in no particular order:

  • Setting the correct priority for alerts - Anything that affects customer perception of uptime, or can lead to data loss, is a P1. In larger organizations with independent teams responsible for their own microservices, I would extend the definition of customer to any team in your org that depends on your service(s). If you are responsible for an API used by a downstream service, they are your customers too.

  • Zero-downtime deployments - This is not as hard as it sounds if you design your systems with this goal in mind. For stateless web applications it is trivial to switch to a new version behind a load balancer. For stateful applications it can take a bit more work.

  • Maintenance mode - This can fall into two categories - maintenance mode that has to be communicated to the customer, and maintenance mode that is internal - affecting other teams who consume your service. At the alerting level, you temporarily silence the specific alerts that will get triggered by the rollout.

  • Investigate all alerts and disable useless ones - Not looking at an alert creates indeterminism and can lead to alert fatigue. The alerting system should be the single source of truth.
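The maintenance-mode idea above can be sketched as a time-bound silence that names the specific alerts a rollout is expected to trigger. Most alerting tools have a built-in equivalent. The `silences` shape and `shouldPage` helper here are hypothetical and only illustrate the logic.

```javascript
// Return false when an alert is covered by an active, named silence.
function shouldPage(alert, silences, now = new Date()) {
  const t = now.getTime();
  const silenced = silences.some(
    (s) =>
      s.alertNames.includes(alert.name) &&
      t >= Date.parse(s.startsAt) &&
      t < Date.parse(s.endsAt)
  );
  return !silenced;
}

// Made-up silence for a one-hour rollout window.
const silences = [
  {
    reason: "Planned rollout of api-gateway",
    alertNames: ["api-gateway-pod-restarts"],
    startsAt: "2024-07-17T08:00:00Z",
    endsAt: "2024-07-17T09:00:00Z",
  },
];

const duringRollout = new Date("2024-07-17T08:30:00Z");
console.log(shouldPage({ name: "api-gateway-pod-restarts" }, silences, duringRollout));
console.log(shouldPage({ name: "checkout-error-rate" }, silences, duringRollout));
```

Because the silence lists alert names and expires on its own, a customer-facing alert that fires during the rollout still pages the on-call engineer.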

Solving such issues has to be a team effort involving the dev teams also. You can start by recognizing customer-facing uptime and having a sustainable on-call process as the priorities.

Incident Archaeology – Dig Into Your Services' Past With IncidentHub's Availability Page

· 3 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

A few weeks ago we released a feature on IncidentHub which gives you a historical view of your monitored services' availability.

Why Was This Needed?

On the dashboard where you can add services and channels, there is an overview panel that shows total incidents in the last 24 hours. You can get into a more detailed view by clicking on the button next to it. This opens up a popup where you can see active and resolved incidents - in the last 24 hours - and filter them by service.

View Incidents Popup

This panel is good enough for a quick view on what's affecting your dependent services. However, sometimes there is a need to look back further. This is what the Availability page gives you - an overview of service health over the last 30 days.

Let's look at a few examples:

  • You are investigating an outage with your applications which had a significant impact and more than one cause. One of the reasons was an outage with one of your third-party services. You are writing the post-mortem report after 2 days and need to refer to the third-party outage's incident report, which you can find on the Availability page.
  • After starting a long-running performance test, you look at the result after a couple of days and notice a blip in the graph. You suspect your cloud provider's network had an issue 2 days ago. You can check the Availability page for your cloud provider's health at that time.
  • One of your customers raised a support ticket complaining about an unavailable API a few days ago. You need to check your own historical metrics, and if there was an incident, correlate that with your third-party services' uptime.

The Availability page looks like this:

Availability Page

Digging Deeper

The green bars show days when everything was fine as reported by the service's own status page. The red bars indicate when there were one or more incidents.

If you hover over the red bars, you would see one of two things:

Single Incident Days

When there was a single incident on that day, it will be a link whose text says "View Incident Details". Clicking on it will take you to the official incident page of the service.

Single Incident Day

Multiple Incident Days

When the service had multiple incidents on that day, the link text will say "Multiple incidents - click to visit the status page". This will take you to the official status page of the service.

Multiple Incidents Day

Some incidents can span multiple days. The Availability bars are a high-level view of a service's availability - they don't show the exact time of the outage. It's a quick and easy way to view the status of your third-party dependencies.
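To make the per-day view concrete, here is a rough sketch of how incidents could be bucketed into daily bars, including incidents that span more than one day. This is not IncidentHub's implementation. The incident shape and `dailyIncidentCounts` are hypothetical, and days are counted in UTC for simplicity.

```javascript
// Count incidents overlapping each of the last `days` UTC days, oldest first.
function dailyIncidentCounts(incidents, days, today = new Date()) {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const todayStart = Date.UTC(
    today.getUTCFullYear(),
    today.getUTCMonth(),
    today.getUTCDate()
  );
  const bars = [];
  for (let i = days - 1; i >= 0; i--) {
    const dayStart = todayStart - i * DAY_MS;
    const dayEnd = dayStart + DAY_MS;
    const count = incidents.filter((inc) => {
      const start = Date.parse(inc.startedAt);
      // An unresolved incident counts for every day after it started.
      const end = inc.resolvedAt ? Date.parse(inc.resolvedAt) : Infinity;
      return start < dayEnd && end >= dayStart;
    }).length;
    bars.push({
      date: new Date(dayStart).toISOString().slice(0, 10),
      count,
      color: count === 0 ? "green" : "red",
    });
  }
  return bars;
}

// Made-up incidents: the first one spans midnight.
const incidents = [
  { startedAt: "2024-07-15T22:00:00Z", resolvedAt: "2024-07-16T02:00:00Z" },
  { startedAt: "2024-07-16T10:00:00Z", resolvedAt: "2024-07-16T11:00:00Z" },
];

console.log(dailyIncidentCounts(incidents, 3, new Date("2024-07-17T12:00:00Z")));
```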

Find it useful? Something missing? Let us know - we are always looking for feedback. You can reach us at support@incidenthub.cloud or on X @Incident_Hub.

Follow the blog's feed or our LinkedIn page for more updates on exciting new features.

Monitoring Specific Components and Regions in Your Third-Party Services

· 3 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Chances are, most of your third-party cloud and SaaS dependencies are globally distributed and have many regions of operation. Chances are, your applications use a subset of a cloud or SaaS service. If you are monitoring such a service, why should you receive alerts for all regions or every single component in the service?

E.g. if you use DigitalOcean, you might be using Kubernetes in their US locations (NYC and SFO). You would want to know only when there is an outage in one of these locations. DigitalOcean's status page gives you the option to subscribe to outages across the board - it’s all or nothing. This is the case with most services with a few exceptions.

Choosing Specific Components to Monitor

You can now choose which components/regions you wish to monitor in IncidentHub. Let us continue with our DigitalOcean example.

You can choose to monitor all components:

Monitor all components

or a subset that is relevant to you:

Monitor specific components

Once you save this configuration, you will be alerted only for outages that affect these components.

Adding/Removing Components

You can always go back and edit the components later. This is helpful when you start using, say, Kubernetes in a new region, or new components. In your IncidentHub dashboard, you should see the "Edit Components" button next to your list of services.

Edit components

Benefits

  • This new feature will help you to receive only relevant and actionable alerts. If you are a developer you need not worry about receiving irrelevant alerts for components your application does not even use.
  • SRE/Ops teams can react to infrastructure issues quicker without wading through noise and correlate those with outages reported in their own applications.
  • If you are in an IT Team with hundreds or thousands of users depending on tools like Zoom, Slack, or Google Workspace, you can react to issues before your users start logging helpdesk tickets.

This powerful new feature, which significantly reduces alert noise, is being rolled out to eligible services as of this writing. Log in to your IncidentHub account today to start customizing your monitoring settings. For a step-by-step guide on how to set up your custom monitoring preferences, check out our knowledge base article. We would love to hear how this new feature is working for you.

Watch this blog or our X/LinkedIn feeds for updates on more exciting new features.

Integrate Your Monitoring System With PagerDuty Using Events API V2

· 3 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

PagerDuty's Events API V2 lets you push events from your monitoring systems to PagerDuty. You can push such events when there is a triggered, updated, or resolved incident.

Incident Lifecycle

The lifecycle of an incident will typically go through these states

State | Triggered By | Source
Triggered | Automatic | Monitoring system
Acknowledged | On-call Engineer | PagerDuty app/Phone call
Updated | Automatic | Monitoring system
Resolved | On-call Engineer | PagerDuty app/Phone call

You can either use one of the PagerDuty client SDKs to send events, or roll your own.

PagerDuty Integration

Many self-hosted and SaaS monitoring tools have built-in PagerDuty integration. This involves getting a PagerDuty integration key and adding it to your monitoring tool's configuration.

How To Get a PagerDuty Integration Key

You can generate a PagerDuty integration key from your PagerDuty account. You can [follow these steps](https://docs.incidenthub.cloud/welcome-to-the-incidenthub-documentation/channels/pagerduty-integration) to get the key.

Integrating With the PagerDuty API

A typical event push will look like this (example in NodeJS):

import { event } from "@pagerduty/pdjs";

// ... (DEDUP_KEY is defined here: one unique, stable value per incident)
event({
  "data": {
    "routing_key": "Your-Routing-Key-Here",
    "event_action": "trigger",
    "dedup_key": DEDUP_KEY,
    "payload": {
      "summary": "Event processor in us-east-1",
      "source": "rnmd-2398.xyzcloud.io",
      "severity": "critical",
      "timestamp": "2024-07-17T08:42:58.315+0000",
    },
    // Optional links shown alongside the incident in PagerDuty
    "links": [
      {
        "href": "https://incidenthub.cloud/dashboard",
        "text": "Go to dashboard",
      },
    ],
  },
  // ...
})
  .then(console.log) // log PagerDuty's response
  .catch(console.error);

When your monitoring system sends this event to trigger an incident, it's important to have a unique DEDUP_KEY. This field determines whether subsequent events for this incident will be grouped together in PagerDuty. When your system sends an update, or a resolved event, the DEDUP_KEY must match the one sent during the trigger call. In other words, the DEDUP_KEY must be unique per incident.

IncidentHub integrates with PagerDuty and uses the incident's public URL as the DEDUP_KEY as that is unique globally, and also remains the same for an incident. Each incident update event has the same DEDUP_KEY.
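To illustrate how one key ties the lifecycle together, the sketch below builds Events API V2 bodies for a trigger and a resolve from the same incident, using the incident URL as the `dedup_key`. The `incident` shape and `buildPagerDutyEvent` helper are hypothetical, and this is not IncidentHub's actual code. It relies on the Events API V2 rule that `payload` is required for `trigger` events but not for `resolve`.

```javascript
// Build an Events API V2 request body from a normalized incident.
// The incident URL doubles as the dedup_key, so every update maps
// to the same PagerDuty incident.
function buildPagerDutyEvent(routingKey, incident) {
  const action = incident.status === "resolved" ? "resolve" : "trigger";
  const body = {
    routing_key: routingKey,
    event_action: action,
    dedup_key: incident.url,
  };
  if (action === "trigger") {
    body.payload = {
      summary: incident.title,
      source: incident.provider,
      severity: incident.severity || "critical", // assumed default
    };
  }
  return body;
}

// Made-up incident for illustration only.
const open = {
  url: "https://status.example.com/incidents/abc123",
  title: "Elevated API errors",
  provider: "ExampleCloud",
  status: "investigating",
};
const resolved = { ...open, status: "resolved" };

console.log(buildPagerDutyEvent("Your-Routing-Key-Here", open));
console.log(buildPagerDutyEvent("Your-Routing-Key-Here", resolved));
```

Sending a second `trigger` with the same `dedup_key` updates the existing PagerDuty incident instead of opening a new one.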

Let us look at a Google Cloud example. An incident affecting Anthos Service Mesh in Nov 2023 went through 4 updates including trigger and resolve. The URL remained the same for the incident as it went through the lifecycle.

Google cloud incident lifecycle

Conclusion

Integrating with the PagerDuty API is straightforward. While integrating, make sure you keep these things in mind:

  • Map your monitoring system's alert priorities to the appropriate PagerDuty levels
  • Use the correct DEDUP_KEY for each case
  • Provide additional details about the incident using the custom fields. E.g. in the example above, you can provide more information about the incident using a custom link pointing to the dashboard
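The first point can be as small as a lookup table. The Events API V2 `severity` field accepts `critical`, `error`, `warning`, and `info`. The P1-P4 priority names and the fallback to `warning` below are assumptions for illustration, so substitute your own monitoring system's levels.

```javascript
// Map internal alert priorities (assumed P1-P4) to PagerDuty severities.
const SEVERITY_BY_PRIORITY = new Map([
  ["P1", "critical"],
  ["P2", "error"],
  ["P3", "warning"],
  ["P4", "info"],
]);

// Unknown priorities fall back to "warning" (an arbitrary, assumed default).
function toPagerDutySeverity(priority) {
  return SEVERITY_BY_PRIORITY.get(priority) ?? "warning";
}

console.log(toPagerDutySeverity("P1"));
console.log(toPagerDutySeverity("P4"));
```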


Monitoring Third Party Vendors as an Ops Engineer/SRE

· 3 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Why should you monitor your third-party Cloud and SaaS vendors if you are in SRE/Ops?

As part of an SRE team, your primary responsibility is ensuring the reliability of your applications. What makes you responsible for monitoring services that you don't even manage? Third-party services are just like yours - with SLAs. And outages happen, affecting you as well as many others who depend on them.

It's a no-brainer that you should know when such outages happen, so that you are on top of things if and when they affect your running applications.

Most of your third-party dependencies will have a public status page or a Twitter account where they publish updates on their outages. Here are some seemingly easy ways to monitor these pages:

  • Subscribe to the RSS feed of these pages
  • Follow the Twitter account
  • Sign up for Slack, Email, SMS notifications on the status page itself if the page supports these

But if you have tried it, you know it's not that easy:

  • Not all pages have RSS feeds
  • Some have Slack, Email, SMS integration - some don't
  • Some don't have a Twitter account
  • You need to sign up on all of these pages one by one, and all services may not support the same notification channel

You can easily end up doing this one by one for 10-15 or more service providers. Let's do a quick check. Which services in this list below do you use in your stack?

  • DNS - GCP/GoDaddy/UltraDNS/Route53
  • Cloud/PaaS - GCP/AWS/Azure/DigitalOcean/Heroku/Render/Railway/Hetzner
  • Monitoring - Grafana Cloud/DataDog/New Relic/SolarWinds
  • On-call management - PagerDuty/OpsGenie
  • Email - Google Workspace/Zoho
  • Communication - Zoom/Slack
  • Collaboration - Atlassian Jira/Confluence
  • Source code - GitLab/GitHub
  • CI/CD/GitOps - TravisCI/CircleCI/CodeFresh
  • CDN/Content delivery - Cloudflare/CDNJS/Fastly/Akamai
  • SMTP providers - SMTP.com/SendGrid
  • Payments - PayPal/Stripe
  • Artifact Repo - Maven/DockerHub/Quay.io
  • Others - OpenAI/Apple Dev Platform/Meta Platform
  • Marketing - MailChimp/Hubspot
  • Auth - Okta/Clerk/Auth0

This is a small list. You may not have all of these, or may have more/others, but you get the point.

Like any self-respecting Ops Engineer/SRE, you would probably want to whip up a script and write this check-pages-and-notify-in-one-place tool by yourself. I know, because I've worked in Ops/SRE roles for the better part of my career, and NIH (Not Invented Here) is a very real thing. Here's why it's not a great idea:

  • Any software you write has to be maintained. Say your org starts using a new service which does not have an RSS feed on the status page. What now?
  • Who monitors the monitor? How do you know when your script is not running?
  • You probably have better uses for your time
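To make the "who monitors the monitor" point concrete: even a small script needs something else watching it, for example a heartbeat check that runs on a separate system. A minimal sketch, assuming your script records the time of its last successful run somewhere the checker can read it (`isHeartbeatStale` is a hypothetical helper):

```javascript
// Report whether the monitoring script has missed its expected run.
// A missing or unparseable timestamp is treated as stale.
function isHeartbeatStale(lastRunIso, maxAgeMinutes, now = new Date()) {
  const ageMs = now.getTime() - Date.parse(lastRunIso);
  return Number.isNaN(ageMs) || ageMs > maxAgeMinutes * 60 * 1000;
}

const now = new Date("2024-07-17T09:00:00Z");
console.log(isHeartbeatStale("2024-07-17T08:58:00Z", 5, now)); // ran 2 minutes ago
console.log(isHeartbeatStale("2024-07-17T08:30:00Z", 5, now)); // ran 30 minutes ago
```

This check is itself one more piece of software to run and maintain, which is exactly the problem described above.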

IncidentHub was built to solve precisely these problems - so you can focus on what's important, and hand off monitoring third-party services to something that was built with that goal in mind. So stop hacking together scripts to monitor public status pages, and try it out.

The Benefits of a Single Incident Management System

· 2 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

How many monitoring tools do you have?

Chances are at least 2-3. One tool usually does not cover all cases, and it’s usually a combination of self-managed and managed tools. Self-managed gives you more control over custom configurations and cost. Managed ones take away the headache of running it yourself.

Prometheus is the de-facto standard for monitoring these days if you have a modern application stack and you want to manage your own monitoring. It is metrics-based, i.e., it uses metrics as the source of data from all the monitored systems. There are ready-made exporters for almost all popular infrastructure components. You can send your application and business metrics to Prometheus too with OpenTelemetry exporters.

This model does not work for all aspects of your service. E.g. If you want to monitor external properties like your website, or use synthetic monitoring to check your customer-facing APIs from global locations, you could use something like Pingdom or UptimeRobot. This becomes another source of data about your service's uptime.

Many Monitors, One Incident Management System

A downside of having more than one monitoring system in place, regardless of the need, is that you have multiple sources of data. You have to consult multiple systems if you want to know the overall status. However, it is important that you receive alerts in one single incident and on-call management system. This allows a single place from where your on-call teams can get paged.

So ensuring that all your monitoring tools can integrate with your on-call system is crucial.
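When a tool has no native integration, the glue is usually a small adapter that converts each source's alert format into one common shape before it reaches the on-call system. The sketch below assumes the Prometheus Alertmanager webhook payload (an `alerts` array with `status`, `labels`, `annotations`, and `fingerprint`). The uptime-check shape and the common output shape are hypothetical.

```javascript
// Convert an Alertmanager webhook payload into a common alert shape.
function fromAlertmanager(payload) {
  return payload.alerts.map((a) => ({
    source: "prometheus",
    key: a.fingerprint, // stable per alert, usable for de-duplication
    status: a.status === "resolved" ? "resolved" : "firing",
    summary: (a.annotations && a.annotations.summary) || a.labels.alertname,
    severity: a.labels.severity || "unknown",
  }));
}

// Convert a result from a hypothetical uptime checker into the same shape.
function fromUptimeCheck(check) {
  return {
    source: "uptime",
    key: check.url,
    status: check.up ? "resolved" : "firing",
    summary: `${check.url} is ${check.up ? "up" : "down"}`,
    severity: "critical",
  };
}

// Made-up inputs for illustration only.
const amPayload = {
  status: "firing",
  alerts: [
    {
      status: "firing",
      labels: { alertname: "HighErrorRate", severity: "critical" },
      annotations: { summary: "5xx rate above 5%" },
      fingerprint: "a1b2c3",
    },
  ],
};
const unified = [
  ...fromAlertmanager(amPayload),
  fromUptimeCheck({ url: "https://www.example.com", up: false }),
];

console.log(unified);
```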

A typical Prometheus setup might look like:

Monitoring setup

If you have other monitoring systems, you should be able to route those alerts into your on-call/incident response system. Most tools support this:

Monitoring setup

IncidentHub monitors your external SaaS and cloud providers and notifies you when they have incidents. It can easily integrate into your existing incident management system.

Monitoring setup

If you’re using PagerDuty, just add a PagerDuty channel and you’re good to go. Check out the documentation for more.

Monitoring Your Third-Party Cloud and SaaS Services is Critical

· 3 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

If you have a software-based business, you are using at least a few cloud based tools. It does not matter if you are a solo developer, or part of a 50-member team in a large organization. Take this random list and chances are you are using at least half of them:

Your entire business - irrespective of org or market size - including your development tools, collaboration/communication tools, infrastructure and hosting, monitoring, even email - is dependent on services that you don’t control. They are provided by other vendors.

Of course, you pay for some of them and they all have SLAs. Having an SLA does not translate to 100% uptime. Companies will try their best to meet SLAs - which promise a percentage of uptime (usually 99.xx%). There are going to be incidents in your providers at some point, and the effect will cascade to the service that you provide to your customers. This means that your own product’s SLA can be breached due to causes outside your control.
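A quick back-of-the-envelope calculation shows how this plays out. The sketch below converts an uptime percentage into allowed downtime, and multiplies the availabilities of services you depend on to estimate the best case for your own product. It assumes a 30-day month, that every listed service is a hard dependency, and that failures are independent. Real SLAs define their own measurement windows and exclusions.

```javascript
// Minutes of downtime permitted by an uptime percentage over a period.
function allowedDowntimeMinutes(slaPercent, days = 30) {
  return (1 - slaPercent / 100) * days * 24 * 60;
}

// Combined availability (in percent) when every dependency must be up.
function compositeAvailability(percents) {
  return percents.reduce((acc, p) => acc * (p / 100), 1) * 100;
}

// 99.9% over 30 days allows roughly 43.2 minutes of downtime.
console.log(allowedDowntimeMinutes(99.9).toFixed(1));

// Two 99.9% dependencies in series give roughly 99.8% combined.
console.log(compositeAvailability([99.9, 99.9]).toFixed(4));
```

The more third-party services sit in your critical path, the lower the ceiling on the uptime you can promise your own customers.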

Can you not ask the service provider to notify you directly when this happens? Unlikely, unless you are a really big enterprise. However, most of them have public status pages where you can sign up to receive these alerts over SMS, email, Slack, etc.

The downside is - if you have 50 such services, you have to sign up on 50 pages, one by one. If you want to change your notification channel (another Slack channel, or SMS instead of Slack), you have to edit it on each of those 50 pages.

How does knowing about such issues help you? A few examples (true stories) will illustrate this:

  • Public cloud outages that are yet to impact your applications. If you get to know beforehand that your cloud vendor has an ongoing incident in your region, you can take preventive steps so that your applications are not affected.
  • Paging service outages. Your on-call teams can miss alerts because your paging service is unable to send alerts.
  • Delayed/missing messages in your communication tool. Your remote teams are not in sync because your comm tool is dropping only some, not all, messages.
  • Your hosted Git repo is throwing errors, while your customer waits for a critical bug fix.

Knowing that there is something wrong with the SaaS/cloud provider gives you an opportunity to do something about it, proactively.

There is no single place, no easy way where you can:

  • Choose services to monitor
  • Choose a channel to receive alerts

This is why we built IncidentHub - based on years of real-world experience. The UI is very simple so that receiving your first alert does not involve more than 2 steps. Check out the demo video below, and try it out yourself at https://incidenthub.cloud/

Originally published on LinkedIn