
5 posts tagged with "ops"

IncidentHub posts related to ops


When Alerts Don’t Mean Downtime - Preventing SRE Fatigue

· 3 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Introduction

A recent question in an SRE forum triggered this train of thought.

How do I deal with alerts that are triggered by internal patching/release activities but don't actually cause a downtime? If we react to these alerts we might not have time to react to actual alerts that are affecting customers.

I've paraphrased the question to reflect its essence. There is plenty to unravel here.

My first reaction was that the SRE who posted this is in a difficult place, dealing with systemic issues.

Systemic Issues

Without knowing more about the org and their alerting policies, let's look at what we can dig out based on this question alone:

  • Patches/deployments trigger alerts
  • The team ignores such alerts to avoid spending valuable time that could go towards fixing downtime that is affecting customers
  • There is cognitive overhead in selectively reacting to some alerts and ignoring others
  • Which alerts to react to is knowledge that lives only within the SRE team
  • Any MTTx data from such a setup is useless

The eventual impact is sub-optimal incident management, which in turn affects SLAs and burns out the on-call folks.

Improving the SRE Experience

How would you approach fixing something like this?

Some thoughts, in no particular order:

  • Setting the correct priority for alerts - Anything that affects customer perception of uptime, or can lead to data loss, is a P1. In larger organizations with independent teams responsible for their own microservices, I would extend the definition of customer to any team in your org that depends on your service(s). If you are responsible for an API used by a downstream service, they are your customers too.

  • Zero-downtime deployments - This is not as hard as it sounds if you design your systems with this goal in mind. For stateless web applications it is trivial to switch to a new version behind a load balancer. For stateful applications it can take a bit more work.

  • Maintenance mode - This can fall into two categories: maintenance mode that has to be communicated to the customer, and maintenance mode that is internal, affecting other teams who consume your service. At the alerting level, you temporarily silence the specific alerts that the rollout will trigger (see the sketch after this list).

  • Investigate all alerts and disable useless ones - Not looking at an alert creates indeterminism and can lead to alert fatigue. The alerting system should be the single source of truth.
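
As an illustration of the maintenance-mode idea above, here is a minimal sketch that creates a temporary silence before a rollout, assuming your alerts are routed through Prometheus Alertmanager and your pipeline runs on Node 18+ (the endpoint, service name, and alert names below are hypothetical):

// maintenance-silence.js - create a time-bounded Alertmanager silence before a planned rollout.
const ALERTMANAGER_URL = "http://alertmanager.internal:9093"; // hypothetical internal endpoint

async function silenceDeploymentAlerts(service, durationMinutes = 120) {
  const now = new Date();
  const end = new Date(now.getTime() + durationMinutes * 60 * 1000);

  // Silence only the alerts this rollout is expected to trigger, and only for the maintenance window.
  const silence = {
    matchers: [
      { name: "service", value: service, isRegex: false },
      { name: "alertname", value: "HighErrorRate|PodRestarting", isRegex: true },
    ],
    startsAt: now.toISOString(),
    endsAt: end.toISOString(),
    createdBy: "deploy-pipeline",
    comment: `Planned rollout of ${service}`,
  };

  const res = await fetch(`${ALERTMANAGER_URL}/api/v2/silences`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(silence),
  });
  if (!res.ok) throw new Error(`Failed to create silence: ${res.status}`);
  const { silenceID } = await res.json();
  return silenceID; // keep this so the pipeline can expire the silence early if the rollout finishes sooner
}

silenceDeploymentAlerts("checkout-api").then((id) => console.log("Silence created:", id));

The silence expires on its own, so a forgotten cleanup step does not leave alerts muted indefinitely.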

Solving such issues has to be a team effort that involves the dev teams too. You can start by making customer-facing uptime and a sustainable on-call process the priorities.

Incident Archaeology – Dig Into Your Services' Past With IncidentHub's Availability Page

· 3 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

A few weeks ago we released a feature on IncidentHub which gives you a historical view of your monitored services' availability.

Why Was This Needed?

On the dashboard where you can add services and channels, there is an overview panel that shows total incidents in the last 24 hours. You can get into a more detailed view by clicking on the button next to it. This opens up a popup where you can see active and resolved incidents - in the last 24 hours - and filter them by service.

View Incidents Popup

This panel is good enough for a quick view of what's affecting your dependent services. However, sometimes you need to look back further. This is what the Availability page gives you - an overview of service health over the last 30 days.

Let's look at a few examples:

  • You are investigating an outage in your applications that had a significant impact and more than one cause. One of the causes was an outage in one of your third-party services. You are writing the post-mortem report two days later and need to refer to that third-party incident report, which you can find on the Availability page.
  • After starting a long-running performance test, you look at the result after a couple of days and notice a blip in the graph. You suspect your cloud provider's network had an issue 2 days ago. You can check the Availability page for your cloud provider's health at that time.
  • One of your customers raised a support ticket complaining about an unavailable API a few days ago. You need to check your own historical metrics, and if there was an incident, correlate that with your third-party services' uptime.

The Availability page looks like this:

Availability Page

Digging Deeper

The green bars show days when everything was fine as reported by the service's own status page. The red bars indicate when there were one or more incidents.

If you hover over the red bars, you will see one of two things:

Single Incident Days

When there was a single incident on that day, you will see a link that says "View Incident Details". Clicking it takes you to the service's official incident page.

Single Incident Day

Multiple Incident Days

When the service had multiple incidents on that day, the link text will say "Multiple incidents - click to visit the status page". This will take you to the official status page of the service.

Multiple Incidents Day

Some incidents can span multiple days. The Availability bars are a high-level view of a service's availability - they don't show the exact time of the outage. It's a quick and easy way to view the status of your third-party dependencies.

Find it useful? Something missing? Let us know - we are always looking for feedback. You can reach us at support@incidenthub.cloud or on X @Incident_Hub.

Follow the blog's feed or our LinkedIn page for more updates on exciting new features.

Monitoring Specific Components and Regions in Your Third-Party Services

· 3 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Chances are, most of your third-party cloud and SaaS dependencies are globally distributed and have many regions of operation. Chances are, your applications use a subset of a cloud or SaaS service. If you are monitoring such a service, why should you receive alerts for all regions or every single component in the service?

For example, if you use DigitalOcean, you might be using Kubernetes in their US locations (NYC and SFO). You would want to know only when there is an outage in one of these locations. DigitalOcean's status page, however, only lets you subscribe to outages across the board - it's all or nothing. This is the case with most services, with a few exceptions.

Choosing Specific Components to Monitor

You can now choose which components/regions you wish to monitor in IncidentHub. Let us continue with our DigitalOcean example.

You can choose to monitor all components:

Monitor all components

or a subset that is relevant to you:

Monitor specific components

Once you save this configuration, you will be alerted only for outages that affect these components.

Adding/Removing Components

You can always go back and edit the components later. This is helpful when you start using, say, Kubernetes in a new region, or new components. In your IncidentHub dashboard, you should see the "Edit Components" button next to your list of services.

Edit components

Benefits

  • This new feature helps you receive only relevant and actionable alerts. If you are a developer, you need not worry about receiving irrelevant alerts for components your application does not even use.
  • SRE/Ops teams can react to infrastructure issues more quickly without wading through noise, and correlate them with outages reported in their own applications.
  • If you are in an IT Team with hundreds or thousands of users depending on tools like Zoom, Slack, or Google Workspace, you can react to issues before your users start logging helpdesk tickets.

This powerful new feature, which significantly reduces alert noise, is being rolled out to eligible services as of this writing. Log in to your IncidentHub account today to start customizing your monitoring settings. For a step-by-step guide on how to set up your custom monitoring preferences, check out our knowledge base article. We would love to hear how this new feature is working for you.

Watch this blog or our X/LinkedIn feeds for updates on more exciting new features.

Integrate Your Monitoring System With PagerDuty Using Events API V2

· 2 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

PagerDuty's Events API V2 lets you push events from your monitoring systems to PagerDuty. You can push such events when there is a triggered, updated, or resolved incident.

The lifecycle of an incident typically goes through these states:

State          Triggered By        Source
Triggered      Automatic           Monitoring system
Acknowledged   On-call Engineer    PagerDuty app/Phone call
Updated        Automatic           Monitoring system
Resolved       On-call Engineer    PagerDuty app/Phone call

You can use any of the PagerDuty client SDKs to send events, or roll your own.

Self-hosted and SaaS monitoring tools typically have a built-in PagerDuty integration where you only need to provide the API key.

A typical event push looks like this (example in NodeJS):

import { event } from "@pagerduty/pdjs";

// .....
event({
  "data": {
    "routing_key": "Your-Routing-Key-Here",
    "event_action": "trigger",
    "dedup_key": DEDUP_KEY,
    "payload": {
      "summary": "Event processor in us-east-1",
      "source": "rnmd-2398.xyzcloud.io",
      "severity": "critical",
      "timestamp": "2024-07-17T08:42:58.315+0000",
    },
    "links": [
      {
        "href": "https://incidenthub.cloud/dashboard",
        "text": "Go to dashboard",
      },
    ],
  },
});
// .....

When your monitoring system sends this event to trigger an incident, it's important to have a unique DEDUP_KEY. This field determines whether subsequent events for this incident will be grouped together in PagerDuty. When your system sends an update, or a resolved event, the DEDUP_KEY must match the one sent during the trigger call. In other words, the DEDUP_KEY must be unique per incident.
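
For example, the later resolve call might look like this - a sketch reusing the same @pagerduty/pdjs client, where the routing key is a placeholder and DEDUP_KEY is the same value that was sent with the trigger event:

import { event } from "@pagerduty/pdjs";

// Resolve the incident that the earlier trigger event opened.
// dedup_key must match the trigger exactly, otherwise PagerDuty treats this as a new event.
event({
  "data": {
    "routing_key": "Your-Routing-Key-Here",
    "event_action": "resolve",
    "dedup_key": DEDUP_KEY,
  },
}).then(() => console.log("Resolve event sent"));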

IncidentHub's PagerDuty integration uses the incident's public URL as the DEDUP_KEY, since it is globally unique and stays the same for the life of the incident. Every update event for that incident carries the same DEDUP_KEY.

Let us look at a Google Cloud example. An incident affecting Anthos Service Mesh in Nov 2023 went through 4 updates including trigger and resolve. The URL remained the same for the incident as it went through the lifecycle.


Monitoring Third Party Vendors as an Ops Engineer/SRE

· 3 min read
Hrishikesh Barua
Founder @IncidentHub.cloud

Why should you monitor your third-party Cloud and SaaS vendors if you are in SRE/Ops?

As part of an SRE team, your primary responsibility is ensuring the reliability of your applications. What makes you responsible for monitoring services that you don't even manage? Third-party services are just like yours - with SLAs. And outages happen, affecting you as well as many others who depend on them.

It's a no-brainer that you should know when such outages happen, so that you can stay on top of things if and when they affect your running applications.

Most of your third-party dependencies will have a public status page or a Twitter account where they publish updates on their outages. Here are some seemingly easy ways to monitor these pages:

  • Subscribe to the RSS feed of these pages
  • Follow the Twitter account
  • Sign up for Slack, Email, SMS notifications on the status page itself if the page supports these

But if you have tried it, you know it's not that easy:

  • Not all pages have RSS feeds
  • Some have Slack, Email, SMS integration - some don't
  • Some don't have a Twitter account
  • You need to sign up on all of these pages one by one, and all services may not support the same notification channel

You can easily end up doing this one by one for 10-15 or more service providers. Let's do a quick check. Which services in this list below do you use in your stack?

  • DNS - GCP/GoDaddy/UltraDNS/Route53
  • Cloud/PaaS - GCP/AWS/Azure/DigitalOcean/Heroku/Render/Railway/Hetzner
  • Monitoring - Grafana Cloud/DataDog/New Relic/SolarWinds
  • On-call management - PagerDuty/OpsGenie
  • Email - Google Workspace/Zoho
  • Communication - Zoom/Slack
  • Collaboration - Atlassian Jira/Confluence
  • Source code - GitLab/GitHub
  • CI/CD/GitOps - TravisCI/CircleCI/CodeFresh
  • CDN/Content delivery - Cloudflare/CDNJS/Fastly/Akamai
  • SMTP providers - SMTP.com/SendGrid
  • Payments - PayPal/Stripe
  • Artifact Repo - Maven/DockerHub/Quay.io
  • Others - OpenAI/Apple Dev Platform/Meta Platform
  • Marketing - MailChimp/Hubspot
  • Auth - Okta/Clerk/Auth0

This is a small list. You may not have all of these, or may have more/others, but you get the point.

Like any self-respecting Ops Engineer/SRE, you would probably want to whip up a script and write this check-pages-and-notify-in-one-place tool yourself. I know, because I've worked in Ops/SRE roles for the better part of my career, and NIH is a very real thing. Here's why it's not a great idea (a sketch of such a script follows the list below):

  • Any software you write has to be maintained. Say your org starts using a new service which does not have an RSS feed on the status page. What now?
  • Who monitors the monitor? How do you know when your script is not running?
  • You probably have better uses for your time
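
To make the maintenance point concrete, here is roughly what the DIY version looks like - a hypothetical sketch that polls a single status-page RSS feed and posts new items to a Slack webhook (the feed URL and webhook are placeholders, and it assumes Node 18+ with the rss-parser package):

// status-watcher.js - hypothetical DIY status-page poller, for illustration only.
import Parser from "rss-parser";

const FEED_URL = "https://status.example-vendor.com/history.rss"; // placeholder feed
const SLACK_WEBHOOK = process.env.SLACK_WEBHOOK_URL;
const seen = new Set(); // lost on restart - a real version needs persistent state

async function poll() {
  const feed = await new Parser().parseURL(FEED_URL);
  for (const item of feed.items) {
    if (seen.has(item.link)) continue; // skip incidents we have already notified about
    seen.add(item.link);
    await fetch(SLACK_WEBHOOK, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: `Vendor incident: ${item.title} ${item.link}` }),
    });
  }
}

// Poll every 5 minutes - and you still need something to tell you when this loop itself stops running.
setInterval(() => poll().catch(console.error), 5 * 60 * 1000);

Now multiply this by every vendor in the list above, add the status pages that have no feed at all, and find somewhere reliable to run it.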

IncidentHub was built to solve precisely these problems - so you can focus on what's important, and hand off monitoring third-party services to something that was built with that goal in mind. So stop hacking together scripts to monitor public status pages, and try it out.