Business Continuity

Continuous Monitoring: The Difference Between Knowing and Reacting

Continuous monitoring is sold as a dashboard product. It's actually an operating discipline. Here's what separates the teams that catch incidents early from the teams that learn about them from customers.

John Lane 2022-06-23 7 min read

"Continuous monitoring" is one of those phrases that has been sanded down by marketing until it means whatever the vendor wants it to mean. Most monitoring products are sold on the strength of pretty dashboards, hundreds of pre-built checks, and promises of AI-powered anomaly detection. Those are features. They are not continuous monitoring.

Continuous monitoring, as a discipline, is the practice of knowing the state of your systems well enough and fast enough to act on a problem before somebody else notices it. The dashboard is a tool. The alerting is a tool. The logs and metrics and traces are inputs. The actual thing — the capability — is an operating posture that most organizations aspire to and few achieve.

Here are three insights that separate the teams that run continuous monitoring from the teams that merely own a monitoring product.

1. Every Alert Must Be Actionable — Or It's Training People to Ignore You

The single biggest failure mode in monitoring programs is alert fatigue. It starts innocently: you deploy a new monitoring tool, you enable the out-of-the-box alerting policies, and suddenly your on-call engineer is getting 40 pages a day. Most of them are noise. Some are duplicates. A few are legitimate but not urgent. Maybe one or two per week are real.

Within three weeks, the on-call engineer is filtering pages by pattern-matching the subject line and ignoring categories of alerts entirely. Within three months, the useful alerts are getting lost in the noise and nobody trusts the system.

The fix is brutal but straightforward: every alert must have a runbook, a clear definition of "actionable," and an owner who is accountable for removing the alert if it turns out to be noise.
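To make that concrete, here's a minimal sketch of an alert catalog kept in code. The field names and the audit check are illustrative, not tied to any particular monitoring product:

```python
from dataclasses import dataclass

@dataclass
class AlertDefinition:
    """One entry in the alert catalog."""
    name: str
    owner: str            # team or person accountable for pruning this alert
    runbook_url: str      # what the responder actually does about it
    pages_oncall: bool    # True means it wakes somebody up; False means dashboard only

def audit_catalog(catalog: list[AlertDefinition]) -> list[str]:
    """Return the names of paging alerts missing a runbook or an owner."""
    return [
        a.name for a in catalog
        if a.pages_oncall and (not a.runbook_url or not a.owner)
    ]

# Example: flag any pager alert that nobody owns or documents.
catalog = [
    AlertDefinition("disk-90-percent", "platform-team",
                    "https://wiki.example.com/runbooks/disk-full", True),
    AlertDefinition("api-5xx-spike", "", "", True),  # fails the audit
]
print(audit_catalog(catalog))  # ['api-5xx-spike']
```

The tooling doesn't matter; what matters is that "has a runbook and an owner" becomes something you can enforce mechanically rather than hope for.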

What "actionable" means

An alert is actionable if and only if the recipient can, within 15 minutes of receiving it, answer three questions:

  1. Is this a real problem right now?
  2. What do I do about it?
  3. If I don't do anything, what happens?

If the answer to #1 is "I don't know, I have to investigate for an hour first," it's not an actionable alert. It's a data point that might later become an alert. Put it in a dashboard, not a pager.

If the answer to #2 is "I don't know, call the guy who wrote the system," it's not an actionable alert. It's an escalation trigger. Build a runbook or get rid of the alert.

If the answer to #3 is "nothing really, the system will self-heal," it's not an alert at all. It's telemetry. Turn off the page.

The discipline

Great monitoring teams ruthlessly prune their alert catalog. When an alert fires and the responder determines it was noise or not actionable, the next step is not to close the ticket — it's to fix the alert. Either raise the threshold, add a correlation rule, delete it entirely, or replace it with something that fires on the symptom the user actually cares about.

The monitoring systems that work have fewer alerts than they had a year ago, not more.

2. Monitor the User Experience, Not Just the Infrastructure

The second insight is philosophical but it drives real architecture decisions. Most monitoring programs start by instrumenting infrastructure: CPU, memory, disk, network, database connection counts, queue depths. These are necessary but not sufficient. They tell you how your systems feel. They do not tell you whether anybody can use them.

I've seen environments where every infrastructure metric was green and the website was down for external users because of a DNS misconfiguration, a certificate expiration, an upstream CDN problem, or a firewall rule added yesterday that nobody remembered. The monitoring stack had no idea anything was wrong because nothing in its field of view was broken.

The fix is to monitor the outcome, not just the cause.

Synthetic transactions

Run synthetic checks that exercise the actual user paths. Log in as a test user every 60 seconds. Add an item to the cart. Submit a form. Query the API the way a real client would. Do it from outside your network, from multiple geographies, from the actual client platforms your users run on.
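A hand-rolled version of a single check might look something like the sketch below. The URL, credentials, and thresholds are placeholders; in practice you would run this from a hosted synthetic-monitoring service or from workers outside your own network, not from a loop on one box:

```python
import time
import requests

# Placeholder values; a real check pulls these from config and a secrets store.
LOGIN_URL = "https://app.example.com/api/login"
TEST_USER = {"username": "synthetic-check", "password": "from-a-secrets-store"}
TIMEOUT_SECONDS = 10
LATENCY_BUDGET_SECONDS = 3.0
CHECK_INTERVAL_SECONDS = 60

def run_check() -> bool:
    """Exercise the real login path the way a client would."""
    started = time.monotonic()
    try:
        resp = requests.post(LOGIN_URL, json=TEST_USER, timeout=TIMEOUT_SECONDS)
        elapsed = time.monotonic() - started
        return resp.status_code == 200 and elapsed < LATENCY_BUDGET_SECONDS
    except requests.RequestException:
        return False

while True:
    if not run_check():
        # In practice this pages on-call through your alerting system, not stdout.
        print("synthetic login check FAILED")
    time.sleep(CHECK_INTERVAL_SECONDS)
```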

When the synthetic check fails, you know somebody is having the same problem — possibly everybody. When the synthetic check succeeds and the infrastructure is screaming, you know the screaming is less urgent than it appears.

This costs money. Good synthetic monitoring runs $200 to $2,000 per month depending on coverage. It is among the best money you will spend on monitoring because it converts vague infrastructure noise into a clear yes/no on the question "is the service working right now?"

Real user monitoring

Synthetic monitoring catches the problems affecting nobody or everybody. Real user monitoring (RUM) catches the long tail — the problems affecting 3% of users in a specific region, on a specific browser, at a specific time of day. It also surfaces performance degradation before it becomes an outage.

A service that takes 4 seconds to load instead of 400 milliseconds is not "down" by any infrastructure metric. But your users are leaving. RUM tells you this is happening. Infrastructure monitoring does not.
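As a toy illustration, a per-segment percentile over RUM beacons surfaces exactly the kind of problem a global average hides. The beacon format below is invented for the example:

```python
from collections import defaultdict
from statistics import quantiles

# Each beacon: (region, browser, page_load_seconds). Invented data.
beacons = [
    ("us-east", "chrome", 0.4), ("us-east", "chrome", 0.5),
    ("us-east", "firefox", 0.6), ("eu-west", "chrome", 0.5),
    ("eu-west", "safari", 3.8), ("eu-west", "safari", 4.1),
]

by_segment = defaultdict(list)
for region, browser, seconds in beacons:
    by_segment[(region, browser)].append(seconds)

for segment, timings in by_segment.items():
    p95 = quantiles(timings, n=20)[-1] if len(timings) > 1 else timings[0]
    if p95 > 3.0:  # threshold is arbitrary for the example
        print(f"{segment}: p95 load time {p95:.1f}s, and these users are leaving")
```

Averaged across all of those beacons, the numbers look tolerable; the eu-west Safari segment does not.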

3. Know the Difference Between Detection and Response Time

The third insight is the one that separates teams that look like they have continuous monitoring from teams that actually do. It's not about detection — it's about the gap between detection and action.

I'll give you two scenarios. Both teams have the same monitoring tool, the same alerts, the same runbooks.

Team A: An alert fires at 2:47 AM. The on-call engineer receives a page. The engineer is asleep, wakes up, looks at the phone, acknowledges the alert, gets out of bed, opens a laptop, connects to VPN, logs into the monitoring system, reads the runbook, connects to the affected server, runs diagnostic commands, identifies the problem, implements the fix, verifies the fix, and updates the ticket. Elapsed time: 38 minutes. During those 38 minutes, the service was degraded and some users were affected.

Team B: Same alert, same time. The engineer receives the same page. But the runbook is attached to the alert as a one-click link, the fix is pre-authorized (the engineer doesn't need to call a manager at 2:47 AM), and the monitoring system has a pre-built "apply recommended remediation" button that runs a tested script. The engineer acknowledges, verifies the context, clicks the button, watches it work, and goes back to sleep. Elapsed time: 6 minutes.

Both teams "caught" the incident. Only one team had an operational response capability.

The levers that matter

The gap between detection and response is where most monitoring programs leave value on the table. The levers to shorten it are not flashy but they work:

  • Pre-authorized runbooks — the engineer does not need to wake up a manager to restart a service or scale a pool during a known-good incident type
  • Automation for known remediation patterns — restart the process, rotate the certificate, fail over the database, scale the compute, drain the node; whatever the common fix is, script it and put it a click away (see the sketch after this list)
  • Context at the alert — include recent deploys, recent config changes, relevant dashboards, recent similar incidents. The engineer should not have to hunt for context at 3:00 AM
  • Clear ownership — who is allowed to make the call on disruptive actions (failover, rollback, emergency maintenance window) and how do you reach them
  • Blameless postmortems that actually drive change — the purpose of the postmortem is to reduce the response time for the next incident, not to assign blame for the current one
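For the automation lever, the shape is usually a small, well-tested script behind a single pre-approved action. Here's a minimal sketch, with placeholder alert names and remediation commands:

```python
import subprocess
import sys

# Placeholder remediations keyed by alert name. Every entry here should map to a
# runbook step that is already pre-authorized for on-call to run at 2:47 AM.
REMEDIATIONS = {
    "web-worker-oom": ["systemctl", "restart", "web-worker"],
    "cert-near-expiry": ["/usr/local/bin/rotate-cert.sh", "web-frontend"],
}

def remediate(alert_name: str) -> int:
    """Run the pre-approved fix for a known alert; refuse anything unknown."""
    command = REMEDIATIONS.get(alert_name)
    if command is None:
        print(f"no pre-authorized remediation for {alert_name}; escalate instead")
        return 1
    print(f"remediating {alert_name}: {' '.join(command)}")
    return subprocess.run(command).returncode

if __name__ == "__main__":
    # usage: remediate.py <alert-name>
    sys.exit(remediate(sys.argv[1]))
```

The script itself is trivial. The work is in testing it, pre-authorizing it, and wiring it to the alert so it really is one click away.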

What to measure

If you want to know whether your continuous monitoring program is actually working, measure two things:

  1. Mean time to detect (MTTD) — how long between the problem starting and somebody knowing about it
  2. Mean time to resolve (MTTR) — how long between somebody knowing and the problem being fixed

MTTD is a monitoring question. MTTR is a response capability question. A mature program drives both down over time, but MTTR is where the leverage is for most teams, because most teams detect problems faster than they can fix them.
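If your incident records carry three timestamps (when the problem started, when it was detected, and when it was resolved), both numbers fall out directly. A quick sketch with an invented record format:

```python
from datetime import datetime

# Each record: (problem_started, detected, resolved). Timestamps are invented.
incidents = [
    (datetime(2022, 5, 3, 2, 14),  datetime(2022, 5, 3, 2, 47),  datetime(2022, 5, 3, 3, 25)),
    (datetime(2022, 5, 19, 14, 2), datetime(2022, 5, 19, 14, 6), datetime(2022, 5, 19, 14, 12)),
]

mttd = sum((detected - started).total_seconds()
           for started, detected, _ in incidents) / len(incidents)
mttr = sum((resolved - detected).total_seconds()
           for _, detected, resolved in incidents) / len(incidents)

print(f"MTTD: {mttd / 60:.0f} minutes")  # a monitoring question
print(f"MTTR: {mttr / 60:.0f} minutes")  # a response-capability question
```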

The Short Version

Continuous monitoring is not a product. It's an operating discipline that happens to use products. The teams that practice it well share three habits: they ruthlessly prune their alert catalogs so that every page means something, they monitor the user experience as aggressively as they monitor the infrastructure, and they invest in the response capability as much as in the detection capability.

Buy the tool. Then do the work. The tool is the easy part.
