
Cloud Monitoring: Three Outcomes That Justify the Tooling Bill

Datadog, New Relic, and Dynatrace aren't cheap. Here are the three outcomes that justify the spend — and what to do if you can't justify it.

John Lane 2023-10-24 5 min read

The first time a mid-market CFO sees the Datadog bill, they ask why. The honest answer is that a good monitoring stack — whether it's Datadog, New Relic, Dynatrace, Grafana Cloud, or a self-hosted Prometheus-plus-Loki setup — does three specific things that a bad one can't, and those three things are the difference between a team that ships confidently and a team that lives in fear of production.

Here are the three. If your monitoring isn't delivering all three, you're either using the wrong tool or using the right tool wrong.

1. Mean time to detection measured in seconds, not hours

The single biggest benefit of real cloud monitoring is that you find out about problems before your customers do. That sounds obvious, and every vendor claims it, but most organizations don't actually have it. We have walked into customer environments where the first signal of a production outage is a support ticket from a human being noticing a broken page. By the time the support team has escalated, the problem has been running for 40 minutes.

The difference between a 40-minute MTTD and a 40-second MTTD is not the monitoring tool — it's what you do with it. A proper setup means:

  • Every user-facing endpoint has a synthetic check running from an external location every minute.
  • Every critical service publishes a handful of SLI metrics (latency, error rate, saturation) with burn-rate alerts.
  • The alerts go to a pager, not an email inbox, and the pager rotation actually exists.
  • Noisy alerts get tuned or deleted within a week, so the signal stays trustworthy.

When all of that is in place, you detect most outages within a minute of them starting. When any of it is missing, you detect outages when customers complain. The tooling bill is paying for the discipline as much as the software.
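To make that concrete, here is a minimal sketch of the synthetic-check-plus-pager piece in Python. It assumes the requests library and a PagerDuty Events API v2 routing key; the check URL and routing key are placeholders, and most teams will use a hosted probe rather than a hand-rolled script, but the shape is the same: probe from outside, page a human on failure.

    # Minimal external synthetic check: probe an endpoint, page a human on failure.
    # Assumes the `requests` library and a PagerDuty Events API v2 routing key.
    # CHECK_URL and the routing key are placeholders. Run this every minute from
    # somewhere outside your own infrastructure (a cheap VM, a scheduled CI job).
    import requests

    CHECK_URL = "https://app.example.com/healthz"      # placeholder endpoint
    PAGERDUTY_ROUTING_KEY = "YOUR_ROUTING_KEY_HERE"    # placeholder key

    def page(summary: str) -> None:
        """Send a triggering event to PagerDuty so it reaches the on-call rotation."""
        requests.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": PAGERDUTY_ROUTING_KEY,
                "event_action": "trigger",
                "payload": {
                    "summary": summary,
                    "source": "synthetic-check",
                    "severity": "critical",
                },
            },
            timeout=10,
        )

    def run_check() -> None:
        try:
            resp = requests.get(CHECK_URL, timeout=5)
            if resp.status_code != 200:
                page(f"Synthetic check: {CHECK_URL} returned {resp.status_code}")
        except requests.RequestException as exc:
            page(f"Synthetic check: {CHECK_URL} unreachable ({exc})")

    if __name__ == "__main__":
        run_check()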

2. Actual debugging data when production breaks

The second outcome is that when something does break, you have the data to figure out what happened. This is where most roll-your-own monitoring setups fall apart. Teams collect metrics and dashboards, but when an incident hits they realize they don't have traces, their logs don't correlate to the request, and their error rates tell them something is wrong but not what.

A production-grade monitoring stack means distributed tracing that actually traces (not just the first three hops), logs that are joinable to traces by request ID, metrics that let you drill from "error rate went up" to "which service, which endpoint, which version" in under a minute, and retention long enough to investigate something that started two days ago and only got noticed yesterday.
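The "logs joinable to traces" piece is mostly plumbing, and it is the plumbing teams skip. Here is a minimal sketch, assuming the opentelemetry-api package, an OpenTelemetry SDK configured elsewhere in the service, and Python's standard logging module; the logger and span names are illustrative.

    # Attach the current trace ID to every log line so logs and traces join on one key.
    # Assumes opentelemetry-api is installed and an OpenTelemetry SDK is configured
    # elsewhere in the service; without an SDK the trace ID falls back to "-".
    import logging

    from opentelemetry import trace

    class TraceIdFilter(logging.Filter):
        """Stamp each log record with the OpenTelemetry trace ID of the active span."""
        def filter(self, record: logging.LogRecord) -> bool:
            ctx = trace.get_current_span().get_span_context()
            record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
            return True

    logging.basicConfig(format="%(asctime)s %(levelname)s trace=%(trace_id)s %(message)s")
    logger = logging.getLogger("orders")          # illustrative logger name
    logger.addFilter(TraceIdFilter())

    tracer = trace.get_tracer(__name__)

    def handle_order(order_id: str) -> None:
        # The span and the log line now share a trace ID, so "show me the logs for
        # this slow trace" becomes a single query instead of a guessing game.
        with tracer.start_as_current_span("handle_order"):
            logger.warning("payment retry for order %s", order_id)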

The rule of thumb we use: if your on-call engineer cannot go from a pager alert to a root cause hypothesis in 15 minutes without asking another engineer for help, your observability isn't where it needs to be. The tool probably isn't the problem — the instrumentation is. Most teams under-instrument their applications and then blame the monitoring tool for not knowing what they didn't tell it.

If you're using a paid monitoring product and it still can't answer "why is this request slow," either your instrumentation is missing or you're not using the product's features. We've seen both. Pay for the tool, then pay for the week of engineering time to wire it up properly.

3. A feedback loop that makes the next change safer

This is the outcome nobody talks about in the vendor pitch, but it's the one that makes the biggest long-run difference. Good monitoring turns deploys from "cross your fingers" into "watch the graph." Every deploy gets compared, metric by metric, against the previous version. Every feature-flag rollout is tracked against user experience. Every config change shows up in the same dashboard as the services it affects.
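Here is a sketch of what the deploy half of that loop can look like as a post-deploy check against Prometheus. It assumes an http_requests_total counter with status and version labels and a reachable Prometheus server; the address, metric, and label names are placeholders for whatever your services actually expose.

    # Post-deploy check: compare the new version's error rate against the old one
    # before calling the rollout healthy. The Prometheus address, metric name, and
    # labels below are placeholders; substitute whatever your services export.
    import requests

    PROM_URL = "http://prometheus.internal:9090"   # placeholder address

    def error_ratio(version: str) -> float:
        """5xx ratio over the last 10 minutes for a single deployed version."""
        query = (
            f'sum(rate(http_requests_total{{status=~"5..",version="{version}"}}[10m]))'
            f' / sum(rate(http_requests_total{{version="{version}"}}[10m]))'
        )
        resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
        result = resp.json()["data"]["result"]
        return float(result[0]["value"][1]) if result else 0.0

    def deploy_is_healthy(new: str, old: str, tolerance: float = 2.0) -> bool:
        """Flag the rollout if the new version errors noticeably more than the old one."""
        return error_ratio(new) <= max(error_ratio(old) * tolerance, 0.001)

    if __name__ == "__main__":
        if deploy_is_healthy(new="v2.14.0", old="v2.13.2"):
            print("rollout looks healthy")
        else:
            print("rollout degraded, consider rolling back")

Run something like this as the last step of the deploy job and a bad release flags itself within minutes instead of waiting for the next incident review.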

Teams that have this feedback loop in place deploy more often, roll back faster, and experiment more confidently. Teams that don't have it tend to batch changes into large Friday-afternoon deploys, which is the worst possible time for something to go wrong.

The benefit is not the monitoring itself — it's the culture change that good monitoring enables. A team that can see the effects of its deploys in real time becomes a team that isn't afraid to deploy.

What this costs, honestly

A reasonable monitoring budget for a mid-market production environment is somewhere between 3 and 7 percent of the underlying infrastructure cost. If you're spending $50k per month on cloud compute, expect to spend $1,500 to $3,500 per month on monitoring to cover it properly. That number goes up if you're running high-cardinality metrics, long log retention, or distributed tracing at high volume. It goes down if you're disciplined about what you instrument.

The expensive vendors (Datadog, Dynatrace, New Relic) will easily cost 2x that if you're not careful. The cheaper routes (Grafana Cloud, self-hosted Prometheus/Loki/Tempo, Honeycomb for traces) can cut the bill in half at the cost of more engineering time. We run both models depending on customer preference. Neither is wrong.

What is wrong is paying for the expensive vendor and then still having 40-minute MTTD because nobody set up the alerts. We see that more often than we'd like.

What to do if you can't justify the bill

Sometimes a customer looks at the number and says "we don't have the budget for this." Fair enough. The minimum-viable version of cloud monitoring is:

  • Synthetic checks on your most important endpoints (self-hosted Uptime Kuma for roughly the cost of a small VM, or a hosted service like Better Stack at $20 to $50/month).
  • Basic infrastructure metrics from the cloud provider's native tooling (CloudWatch, Azure Monitor, or Google Cloud Monitoring, formerly Stackdriver).
  • Application logs going somewhere queryable (even an S3 bucket with Athena is better than nothing).
  • A PagerDuty account for at least one on-call engineer.

This costs a few hundred dollars a month and will get you to the "we know when something is broken" baseline. It won't give you the root-cause speed of a full-featured platform, but it will keep you from learning about outages from customer tweets.
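For the logs bullet in particular, the S3-plus-Athena fallback looks roughly like this, assuming boto3, an Athena table named app_logs defined over the log bucket, a logs database, and a results bucket; every name in the snippet, including the column names, is a placeholder for whatever you actually set up.

    # Bare-minimum log querying: find recent 5xx responses in application logs
    # sitting in S3, with no log platform at all. Table, database, bucket, and
    # column names below are placeholders.
    import boto3

    athena = boto3.client("athena")

    QUERY = """
    SELECT request_id, path, status, message
    FROM app_logs
    WHERE status >= 500
      AND from_iso8601_timestamp(event_time) > current_timestamp - interval '1' hour
    ORDER BY event_time DESC
    LIMIT 100
    """

    execution = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "logs"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print("query started:", execution["QueryExecutionId"])
    # Poll get_query_execution / get_query_results for the output. It is slow and
    # clunky next to a real log platform, but it answers the question.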

Three Takeaways

  1. The value of monitoring is measured in MTTD and MTTR, not in how many dashboards you have. If the graph exists but nobody looks at it, it isn't monitoring.
  2. Instrumentation is the expensive part. The tool is the cheap part. Budget engineering time to wire things up properly.
  3. Monitoring enables deploy confidence. The cultural shift toward frequent, safe deploys is the long-run ROI that rarely gets quoted.

Talk with us about your infrastructure

Schedule a consultation with a solutions architect.
