
Monitoring Stack Choices for Hybrid Infrastructure

An honest comparison of the monitoring and observability tools we actually deploy — and the ones we have migrated customers off of — for hybrid environments spanning on-prem, cloud, and SaaS.

John Lane 2022-08-24 8 min read

"Which monitoring tool should we buy?" is the wrong question. The right questions are: what are you trying to observe, who is going to use the data, and what is the total cost of ownership once you include the person-hours to run the thing? A monitoring stack that nobody checks is a line item, not a capability. We have deployed, operated, and ripped out most of the major tools in this space over the years. Here is the honest comparison, scoped to hybrid environments — not pure cloud-native shops where the answer is usually simpler.

What a Monitoring Stack Has to Do

Before the tool comparison, the capability checklist. A real monitoring stack for a hybrid environment has to cover six things, and most tools only cover four of them well. If you are buying one tool and expecting all six, you are buying disappointment.

  1. Infrastructure metrics. CPU, memory, disk, network, power, temperature, interface counters. On servers, on switches, on PDUs, on UPS units, on chillers. Time-series data at 30-to-60-second resolution with 13 months of retention for trending.
  2. Application performance. Request rates, error rates, latency distributions — the RED metrics — with the ability to trace a slow request across service boundaries.
  3. Logs. Structured where possible, unstructured where forced, searchable within seconds, retainable for however long your compliance team says.
  4. Synthetic checks. External probes that prove the customer-facing experience actually works from the network viewpoint your users have, not from inside your own data center.
  5. Events and alerts. Routing, deduplication, escalation, and the ability to silence noise during known maintenance without losing the signal underneath.
  6. SLOs and reporting. Business-facing availability and performance numbers that a non-engineer can understand.

A mature stack usually combines two or three specialized tools behind a unified alerting and dashboarding layer. Single-vendor stacks exist — Datadog can cover almost everything on one bill — but the bill is the tradeoff.

The Tools We Actually Use

Prometheus plus Grafana plus Alertmanager

The open-source trio is the default starting point for infrastructure and application metrics in almost every customer environment we build. It is not the flashiest choice, but it is the one we trust most for metrics work.

Strengths: Enormous exporter ecosystem — there is a Prometheus exporter for essentially every database, message broker, hardware vendor, and cloud service you can name. Query language (PromQL) is genuinely powerful for the RED method and for custom recording rules. Grafana's dashboards are industry standard and the team you hire next month will already know them. Alertmanager handles routing, dedup, and silencing well once you invest in the configuration.
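As a sketch of what the RED method looks like in PromQL, assuming a service that exports a conventional `http_requests_total` counter and `http_request_duration_seconds` histogram (the metric and label names are illustrative, not prescriptive):

```promql
# Rate: requests per second per service over a 5-minute window
sum by (service) (rate(http_requests_total[5m]))

# Errors: 5xx responses as a fraction of total traffic
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))

# Duration: 99th-percentile latency reconstructed from histogram buckets
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```

Queries like these are also the natural candidates for recording rules, so dashboards query the precomputed series instead of re-aggregating raw counters on every refresh.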

Weaknesses: Long-term storage requires a second decision — Thanos, Cortex, Mimir, or VictoriaMetrics. Each has tradeoffs and each is more complexity to run. The scrape-based model is not a natural fit for ephemeral cloud workloads; you need the Prometheus Operator in Kubernetes and some plumbing for Lambda-style functions. Logs are not its job.

When we deploy it: Almost always, as the metrics backbone. We pair it with VictoriaMetrics for long-term storage in larger environments because the operational overhead is lower than Thanos.
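The Prometheus-to-VictoriaMetrics wiring is a single `remote_write` stanza. A minimal sketch, assuming a single-node VictoriaMetrics instance at a hypothetical internal address:

```yaml
# prometheus.yml — ship samples to VictoriaMetrics for long-term storage
remote_write:
  - url: http://victoriametrics.internal:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000   # batch size; tune to your ingest volume
```

Prometheus keeps a short local retention window for fast queries and alert evaluation; Grafana points at VictoriaMetrics for the 13-month trending data.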

Grafana Loki for logs

Loki is the log backend we default to when we already have Grafana in place and the log volume is modest — say under a few hundred GB per day. It is cheaper to run than Elasticsearch and the LogQL query language is a natural extension of PromQL.

Strengths: Cheap to store at volume because it indexes labels, not log content. Integrates cleanly with Grafana. Operations team already knows the query language if they know PromQL.

Weaknesses: The label-only indexing model means full-text search across arbitrary fields is slow compared to Elasticsearch. If your use case is "audit team needs to search a year of logs for a specific user ID across 20 apps," Loki is not the right tool. Use an indexed backend.
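The label-first model shapes how you write queries. A LogQL sketch, assuming logs are shipped with `app` and `env` labels (label names are an assumption about your ingestion pipeline):

```logql
# Fast: indexed label matchers narrow the streams before any content scan
{app="checkout", env="prod"} |= "timeout"

# Slow at scale: no selective labels, so every stream gets scanned for the string
{env=~".+"} |= "user-12345"
```

The second query is exactly the audit-team shape described above, which is why it belongs on an indexed backend instead.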

Elasticsearch or OpenSearch for heavy log workloads

When the log volume crosses the terabyte-per-day line or the search patterns are genuinely full-text, we deploy OpenSearch. The storage cost is real and the operational overhead is real, but nothing else matches it for interactive log search at scale.

Strengths: Industry-standard search performance. Huge ecosystem. The tool most SOC analysts already know.

Weaknesses: Expensive at scale. Requires real ops discipline — hot-warm-cold tiering, index lifecycle management, shard sizing, and regular curator runs. The gap between "it works" and "it is well run" is a full-time job.
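Index lifecycle management is the discipline that keeps the bill bounded. A sketch of an Elasticsearch ILM policy with hot-warm-delete tiering (OpenSearch's ISM is analogous but uses different field names; the sizes and ages here are illustrative, not recommendations):

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Shard sizing and rollover thresholds are where the "well run" work lives; a policy like this is the starting point, not the finish line.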

When we deploy it: When logs are a compliance or investigation tool as much as a debugging tool. SIEM workloads, audit workloads, anything where investigators need full-text search across long retention windows.

Datadog

Datadog is the monitoring tool customers love and CFOs hate. For a mid-sized shop that wants one pane of glass and does not want to run the infrastructure themselves, it is the best commercial option we have deployed. It covers five of the six capabilities on the list above in one product and it is operationally smooth.

Strengths: Excellent coverage out of the box — metrics, APM, logs, synthetics, RUM, incident response. Integration library is vast. UI is the best in the category. Alert routing and on-call integration is mature.

Weaknesses: Cost. Datadog bills by host, by log GB ingested, by metric cardinality, by APM span, and by synthetic test. A cardinality mistake — tagging metrics with a user ID, for example — can 10x your monthly bill in a week. We have migrated several customers off Datadog for cost reasons; none because it stopped working. If you go with Datadog, appoint someone who owns the cost every month.
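The cardinality trap is easy to quantify. A back-of-envelope sketch in Python, with made-up but plausible numbers, showing why one user-ID tag turns a cheap metric into an expensive one:

```python
def series_count(metric_names: int, tag_cardinalities: list[int]) -> int:
    """Distinct time series = metrics x product of each tag's distinct values."""
    total = metric_names
    for cardinality in tag_cardinalities:
        total *= cardinality
    return total

# One latency metric tagged by host (300 values) and endpoint (50 values):
before = series_count(1, [300, 50])            # 15,000 series: cheap
# The same metric with a user_id tag (100,000 active users) added:
after = series_count(1, [300, 50, 100_000])    # 1.5 billion series: a billed disaster

print(before, after)
```

The multiplication is the whole story: every new tag multiplies the series count by its number of distinct values, and custom-metric pricing follows the series count.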

When we deploy it: Mid-market customers with 100 to 500 hosts, no dedicated observability team, and a CFO who prefers a predictable vendor bill over hiring two SREs.

Zabbix and LibreNMS

For the older, infrastructure-heavy side of a hybrid environment — the switches, the UPS units, the PDUs, the CRAH units, the generators — Zabbix and LibreNMS remain the right tools more often than they get credit for. Prometheus does not love SNMP, and the workarounds (snmp_exporter with generator files) are miserable compared to Zabbix's native handling.

Strengths: Mature SNMP support. Built-in alerting. Low cost. LibreNMS in particular has a strong auto-discovery story for mixed-vendor network gear.

Weaknesses: The UI is dated and the query language is nowhere near PromQL. These are background tools: they sit quietly, do their job, and will never be the front door of your observability practice.

When we deploy them: For customers who have a real data center facility and need facility telemetry integrated with IT telemetry. We run Zabbix alongside Prometheus and feed both into Grafana as data sources so the operations team has one dashboard with everything on it.
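Feeding both into Grafana comes down to data-source provisioning. A sketch, assuming the community Zabbix plugin and hypothetical internal URLs:

```yaml
# /etc/grafana/provisioning/datasources/hybrid.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus.internal:9090
  - name: Zabbix
    type: alexanderzobnin-zabbix-datasource   # community Zabbix plugin for Grafana
    url: http://zabbix.internal/api_jsonrpc.php
```

With both sources provisioned, a single dashboard can put rack temperature from Zabbix next to request latency from Prometheus, which is the whole point for a facility-plus-IT operations team.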

OpenTelemetry

Not a tool, a standard, but too important to leave out. OTel is the vendor-neutral instrumentation layer we ask customers to adopt before we deploy any APM product. Instrument once, ship to whichever backend you end up buying or replacing. Migrating from Datadog APM to Grafana Tempo to Honeycomb used to mean re-instrumenting every service. With OTel, you change a collector config.

If you are starting observability instrumentation in 2022, start with OpenTelemetry. Do not let any vendor lock you into their proprietary agent if you can avoid it.
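The "change a collector config" claim looks like this in practice. A sketch of an OpenTelemetry Collector pipeline where switching trace backends means editing the `exporters` section, not re-instrumenting applications (endpoints are placeholders):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  # Today: ship traces to a self-hosted Tempo
  otlp/tempo:
    endpoint: tempo.internal:4317
    tls:
      insecure: true   # plaintext inside the private network (assumption)
  # Tomorrow: swap in a vendor backend by changing this block only
  # otlp/vendor:
  #   endpoint: vendor-ingest.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
```

Every instrumented service keeps sending OTLP to the collector; only the collector's outbound side changes when the backend does.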

The Tools We Have Migrated Customers Off

A shorter list. Not because these products are bad, but because the fit was wrong for what the customer actually needed.

  • SolarWinds Orion — heavy, expensive, and the 2020 supply chain incident left a lasting taste. Customers on Orion often replace it with a combination of LibreNMS for network monitoring and Prometheus for servers.
  • Nagios (classic) — the original workhorse, but the configuration model and alerting story have been surpassed by everything else on this list. Nagios XI and the commercial forks are fine if you already run them. We do not pick them for new deployments.
  • New Relic (historical) — pricing and data model changes pushed us to Datadog or the open-source stack. We have seen more migrations away from New Relic than toward it in the last three years. The product is not bad; the contract experience was.

What We'd Actually Do

For a new customer build today with a hybrid footprint — one data center, one or two cloud regions, around 300 hosts, 50 network devices, facility telemetry — the stack we deploy most often is:

  • Prometheus plus VictoriaMetrics for metrics, with node_exporter on hosts, service-specific exporters, and Grafana Agent where the workload is ephemeral.
  • Zabbix or LibreNMS for SNMP-heavy devices — network gear, PDUs, UPS, CRAH — fed into the same Grafana.
  • Loki for logs if volume is under about 500 GB/day, OpenSearch if it is higher or if compliance requires full-text search.
  • Grafana Tempo or commercial APM for traces, with OpenTelemetry instrumentation so the decision is reversible.
  • Grafana as the single dashboarding layer with all of the above as data sources.
  • Alertmanager for routing, with integration to the customer's on-call tool of choice — PagerDuty, Opsgenie, or a self-hosted alternative.
  • Synthetic checks from an external vantage point — we tend to use self-hosted Uptime Kuma for small shops and Checkly or another commercial option for serious synthetic monitoring needs.
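The routing-and-silencing layer in the stack above is Alertmanager configuration. A minimal sketch, assuming PagerDuty for paging and a catch-all webhook into a ticket queue (receiver names, URLs, and the integration key are placeholders):

```yaml
route:
  receiver: ticket-queue            # default: low-urgency alerts file a ticket
  group_by: [alertname, cluster]
  group_wait: 30s                   # batch related alerts before notifying
  repeat_interval: 4h
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall    # only critical severity pages a human

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: <your-pagerduty-integration-key>
  - name: ticket-queue
    webhook_configs:
      - url: http://ticketing.internal/hooks/alerts
```

Maintenance-window silences layer on top of this routing tree, which is what lets you mute known noise without losing the underlying signal.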

For customers who do not want to operate any of this and are willing to pay for a fully managed experience, Datadog with cost guardrails from day one is the honest recommendation.

Three Takeaways

  1. No single tool covers all six capabilities well. Assume a stack, not a product. Plan for two or three tools behind a single dashboarding and alerting layer.
  2. Cost is the hidden axis. Datadog bills by everything. Elasticsearch storage grows without bound. Prometheus cardinality explodes if you tag by user ID. Pick an owner for the monthly bill before you pick the tool.
  3. Instrument with OpenTelemetry. Whatever backend you pick today is not the backend you will have in five years. Do not let vendor agents be the reason you cannot migrate.
