
Cloud Network Performance: Tools That Find the Real Bottleneck

When a cloud app feels slow, the problem is almost never where the dashboard points — here's the toolkit we use to find the actual bottleneck.

John Lane 2023-03-04 6 min read

Every cloud performance issue starts the same way. A user says "it feels slow," a developer says "the database is fine," the ops team says "the VMs are fine," and everybody shrugs at the SaaS APM dashboard. Then we get the call. In most of these engagements the real problem is the network, and the real problem inside the network is something nobody was monitoring.

Here are the six tools we reach for first when chasing cloud network performance problems, and what each one is actually good at.

1. VPC Flow Logs (and Their Cousins)

Start here. AWS VPC Flow Logs, Azure NSG Flow Logs, and GCP VPC Flow Logs all record flow metadata at the subnet or interface level. What they do not do, by default, is get sent anywhere useful. Turn them on, export them to S3 or Blob Storage, and then load them into an analytics tool — Athena, Log Analytics, or BigQuery all work fine.

The single most valuable query you can run is "top talkers by byte count, grouped by source-destination pair, over the last hour." In most investigations this surfaces something unexpected — a runaway backup job, an ML training run pulling from a remote bucket, or a misconfigured replication loop. It is shocking how often the answer is sitting in flow logs that nobody had queried.
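
To make that concrete, here is roughly what the top-talkers query looks like when the logs land in S3 and get queried through Athena with boto3. The table name, database, and results bucket are placeholders for whatever your flow-log pipeline uses:

```python
# Sketch: top talkers by byte count over the last hour, via Athena.
# Assumes VPC Flow Logs are already registered as an Athena table
# named vpc_flow_logs in a database named network_logs (placeholders).
import boto3

TOP_TALKERS_SQL = """
SELECT srcaddr, dstaddr, sum(bytes) AS total_bytes
FROM vpc_flow_logs
WHERE "start" >= to_unixtime(now() - interval '1' hour)
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20
"""

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString=TOP_TALKERS_SQL,
    QueryExecutionContext={"Database": "network_logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
print("query execution id:", resp["QueryExecutionId"])
```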

Flow logs also tell you what was rejected, which is how you find the NSG rule that is silently blocking half your traffic during a failover event.

2. Network Performance Monitor / Reachability Analyzer

The hyperscalers all offer some flavor of a "can A reach B?" tool. AWS calls it Reachability Analyzer. Azure has Connection Monitor. GCP has Connectivity Tests. They trace the path through security groups, route tables, and peering connections, and they tell you exactly which hop is dropping the packet.

We use these to debug the classic "I added a new subnet and now nothing works" problem. It used to take 20 minutes of reading route tables and NACLs by hand. Now it takes 30 seconds of clicking. Use them. Most teams forget they exist.
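
As a hedged sketch of the AWS flavor: with boto3 you define a path between two resources and ask Reachability Analyzer to evaluate it. The instance IDs below are placeholders, and the analysis runs asynchronously, so you poll for the verdict:

```python
# Sketch: ask Reachability Analyzer whether one instance can reach
# another on TCP 443. All resource IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")

path = ec2.create_network_insights_path(
    Source="i-0123456789abcdef0",       # source instance (placeholder)
    Destination="i-0fedcba9876543210",  # destination instance (placeholder)
    Protocol="tcp",
    DestinationPort=443,
)

analysis = ec2.start_network_insights_analysis(
    NetworkInsightsPathId=path["NetworkInsightsPath"]["NetworkInsightsPathId"]
)
# Status starts as "running"; describe_network_insights_analyses returns
# the reachability result and the exact blocking component once finished.
print(analysis["NetworkInsightsAnalysis"]["Status"])
```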

The limit of synthetic reachability

These tools test configuration, not actual packet flow. They will happily tell you that A can reach B when a stateful firewall is about to drop the session because of a conntrack issue. For live troubleshooting you still need the next tool.

3. Packet Capture at the Cloud Edge

Both AWS (VPC Traffic Mirroring) and Azure (Network Watcher Packet Capture) let you capture traffic from a cloud interface for analysis — AWS by mirroring it to an analyzer, Azure by capturing directly on the VM. This is the cloud equivalent of a physical tap on a switch port. You start a capture, reproduce the problem, and then stare at the pcap in Wireshark.
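
On the AWS side, starting a mirror session is a single API call once a mirror target and filter exist. A minimal sketch, with every resource ID a placeholder:

```python
# Sketch: mirror an ENI's traffic to an existing Traffic Mirror target.
# The target (an ENI or NLB in front of your analyzer) and the filter
# must already exist; all IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2")

session = ec2.create_traffic_mirror_session(
    NetworkInterfaceId="eni-0123456789abcdef0",     # the interface to tap
    TrafficMirrorTargetId="tmt-0123456789abcdef0",  # where mirrored packets go
    TrafficMirrorFilterId="tmf-0123456789abcdef0",  # which traffic to mirror
    SessionNumber=1,  # priority among sessions on this ENI
)
print(session["TrafficMirrorSession"]["TrafficMirrorSessionId"])
```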

We use this sparingly because it is expensive and noisy, but it is the only tool that gives you the truth when the truth is weird. Recent example: a customer had an API that worked fine for most clients but randomly timed out for one specific partner. Flow logs showed the traffic arriving. The application logs showed nothing. A ten-minute packet capture showed the TCP window dropping to zero and the client retransmitting until the session died. Root cause was an MTU mismatch at the partner's edge firewall. Nothing else would have caught that.

4. Synthetic Probes: Consistent, Cheap, Boring

Synthetic monitoring is the unsexy hero of cloud networking. The whole setup is a small script that runs every minute from three or four vantage points (inside the VPC, on-prem, a residential edge, and a competing cloud), measures round-trip time and HTTP response time to your key endpoints, and feeds the results into a time-series database.
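
The core of such a probe fits in a dozen lines of Python. A throwaway sketch, assuming only the `requests` library, with the endpoint list as a placeholder:

```python
# Sketch: probe key endpoints once a minute and print timing results.
# In production you would feed these numbers into a time-series
# database rather than printing them.
import time
import requests

ENDPOINTS = ["https://example.com/healthz"]  # placeholder endpoints

while True:
    for url in ENDPOINTS:
        started = time.monotonic()
        try:
            resp = requests.get(url, timeout=10)
            elapsed_ms = (time.monotonic() - started) * 1000
            print(f"{url} status={resp.status_code} time_ms={elapsed_ms:.1f}")
        except requests.RequestException as exc:
            print(f"{url} error={exc}")
    time.sleep(60)  # matches the once-a-minute cadence above
```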

We use a combination of Prometheus blackbox exporter and a couple of cheap Hetzner VMs in different regions. Total cost is under $15 per month. The value is enormous — when a real user complains that the site is slow, we can look at the probe data and see whether the problem is (a) real and widespread, (b) real but localized to the user's ISP, or (c) not real and probably a client-side issue. That triage alone saves hours per incident.

5. Cloud-Native APM with a Trace View

Datadog, New Relic, Dynatrace, and the various OpenTelemetry-based open-source equivalents all offer distributed tracing. A good trace view shows you where time is spent in a request — application code, database, cache, network hops to dependencies. Most performance problems we chase turn out to be one external API call that takes 800 ms sitting in the middle of a chain of calls that should total 50 ms.

OpenTelemetry is the standard we recommend adopting as soon as possible. It is vendor-neutral, the instrumentation libraries are mature, and you can route traces to any backend. Starting OTel in a greenfield project is easy. Retrofitting it is painful but worth it.
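
Getting a first trace out of OTel is genuinely a few lines. A minimal Python sketch using the opentelemetry-sdk package, printing spans to the console — in practice you would swap the console exporter for an OTLP exporter pointed at your backend. The service and span names here are illustrative:

```python
# Sketch: minimal OpenTelemetry tracing setup in Python.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("payments-service")  # illustrative service name

with tracer.start_as_current_span("call-partner-api") as span:
    span.set_attribute("peer.service", "partner-api")
    # ... the 800 ms outbound call you are hunting goes here ...
```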

What the trace won't show you

The trace view will tell you "this database call took 400 ms" but it will not tell you whether the 400 ms was query time, network latency, connection establishment, or TLS handshake. For that breakdown you need either a DB-aware APM agent or the next tool.

6. eBPF and Host-Level Visibility

eBPF tools — Pixie, Cilium Hubble, Inspektor Gadget, and the classic bcc-tools — run in the kernel of your cloud VMs and observe network activity at a level below what flow logs can see. They show you things like "which process opened this TCP connection, how long did the handshake take, and what was the retransmit count." On Kubernetes this is particularly useful because it sees into pod-to-pod traffic that is invisible at the VPC level.
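
To give a flavor of the bcc-tools approach: the sketch below (modeled loosely on bcc's classic tcpconnect) attaches a kprobe to the kernel's IPv4 TCP connect path and prints the calling process for every outbound connection. It assumes the bcc Python bindings are installed and requires root:

```python
# Sketch: trace outbound IPv4 TCP connects with bcc. Requires the
# bcc Python bindings and root privileges.
from bcc import BPF

prog = r"""
#include <net/sock.h>

int trace_connect(struct pt_regs *ctx, struct sock *sk) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    char comm[16];
    bpf_get_current_comm(&comm, sizeof(comm));
    bpf_trace_printk("tcp connect pid=%d comm=%s\n", pid, comm);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_v4_connect", fn_name="trace_connect")
print("tracing tcp_v4_connect... Ctrl-C to stop")
b.trace_print()  # streams the kernel trace pipe to stdout
```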

We use Cilium with Hubble on most of the Kubernetes clusters we manage. The observability alone is worth the switch from Flannel or Calico, and the policy model is better besides. Pixie is a nice lighter-weight option if you just want visibility without rewiring your CNI.

The Investigation Playbook

When a performance issue comes in, here is the order we work through these tools:

  1. Synthetic probes. Is this real? Is it widespread? When did it start?
  2. APM / tracing. Where is the time going in the request lifecycle? Which component is slow?
  3. Flow logs. Is there unexpected traffic volume? Are there rejected packets? Are the top talkers what we expect?
  4. Reachability analyzer. Did a configuration change silently break a path?
  5. eBPF / host tools. What is the process, socket, and kernel view?
  6. Packet capture. When all else fails, get the truth off the wire.

Going in this order saves hours. The common mistake is to jump to packet capture immediately because it feels thorough. In practice you end up with a 4 GB pcap that nobody has time to analyze when the answer was visible in flow logs in 30 seconds.

The Bill You Should Budget For

Cloud observability has a real cost. A realistic budget for a mid-market customer running a production workload:

  • Flow logs to S3 or Blob: $20 to $200 per month depending on volume.
  • APM with tracing (OTel backend or SaaS): $200 to $2,000 per month.
  • Synthetic probes: $10 to $100 per month.
  • eBPF tooling: free (open source) plus the compute overhead, which is small.
  • Packet captures: pennies per incident.

The total usually lands between $300 and $2,500 per month. That is cheap compared to one prolonged outage, and every one of those tools earns its cost in the first year.

The best time to instrument is before you need it. The second-best time is during the outage. Don't wait for the second time.
