Cloud Networking Without the Hand-Waving: A Practical Reference Architecture
A concrete reference architecture for hybrid cloud networking — hub-and-spoke, transit, private connectivity, DNS, and egress — written from the wiring closet side of the diagram, not the PowerPoint side.

Cloud networking diagrams have a consistent problem: they look clean in the slide deck and then fall apart the first time a real packet has to get from a database in Azure to a printer in a Kansas branch office. The pretty hub-and-spoke pictures on vendor blogs skip the parts that cost you sleep — DNS resolution, asymmetric routing, MTU, private endpoints versus service endpoints, transit cost, and the three different meanings of "private." This post is the reference architecture we actually deploy for mid-market customers with a mix of on-prem, one or two cloud regions, and branch sites. No hand-waving.
The Design Goals
Every design is a set of tradeoffs. Here are the goals that drive ours, in priority order:
- Deterministic routing. Every source-destination pair has exactly one expected path, and every path is documented (see the sketch after this list). Asymmetric routes are the single most common cause of intermittent cloud networking problems, and they come from sloppy design.
- Private means private. When we say traffic is private, we mean it never traverses the public internet and never hits a public IP on either end. "Private" in cloud vendor documentation sometimes means "has a private IP but routes out through a NAT gateway." That is not private. Be precise.
- DNS is a first-class part of the design. More hybrid networking failures trace to DNS than to routing. Plan DNS before you plan subnets.
- Egress is a cost center you design around. Cloud egress bills will surprise you if you do not design for them. Transit and inter-region traffic are worse.
- You can lose any single link or device without losing connectivity. HA is a requirement, not a feature.
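To make the first goal concrete: a routing plan can be a piece of data you audit, not a diagram you squint at. Here is a minimal Python sketch of that idea; the zone and hop names are hypothetical placeholders, not output from any tool.

```python
# One documented path per source-destination pair, expressed as data.
# Zone and hop names below are hypothetical.
from itertools import permutations

ZONES = ["onprem-dc", "azure-hub", "azure-spoke-app", "branch-ks"]

EXPECTED_PATHS = {
    ("onprem-dc", "azure-spoke-app"):
        ["onprem-core", "expressroute", "azure-hub", "azure-spoke-app"],
    ("azure-spoke-app", "onprem-dc"):
        ["azure-spoke-app", "azure-hub", "expressroute", "onprem-core"],
    # ...one entry for every remaining pair...
}

def audit(zones, paths):
    """Print every source-destination pair with no documented path."""
    for src, dst in permutations(zones, 2):
        if (src, dst) not in paths:
            print(f"UNDOCUMENTED: {src} -> {dst}")

audit(ZONES, EXPECTED_PATHS)
```

If the audit prints anything, the design is not done. The quarterly review mentioned at the end of this post is exactly this check, run against reality.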
The Reference Topology
For a customer with a primary data center, one or two cloud regions, and a handful of branches, the topology we deploy more often than not looks like this:
On-prem core: A pair of routers — typically Cisco, Juniper, or Arista depending on the shop — running BGP internally. Everything else on-prem learns routes from these via OSPF or static routes, with BFD for fast failure detection.
Cloud hub: In Azure, an Azure Virtual WAN hub or a self-managed hub VNet running a pair of NVA firewalls. In AWS, a Transit Gateway with inspection VPCs. In GCP, a Network Connectivity Center hub. The hub is the one place traffic transits between zones of trust, and it is where inspection, logging, and east-west policy enforcement happen.
Spokes: Application VNets or VPCs, each peered to the hub, never directly to each other. Spoke-to-spoke traffic transits the hub. This is the rule that keeps routing deterministic and prevents the peering mesh that nobody can troubleshoot two years later (a checkable version of this rule follows the topology list).
Site-to-site: ExpressRoute (Azure) or Direct Connect (AWS) as primary, with a backup IPsec tunnel over the internet. Both legs run BGP. The private circuit carries the bulk of the traffic; the tunnel is warm standby with BFD and a route-preference weighting that fails over automatically. Two providers are better than one if the budget allows.
Branches: SD-WAN to the cloud hub, or a simpler IPsec tunnel home to the core for smaller sites. The SD-WAN decision is about operational complexity and link diversity, not about raw capability — most modern routers can do what an SD-WAN appliance does, but the centralized management and automatic path selection are worth the license fee at five sites and above.
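The spoke rule is the kind of invariant worth checking mechanically rather than by eyeballing a diagram. A minimal sketch, assuming you can export your peering inventory as pairs (all names here are hypothetical):

```python
# Every peering must touch the hub; anything else is the mesh creeping back in.
PEERINGS = [
    ("hub-vnet", "spoke-app1"),
    ("hub-vnet", "spoke-app2"),
    ("hub-vnet", "spoke-data"),
    ("spoke-app1", "spoke-data"),  # the kind of shortcut that breaks determinism
]

HUB = "hub-vnet"

for a, b in PEERINGS:
    if HUB not in (a, b):
        print(f"VIOLATION: direct spoke-to-spoke peering {a} <-> {b}")
```

Run it against the real peering inventory on a schedule, because the shortcut peering always arrives with a good reason attached.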
The Three Meanings of "Private"
This is where most customers get surprised, so it is worth being precise about terminology.
Private as in private IP addressing. The workload has an address from 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16. This is the weakest meaning. A resource with a private IP can still have its traffic NATed out to the internet and back. RFC1918 addressing is a convention, not a security boundary.
Private as in private link. The cloud provider gives you a private endpoint — an IP in your own VNet that terminates the connection to a managed service (Azure SQL, S3, etc.). Traffic from your workload to the managed service never hits the public internet and never uses a public IP. This is what "private" should mean in a design document. Azure Private Link, AWS PrivateLink, GCP Private Service Connect.
Private as in on a private circuit. ExpressRoute, Direct Connect, Cloud Interconnect. Your traffic leaves your data center on a dedicated or semi-dedicated fiber pair through the carrier and arrives at the cloud provider without ever touching the public internet. This is the strongest form of private and the most expensive. It is also the form that matters most for compliance audits.
A mature hybrid design typically uses all three. Addressing is private throughout, managed service access is via private link, and the site-to-cloud transit is on a private circuit. If any one of those is missing, traffic somewhere is taking a public path and you probably do not realize it.
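The gap between the first meaning and the other two is easy to demonstrate. Python's standard-library ipaddress module will happily confirm that an address is private, and that is all it confirms; the addresses below are illustrative.

```python
import ipaddress

# is_private covers RFC1918 plus a few other reserved ranges.
for addr in ["10.4.2.7", "172.20.1.9", "192.168.50.3", "8.8.8.8"]:
    ip = ipaddress.ip_address(addr)
    print(f"{addr}: private addressing = {ip.is_private}")

# The first three print True, and all three could still be NATed straight out
# to the internet. The check says nothing about private link or a private
# circuit -- those live in the path, not in the address.
```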
DNS Is Where Hybrid Dies
If we had to pick one thing to pay obsessive attention to in a hybrid network design, it is DNS. Here is why: most cloud managed services resolve differently from inside the cloud provider than from outside. Azure SQL's private endpoint resolves to a private IP when you query from within the VNet and to a public IP when you query from on-prem — unless you configure a conditional forwarder or a private DNS zone correctly. Get it wrong and your on-prem application connects to the public endpoint, traffic exits through your egress, and your expensive private link does nothing.
The pattern we deploy (a code sketch of the forwarding logic follows the list):
- Cloud-side private DNS zones for each cloud-native service (privatelink.database.windows.net, privatelink.blob.core.windows.net, etc.), linked to the hub VNet and all spoke VNets.
- On-prem conditional forwarders for each of those zones pointing to a pair of cloud-resident DNS resolvers (Azure DNS Private Resolver, Route 53 Resolver endpoints). One in each AZ or region for redundancy.
- On-prem DNS remains authoritative for on-prem zones. The cloud resolvers have conditional forwarders back for anything under those zones.
- No split-horizon tricks unless you are absolutely forced into them. Split-horizon DNS in hybrid is debuggable on a whiteboard and miserable in a 3 AM incident.
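Written out as data, that forwarding pattern looks something like this. The zone names are the real Azure private-link zones from the list above; the resolver IPs are hypothetical placeholders for your environment.

```python
# Which resolver answers a given name, for a client at a given origin.
PRIVATELINK_ZONES = [
    "privatelink.database.windows.net",
    "privatelink.blob.core.windows.net",
]
ONPREM_ZONES = ["corp.example.com"]          # hypothetical internal zone

CLOUD_RESOLVERS = ["10.10.0.4", "10.10.1.4"]  # e.g. private resolver inbound endpoints
ONPREM_RESOLVERS = ["10.0.0.53", "10.0.1.53"]

def resolver_for(origin, name):
    """Return the resolvers that should answer `name` for a client at `origin`."""
    if origin == "onprem":
        if any(name.endswith(z) for z in PRIVATELINK_ZONES):
            return CLOUD_RESOLVERS      # conditional forwarder to the cloud
        return ONPREM_RESOLVERS         # on-prem stays authoritative
    if origin == "cloud":
        if any(name.endswith(z) for z in ONPREM_ZONES):
            return ONPREM_RESOLVERS     # conditional forwarder back
        return CLOUD_RESOLVERS          # private zones linked to the VNets
    raise ValueError(f"unknown origin: {origin}")

print(resolver_for("onprem", "mydb.privatelink.database.windows.net"))
```

If you cannot fill in a function like this for your design, the DNS plan is not finished.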
Test resolution from both sides of every trust boundary before you go to production. Actually test it, with dig or nslookup, from a VM in the spoke and a workstation on-prem. Do not trust that it works because the design document says it should.
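Here is a Python version of that test, for when you want it in a pipeline instead of a terminal. It assumes the dnspython package (pip install dnspython); the FQDN and resolver addresses are placeholders for your own.

```python
# Resolve the same name against both sides and verify the answer is private.
import ipaddress
import dns.resolver  # dnspython

FQDN = "mydb.database.windows.net"        # hypothetical private-endpoint name
RESOLVERS = ["10.10.0.4", "10.0.0.53"]    # cloud-side and on-prem-side

for server in RESOLVERS:
    res = dns.resolver.Resolver(configure=False)
    res.nameservers = [server]
    answer = res.resolve(FQDN, "A")       # follows the CNAME chain to the A record
    for rr in answer:
        ip = ipaddress.ip_address(rr.address)
        verdict = "private" if ip.is_private else "PUBLIC: private link bypassed"
        print(f"{server}: {FQDN} -> {rr.address} ({verdict})")
```

A public answer from either resolver means the expensive private link is decorative, which is exactly the failure mode described above.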
Routing — Symmetric or Suffer
Asymmetric routing is the single most common source of "it works sometimes" cloud networking bugs. A packet goes one way through a firewall; the reply comes back a different way and hits a firewall that has no state for the connection; the firewall drops it; the application times out. The application owner files a ticket; the network team cannot reproduce it; everyone is sad.
The defense is discipline in the route advertisements:
- One path per prefix. Each destination prefix has one preferred path, with backup paths advertised at a lower local-preference or with a longer AS path (see the sketch after this list).
- Consistent filtering. If you filter a prefix on one ExpressRoute circuit, filter it on the other. Do not let one side learn a prefix the other side does not.
- Symmetric NAT state. If traffic has to traverse a firewall in one direction, it has to traverse the same firewall in the other direction. Stretched active-active firewall clusters do not solve this by themselves — the flow needs to land on the same cluster member both ways.
- BFD everywhere. Fast failure detection on every BGP session means failover in under a second instead of waiting out the default hold timer, which is 90 or 180 seconds depending on the vendor.
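Here is the first rule reduced to a toy: a deliberately simplified best-path function that knows only the first two BGP tiebreakers, with illustrative values. Real best-path selection has more steps; the point is that the backup loses on advertised attributes, not on operator intervention.

```python
# Toy BGP path selection: highest local-preference wins, then shortest AS path.
from dataclasses import dataclass

@dataclass
class Path:
    via: str
    local_pref: int
    as_path_len: int

def best(paths):
    """Pick exactly one winner per prefix."""
    return max(paths, key=lambda p: (p.local_pref, -p.as_path_len))

routes = {
    "10.50.0.0/16": [
        Path(via="expressroute", local_pref=200, as_path_len=2),  # primary
        Path(via="ipsec-backup", local_pref=100, as_path_len=2),  # warm standby
    ],
}

for prefix, paths in routes.items():
    print(f"{prefix} -> {best(paths).via}")
# Withdraw the ExpressRoute leg and the IPsec path wins automatically: the
# failover lives in the advertisements, not in anyone's runbook.
```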
Egress and Transit — Design for the Bill
Cloud networking is cheap to provision and expensive to operate. Egress out to the internet is billed. Egress between regions is billed. Egress to peered VPCs can be billed. Traffic through a transit gateway is billed per GB on top of the hourly attachment fee. We have seen customers get a $40,000 monthly surprise because a log shipper in one region was streaming to an ELK cluster in another region over inter-region transit.
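The arithmetic is worth doing before the workload ships, not after the invoice arrives. A back-of-envelope sketch; the per-GB rates below are placeholder figures, so substitute your provider's current price sheet.

```python
# Rough monthly egress math. Rates are illustrative placeholders, USD per GB.
RATES_PER_GB = {
    "internet_egress": 0.09,
    "inter_region": 0.02,
    "transit_gateway": 0.02,   # data processing fee, on top of attachment hours
}

def monthly_cost(gb_per_day, rate_key):
    return gb_per_day * 30 * RATES_PER_GB[rate_key]

# A log shipper streaming 2 TB/day across regions:
print(f"${monthly_cost(2000, 'inter_region'):,.0f}/month")  # ~$1,200 at this rate
```

Scale the shipper up, add a second replication stream and a chatty metrics pipeline, and the surprise invoice stops being surprising.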
Habits that keep the bill in check:
- Keep data where the processing happens. Do not process in us-east-1 and store in us-west-2 unless there is a specific reason.
- Use private endpoints for intra-cloud service access. Traffic to a managed service via private link is usually cheaper than the same traffic over a public endpoint through a NAT gateway, and it is definitely more secure.
- Log egress is egress. SIEM forwarders, metrics shippers, and backup jobs can move terabytes a month. Meter them before they run wild.
- Tag everything and bill it back. FinOps is easier when the team generating the traffic sees the bill.
What We'd Actually Do
For a mid-market customer with two data centers, a single Azure region, and ten branches, the build we deploy most often is: ExpressRoute primary with an IPsec backup, a Virtual WAN hub for east-west inspection, private endpoints for every managed service, private DNS zones with conditional forwarders both ways, SD-WAN for the branches, and a documented one-path-per-prefix routing plan reviewed quarterly. Nothing exotic. All documented. All testable.
Three Takeaways
- Be precise about "private." Private IP, private link, private circuit are three different things, and your compliance and cost story depends on which ones you actually have.
- DNS is the ground truth. Design and test hybrid DNS before you design anything else. Most mystery cloud failures are DNS failures.
- Symmetric routing is not optional. If you cannot draw a single expected path for every source-destination pair, you do not have a design — you have a collection of peerings.
Talk with us about your infrastructure
Schedule a consultation with a solutions architect.