
VPC Design Patterns That Survive Production

Six VPC design decisions you'll regret if you skip them — based on cleaning up the ones where people did.

John Lane · 2022-07-12 · 6 min read

Most VPCs in production today were designed by someone who was in a hurry, following a quickstart guide, and assuming they'd "fix the networking later." Later never comes. A year in, the VPC is carrying real workloads, the CIDR blocks are wrong, the subnets are the wrong size, and every new project has to bolt on more NAT gateways and peering connections to avoid disrupting the production mistake.

This post is the set of decisions we make up front, before we deploy the first EC2 instance or Azure VM into a new virtual network. These are not novel. They are boring. The fact that they're boring is why people skip them, and why we keep getting hired to clean up the result.

Start With Addressing, Not Services

Pick a CIDR block that leaves room for your future self

The single most common mistake we see is a VPC provisioned with a /24 or /23 because that was the default or that's what the tutorial said. Then three projects later the team is trying to peer four VPCs that all overlap on 10.0.0.0/24 and the whole thing has to be rebuilt.

Our default is a /16 for each primary VPC, non-overlapping across the entire organization, with a documented allocation plan. If you're running multi-account or multi-subscription, every account gets a pre-allocated non-overlapping range from a master address plan. Write this down before you deploy anything. An IPAM tool (AWS IPAM, Azure VNet Manager, or even a carefully maintained spreadsheet) is not optional at any real scale.
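To make that concrete, here is a minimal Terraform sketch of what a written-down plan can look like. The account names, region, and ranges are made up; the point is that the allocation lives in code, where review catches an overlap before it ships:

```hcl
provider "aws" {
  region = "us-east-1" # illustrative
}

# Hypothetical org-wide address plan: one non-overlapping /16 per account.
locals {
  address_plan = {
    prod        = "10.16.0.0/16"
    staging     = "10.17.0.0/16"
    dev         = "10.18.0.0/16"
    shared_svcs = "10.19.0.0/16"
  }
}

resource "aws_vpc" "prod" {
  cidr_block           = local.address_plan["prod"]
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = { Name = "prod-primary" }
}
```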

Subnet by function and availability zone, not by project

Subnets should be organized by two axes: what runs in them (public, private-app, private-data, management) and which AZ they're in. That's it. Do not create a subnet for "Project Alpha" and another for "Project Beta." Projects come and go. The subnet structure has to outlive them.

A healthy small VPC looks like three public subnets (one per AZ), three private-app subnets, three private-data subnets, and maybe a dedicated management subnet for bastion hosts and admin tooling. That's 9 to 12 subnets in a /16, each sized at /22 or /20. Plenty of headroom, predictable routing, and the ability to tell at a glance where any instance lives.
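A sketch of that layout in Terraform, building on the /16 above and assuming three AZs (the /20 sizing is illustrative; adjust it to your own plan):

```hcl
data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  azs   = slice(data.aws_availability_zones.available.names, 0, 3)
  tiers = ["public", "private-app", "private-data"]

  # One /20 per (tier, AZ) pair, carved deterministically out of the /16.
  subnets = {
    for pair in setproduct(local.tiers, range(3)) :
    "${pair[0]}-${local.azs[pair[1]]}" => {
      tier = pair[0]
      az   = local.azs[pair[1]]
      cidr = cidrsubnet(aws_vpc.prod.cidr_block, 4, index(local.tiers, pair[0]) * 3 + pair[1])
    }
  }
}

resource "aws_subnet" "this" {
  for_each          = local.subnets
  vpc_id            = aws_vpc.prod.id
  cidr_block        = each.value.cidr
  availability_zone = each.value.az

  tags = {
    Name = each.key
    Tier = each.value.tier
  }
}
```

Nine subnets, named by tier and AZ, and nothing named after a project.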

Six VPC Tactics That Actually Matter in Production

1. Treat public subnets as load balancer real estate only

Nothing with state, nothing with secrets, nothing with a shell should live in a public subnet. Public subnets exist to host load balancers, NAT gateways, and VPN endpoints. Every actual application workload runs in a private subnet with no direct internet route. This is table stakes, but we still find customers with production databases in public subnets with security groups as their only defense. Don't.
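In routing terms, that means only the public tier gets a path to the internet gateway. Continuing the sketch above (the startswith filter assumes Terraform 1.3+):

```hcl
resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.prod.id
}

# Only the public tier routes to the internet gateway.
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.prod.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

resource "aws_route_table_association" "public" {
  for_each       = { for k, s in aws_subnet.this : k => s if startswith(k, "public-") }
  subnet_id      = each.value.id
  route_table_id = aws_route_table.public.id
}
```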

2. Egress should go through a single controlled path

Every private subnet should egress through a known, logged, inspectable path: a NAT gateway in AWS, a NAT gateway or Azure Firewall in Azure. For any serious environment, the right answer is a central egress VPC or hub with a proper firewall (Palo Alto, Fortinet, or the cloud-native equivalents) that does egress filtering, URL categorization, and DNS inspection. This is where compliance, incident response, and data exfiltration defense all live. An egress path with no inspection is a hole in your network you can't see through.
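At the simpler end of that spectrum, the single-VPC version looks like the sketch below: one NAT gateway as the only default route out of the private tiers. This is deliberately minimal; in production you would typically run one NAT gateway per AZ, or point the default route at a firewall endpoint in a central egress VPC instead.

```hcl
resource "aws_eip" "nat" {
  domain = "vpc"
}

# Single controlled egress point for the private tiers.
resource "aws_nat_gateway" "egress" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.this["public-${local.azs[0]}"].id
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.prod.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.egress.id
  }
}

resource "aws_route_table_association" "private" {
  for_each       = { for k, s in aws_subnet.this : k => s if !startswith(k, "public-") }
  subnet_id      = each.value.id
  route_table_id = aws_route_table.private.id
}
```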

3. DNS is load-bearing and usually forgotten

Private DNS is the part of your network that will take down your application in the most confusing way possible. Decide up front: are you using AWS Route 53 private hosted zones, Azure Private DNS zones, or a pair of self-hosted BIND or Windows DNS servers? If hybrid, are you forwarding between your on-prem DNS and the cloud resolvers in both directions? Conditional forwarders configured? Split-horizon working correctly?

Every production incident involving "the app can't find the database" comes back to DNS. Design it first, document it, and test failure modes before you need them.
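On the AWS side, the moving parts are smaller than they sound. A hedged sketch, assuming a private hosted zone for internal names and an on-prem DNS server at an illustrative address:

```hcl
# Private zone for names that only resolve inside the VPC.
resource "aws_route53_zone" "internal" {
  name = "internal.example.com"

  vpc {
    vpc_id = aws_vpc.prod.id
  }
}

resource "aws_security_group" "resolver" {
  name   = "resolver-outbound"
  vpc_id = aws_vpc.prod.id

  egress {
    from_port   = 53
    to_port     = 53
    protocol    = "udp"
    cidr_blocks = ["10.200.0.0/16"] # illustrative on-prem range
  }
}

# Outbound resolver endpoint plus a conditional forwarder for the corp domain.
resource "aws_route53_resolver_endpoint" "outbound" {
  name               = "to-onprem"
  direction          = "OUTBOUND"
  security_group_ids = [aws_security_group.resolver.id]

  ip_address { subnet_id = aws_subnet.this["private-app-${local.azs[0]}"].id }
  ip_address { subnet_id = aws_subnet.this["private-app-${local.azs[1]}"].id }
}

resource "aws_route53_resolver_rule" "corp" {
  name                 = "corp-forwarder"
  domain_name          = "corp.example.com"
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  target_ip { ip = "10.200.0.10" } # illustrative on-prem DNS server
}

resource "aws_route53_resolver_rule_association" "corp" {
  resolver_rule_id = aws_route53_resolver_rule.corp.id
  vpc_id           = aws_vpc.prod.id
}
```

The inbound direction (on-prem forwarding the cloud zones back at an inbound resolver endpoint) is the half that usually gets forgotten. Test both.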

4. Use security groups for identity, NACLs for policy

Security groups in AWS (and their Azure NSG equivalent) are the right tool for workload-to-workload access control. Use them extensively, make them narrow, and reference them by group instead of by IP whenever possible. NACLs are a blunt instrument — use them for broad subnet-level deny rules (block known-bad ranges, deny east-west between tiers) and nothing else. Mixing the two and putting specific application rules in NACLs creates a maintenance nightmare that nobody will remember in a year.
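"Reference by group" in practice, with an illustrative Postgres port:

```hcl
resource "aws_security_group" "app" {
  name   = "app-tier"
  vpc_id = aws_vpc.prod.id
}

resource "aws_security_group" "db" {
  name   = "db-tier"
  vpc_id = aws_vpc.prod.id
}

# The database tier admits members of the app security group,
# not a CIDR range someone has to keep current.
resource "aws_security_group_rule" "app_to_db" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.db.id
  source_security_group_id = aws_security_group.app.id
}
```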

5. Separate your environments with accounts or subscriptions, not VPCs

Do not put prod and dev in the same VPC. Do not even put them in the same AWS account or Azure subscription. The blast radius of a mistake in a shared account is too large, and the IAM/RBAC story is too subtle to trust. Prod gets its own account, dev gets its own account, shared services gets its own account, log archive gets its own account. Connect them with Transit Gateway or Azure VWAN, not VPC peering.

Peering is fine for two-party connections. The moment you have three or more VPCs that need to talk, go straight to a hub-and-spoke transit architecture. The routing stays simple, the security inspection stays centralized, and you can add new spokes without rewiring everything.
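The AWS version of that hub, sketched minimally. In a real multi-account setup the transit gateway lives in a network or shared-services account and is shared to the spokes with AWS RAM; here everything sits in one place for brevity:

```hcl
resource "aws_ec2_transit_gateway" "hub" {
  description                     = "org hub"
  default_route_table_association = "enable"
  default_route_table_propagation = "enable"
}

# Each spoke VPC attaches once; there is no mesh of peering connections to maintain.
resource "aws_ec2_transit_gateway_vpc_attachment" "prod" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = aws_vpc.prod.id
  subnet_ids         = [for k, s in aws_subnet.this : s.id if startswith(k, "private-app-")]
}
```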

6. Flow logs on from day one, stored somewhere you can actually query

VPC flow logs cost almost nothing to enable and are essential for incident response, capacity planning, and egress audit. Turn them on before you deploy a workload, send them to S3 or a log analytics workspace, and set up at least a minimal query capability (Athena, Log Analytics KQL, or a proper SIEM). The day you need them is the day you wish you'd turned them on last year.
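The enable step really is this small. (The bucket policy that lets the flow logs delivery service write, and the Athena table on top, are omitted here for brevity.)

```hcl
resource "aws_s3_bucket" "flow_logs" {
  bucket = "example-org-vpc-flow-logs" # illustrative; bucket names are globally unique
}

resource "aws_flow_log" "vpc" {
  vpc_id                   = aws_vpc.prod.id
  traffic_type             = "ALL"
  log_destination_type     = "s3"
  log_destination          = aws_s3_bucket.flow_logs.arn
  max_aggregation_interval = 60
}
```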

The Things That Get Cut and Shouldn't

Three line items get cut from VPC designs under schedule pressure, and all three are exactly the ones that matter.

Private endpoints / PrivateLink. Routing S3, DynamoDB, storage account, and key vault traffic through the public internet (even if "it's encrypted") sends your data out of your trust boundary unnecessarily, increases latency, and burns NAT gateway dollars. Private endpoints for your cloud provider's core services should be the default, not the upgrade.
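As a sketch of what "default, not upgrade" looks like on AWS: gateway endpoints for S3 and DynamoDB cost nothing and pull that traffic off the NAT path, and interface endpoints (PrivateLink) cover the other services your workloads call. Service names below assume us-east-1:

```hcl
# Gateway endpoint: free, attaches to the private route table.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.prod.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

# Interface endpoint (PrivateLink) for a service the workloads call directly.
resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = aws_vpc.prod.id
  service_name        = "com.amazonaws.us-east-1.secretsmanager"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = [for k, s in aws_subnet.this : s.id if startswith(k, "private-app-")]
}
```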

A separate VPC for shared services. DNS, AD domain controllers, centralized logging, and the jump hosts you use for administration belong in their own VPC with its own lifecycle. Scattering domain controllers across production VPCs means you can never touch the production VPC without risking shared infrastructure.

A tested DR runbook for the VPC itself. If your entire VPC becomes unusable — a bad route table change, a compromised NAT gateway, a regional outage — what's the plan? Infrastructure-as-code that can redeploy the VPC from scratch in a clean account is the answer. If you can't rebuild your VPC from code in under an hour, you don't actually have a DR story, you have a hope.

What We Actually Deliver

When we design a VPC for a customer, the deliverable is Terraform (or Bicep, if it's an Azure-only shop) that creates the whole thing from scratch, including the transit gateway, the central egress, the flow logs, the private endpoints, and the private DNS. The address plan is documented, the subnet allocations are in the code as comments, and the security groups come in a baseline module that gets extended per application.
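This is not our actual module, but the consumption side has roughly this shape: a baseline network module per account, plus a per-application extension of the security group baseline. All module paths, inputs, and outputs below are hypothetical placeholders.

```hcl
# Hypothetical root-module layout; paths, inputs, and outputs are placeholders.
module "network" {
  source     = "./modules/vpc-baseline"
  cidr_block = local.address_plan["prod"]
  az_count   = 3
}

module "app_alpha_security" {
  source            = "./modules/sg-baseline"
  vpc_id            = module.network.vpc_id
  allow_from_groups = [module.network.alb_security_group_id]
}
```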

The whole thing deploys in about 15 minutes. Tearing it down and rebuilding it is a routine part of testing. That's the bar — if you can't rebuild it cleanly, you haven't actually designed it.

Three Takeaways

  1. Design the address plan before you deploy anything. CIDR mistakes cascade into years of pain. A one-hour exercise up front is worth a ten-day rebuild later.
  2. Hub-and-spoke beats peering the moment you have three VPCs. Centralize egress, centralize inspection, centralize DNS. The spokes stay simple.
  3. Flow logs, private endpoints, and infrastructure as code are not optional. They are the difference between a VPC you can operate and one you are afraid to touch.

Talk with us about your infrastructure

Schedule a consultation with a solutions architect.
