Cloud Architecture: Seven Facts We Wish Someone Had Told Us
Cloud architecture decisions made in year one echo for a decade. Here are the seven things we learned the hard way, shared so you don't have to.

I have been designing cloud architectures since before "cloud architect" was a job title. In that time I have been wrong about a lot of things, and watched smart people be wrong about a lot of other things. Some mistakes are only expensive. Some mistakes take years to reverse. This is a list of the seven things I wish someone had told me early, in the hope that somebody reading this can save themselves a cycle.
None of these are "shocking" in the marketing sense. They are shocking in the sense that every engineer who has lived through them has the same reaction: "I cannot believe nobody told me this before I made the decision."
1. The Availability Zone Is Not the Unit of Failure You Think It Is
The public story is that availability zones are independent. Separate buildings, separate power, separate networking. Build across two AZs and you survive a single-AZ failure. This is true at the hardware level and misleading at the operational level, because the control plane is shared. When the control plane has a bad day — IAM, DNS, the API you use to launch instances — every AZ in the region degrades at once, because zonal isolation does nothing for a regional dependency.
What to actually plan for
Design for regional failures, not just zonal failures. Have a runbook that can bring your critical services up in a different region, on a cold or warm standby, in under an hour. Test it. The regional failure mode is rare but not rare enough to ignore, and when it happens it lasts longer than you want it to.
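The decision logic inside that runbook is worth writing down explicitly. Here is a minimal sketch in Python — the region names, thresholds, and step list are illustrative placeholders, not a real deployment:

```python
from dataclasses import dataclass

@dataclass
class RegionHealth:
    region: str
    consecutive_failures: int  # failed health checks in a row
    control_plane_ok: bool     # can you still call the provider's API here?

def should_fail_over(primary: RegionHealth, threshold: int = 3) -> bool:
    """Fail over when the primary has failed `threshold` checks in a row,
    or when its control plane is down — you cannot launch replacement
    capacity even if the data plane limps along."""
    if not primary.control_plane_ok:
        return True
    return primary.consecutive_failures >= threshold

def failover_plan(primary: RegionHealth, standby: str) -> list[str]:
    """Ordered runbook steps; in practice each step is a tested script."""
    if not should_fail_over(primary):
        return []
    return [
        f"promote the database replica in {standby}",
        f"scale warm-standby services in {standby} to production size",
        f"repoint DNS / global load balancer at {standby}",
        f"verify health checks pass in {standby}",
    ]
```

The point of encoding it is that the criteria are decided calmly, in advance, rather than argued about during the incident.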
2. Egress Bandwidth Is the Tax That Compounds
Everybody knows cloud egress is expensive. Almost nobody models it properly in the initial architecture. The model assumes you will serve your customers from inside the cloud, and that most of your traffic will stay between services in the same availability zone, which is free — forgetting that even cross-AZ traffic inside a VPC carries a per-gigabyte charge. This is close enough to true for most applications on day one. By year three, you have added a data warehouse that pulls from production, a reporting service that exports to a customer's S3 bucket, a backup that writes to a different region, a monitoring pipeline that ships logs to a SaaS vendor, and a CDN that pulls origin data from the cloud every time there is a cache miss. Every one of these is an egress line item, and they compound.
The unsexy fix
Put the things that generate a lot of egress behind a CDN — not just the user-facing traffic, but the service-to-service traffic when the service is outside the cloud. Use VPC endpoints for the managed services that support them. Measure the bandwidth line on your bill every month and ask hard questions when it grows faster than revenue.
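A back-of-the-envelope model makes the compounding visible. The sketch below uses assumed rates ($0.09/GB internet, $0.02/GB cross-region) and made-up flow volumes — substitute your provider's actual price sheet and your own traffic numbers:

```python
# Assumed blended rates in USD per GB — placeholders, not any vendor's real prices.
INTERNET_EGRESS_PER_GB = 0.09
CROSS_REGION_PER_GB = 0.02

def monthly_egress_cost(flows: dict[str, tuple[float, str]]) -> float:
    """flows maps a flow name to (GB per month, 'internet' or 'cross-region')."""
    rates = {"internet": INTERNET_EGRESS_PER_GB, "cross-region": CROSS_REGION_PER_GB}
    return sum(gb * rates[kind] for gb, kind in flows.values())

# The year-three flows from the text, with illustrative volumes.
flows = {
    "warehouse pull from prod": (2_000, "cross-region"),
    "exports to customer buckets": (500, "internet"),
    "cross-region backups": (5_000, "cross-region"),
    "logs to SaaS monitoring": (1_500, "internet"),
    "CDN origin misses": (800, "internet"),
}
```

Run the numbers each month and track the trend, not just the total; the individual line items are what tell you which team to talk to.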
3. The Default VPC Is a Trap
Every hyperscaler gives you a default VPC that is pre-configured with public subnets, an internet gateway, and no real segmentation. It is designed for fast starts, and it works for that. It is also designed to keep you from thinking about networking, and the result is that a lot of production workloads end up in the default VPC, exposed directly to the internet, with security group rules that were written in a hurry and never revisited.
What the real thing looks like
A VPC per environment. Private subnets for everything that does not need to accept public traffic, with egress through a NAT gateway or a transit gateway. Public subnets only for load balancers and bastion hosts. A dedicated VPC for shared services — monitoring, logging, DNS, identity — peered or transit-gateway-connected to the workload VPCs. A CIDR plan that leaves room for growth and does not collide with on-prem networks. Plan the network diagram on a whiteboard before you touch a console.
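The CIDR plan is the part most often skipped, and it is also the part you can check mechanically. A minimal sketch with Python's standard `ipaddress` module — the supernet choices here (10.0.0.0/12 for cloud, 10.16.0.0/12 for on-prem) are illustrative assumptions:

```python
import ipaddress

# Assumed allocation: cloud VPCs carved from 10.0.0.0/12, leaving
# 10.16.0.0/12 reserved for on-prem. Adjust to your real address plan.
CLOUD_SUPERNET = ipaddress.ip_network("10.0.0.0/12")
ON_PREM = ipaddress.ip_network("10.16.0.0/12")

def plan_vpcs(names: list[str]) -> dict[str, ipaddress.IPv4Network]:
    """Assign each named VPC the next free /16 inside the cloud supernet."""
    subnets = CLOUD_SUPERNET.subnets(new_prefix=16)
    return {name: next(subnets) for name in names}

def collides_with_on_prem(plan: dict) -> bool:
    """True if any planned VPC overlaps the on-prem range."""
    return any(net.overlaps(ON_PREM) for net in plan.values())
```

Ten minutes with a script like this, before anything is provisioned, is cheaper than renumbering a production VPC later — which is effectively a migration.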
4. Your First Kubernetes Cluster Will Be Your Worst
Kubernetes is a tool that rewards long-term engagement and punishes casual users. The first cluster you build will have bad defaults, insufficient monitoring, no autoscaling strategy, no resource limits on pods, no network policies, and a node pool that is either too small or too big. This is not because you are bad at it. It is because Kubernetes has a learning curve that you cannot skip, and the first pass through the learning curve produces a cluster that is not what you want.
The honest advice
If your team has never run Kubernetes in production, do not build your own cluster. Use a managed offering — EKS, AKS, GKE — and use an opinionated platform on top of it — Flux, ArgoCD, an internal developer platform — that encodes decisions you have not learned how to make yet. Plan to rebuild the cluster from scratch in year two, because by then you will know what you should have done the first time, and the cost of carrying the mistakes forward is higher than the cost of a migration.
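One of the year-one gaps — pods with no resource limits — is also the easiest to lint for. The sketch below is a hypothetical CI check, not a real admission controller, but the field names follow the shape of an actual Kubernetes pod manifest:

```python
def missing_resources(pod_spec: dict) -> list[str]:
    """Return names of containers lacking resource requests or limits."""
    bad = []
    for c in pod_spec.get("spec", {}).get("containers", []):
        res = c.get("resources", {})
        if not res.get("requests") or not res.get("limits"):
            bad.append(c["name"])
    return bad

# A pod manifest as a plain dict: one well-behaved container, one not.
pod = {
    "spec": {
        "containers": [
            {"name": "api",
             "resources": {"requests": {"cpu": "250m", "memory": "256Mi"},
                           "limits": {"cpu": "1", "memory": "512Mi"}}},
            {"name": "sidecar"},  # no resources block: the year-one mistake
        ]
    }
}
```

In a real cluster you would enforce this with an admission policy rather than a script, but the check is the same either way, and running it against your first cluster's manifests is usually sobering.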
5. Cost Attribution Is a First-Class Requirement, Not a Finance Problem
You will eventually need to answer the question "which team is responsible for this cost?" and you will either have the answer on day one or you will spend six months retrofitting it. Tagging policies must be enforced at resource creation time — not audited after the fact — and the tags must be mandatory, not optional. Every resource gets an owner tag, an environment tag, and a cost center tag. Resources without them do not get created.
What happens if you skip this
You hit $100K a month in cloud spend, the CFO asks for a breakdown, you cannot produce one, you hire a FinOps consultant who spends a quarter tagging your environment retroactively, and you discover that 30 percent of your spend is coming from a team that no longer exists or a project that was canceled 18 months ago. This happens roughly every time, and it is completely preventable with a tag enforcement policy on day one.
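The enforcement logic itself is trivial, which is part of why skipping it is so galling in hindsight. A sketch — in practice this lives in an org-level policy or an admission webhook, and the tag keys here are the illustrative ones from above:

```python
# Mandatory tag keys, per the policy described in the text.
REQUIRED_TAGS = {"owner", "environment", "cost-center"}

def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return the mandatory tag keys that are absent or blank."""
    present = {k for k, v in tags.items() if v and v.strip()}
    return sorted(REQUIRED_TAGS - present)

def may_create(tags: dict[str, str]) -> bool:
    """Resources without the mandatory tags do not get created."""
    return not missing_tags(tags)
```

Note that blank values are rejected too — a policy that accepts `owner: ""` produces exactly the untagged sprawl it was meant to prevent.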
6. The "Just Use Managed Services" Advice Has a Hidden Cost
Managed services are genuinely great. They solve real problems. They also create a kind of lock-in that is harder to reason about than "we are on AWS" — they create lock-in to specific behaviors and limits of a specific service that may change under you or may not exist at all when you try to move to a different cloud.
The trade-off to think about
Every managed service you adopt is a bet that the operational savings now are worth more than the flexibility you give up later. For some services this is an obvious win — S3 is a bet every company should take. For some it is more subtle — DynamoDB's single-digit-millisecond latency is wonderful and also the reason you cannot move the application to any other cloud without a rewrite. Decide deliberately which bets you are making, and which services you want to keep portable.
7. The Most Expensive Mistake Is the One That Looks Free
Free tiers, credits, startup programs, and "no commitment" pricing all have a gravitational pull that distorts architecture decisions. Engineers adopt a service because it is free for the first year, and by the time they are paying for it, the application depends on it in a way that is impossible to undo cheaply. This is not a bug in the programs — it is the entire point of the programs, from the vendor's perspective. They work.
The discipline that helps
Evaluate every new service as if you were paying full list price from day one. If you would not adopt it at full price, do not adopt it on credits either. If you would adopt it at full price, the credits are just a nice bonus. This is a discipline that feels overly conservative in the moment and looks obviously correct in hindsight, which is the signature of most good engineering advice.
The Point of All This
Cloud architecture is not about picking the right set of services. It is about making decisions that your future self will thank you for, under conditions that your future self cannot predict. The seven items above are all failure modes I have watched happen to smart teams. Knowing about them in advance does not guarantee you will avoid them, but it meaningfully shifts the odds. Good luck out there.