The startup had just closed a Series A. 50,000 active users, growing 20% month-over-month, a team of 12 engineers. The AWS bill in month three post-funding: $47,000. The previous month: $8,200.
No security breach. No viral traffic spike. Just a combination of architectural decisions that looked fine individually and were catastrophically expensive together.
Here is what happened, why it happens, and how you stop it before it happens to you.
The Four Culprits
1. Data Transfer Costs
AWS’s data transfer pricing is designed to look free until it isn’t. The basic rule: data coming in is free, data going out costs money. But the details are where teams get surprised.
The startup had a microservices architecture spread across three availability zones. Services in us-east-1a regularly called services in us-east-1b. This is good for reliability - you want services distributed across AZs.
It’s also expensive. Cross-AZ data transfer costs $0.01/GB in each direction. Their services were exchanging about 50 TB/month of data across AZs. That’s:
50,000 GB * $0.01 * 2 directions = $1,000/month
That’s not the big number. The big number was their CDN.
They were serving assets from S3 through CloudFront - correct. But their backend API was also returning data that CloudFront wasn’t caching because of missing cache headers. Every API response was triggering a CloudFront-to-origin request, and that origin was in a different region than the CloudFront edge.
CloudFront data transfer out runs $0.085/GB at the first pricing tier. Because nothing was cached, all 120 TB of API traffic flowed through CloudFront at that rate - $10,200/month in data transfer alone, from a configuration error that would have taken 30 minutes to fix.
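The two charges reduce to simple per-GB arithmetic. A back-of-the-envelope model, using the published us-east-1 rates cited above (verify against the current AWS pricing pages before relying on them):

```python
CROSS_AZ_PER_GB = 0.01      # $/GB, charged in each direction
CLOUDFRONT_PER_GB = 0.085   # $/GB, first CloudFront pricing tier

def cross_az_cost(gb_per_month: float) -> float:
    """Cost of inter-AZ traffic, billed on both sides of the transfer."""
    return round(gb_per_month * CROSS_AZ_PER_GB * 2, 2)

def cloudfront_cost(gb_per_month: float) -> float:
    """Cost of uncached traffic flowing out through CloudFront."""
    return round(gb_per_month * CLOUDFRONT_PER_GB, 2)

print(cross_az_cost(50_000))     # 50 TB/month across AZs -> 1000.0
print(cloudfront_cost(120_000))  # 120 TB/month uncached -> 10200.0
```

The doubling in `cross_az_cost` is the part teams miss: $0.01/GB sounds negligible until you remember it applies to both the sending and receiving side.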
2. Unattached Resources
The infrastructure team had spun up EC2 instances, Elastic IPs, EBS volumes, and RDS snapshots throughout the previous year. Some were for experiments that never shipped. Some were for a feature that got deprecated. Some were leftovers from a migration.
Audit results:
- 18 unattached Elastic IPs: $0.005/hour each = $65/month
- 12 unattached EBS volumes totaling 3 TB: $0.08/GB/month for gp3 = $245/month
- 47 RDS automated snapshots from a deleted database: variable, but accumulating
- 8 stopped EC2 instances (still charged for EBS): $180/month
Total from idle resources: ~$490/month. Not the biggest number, but pure waste.
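Finding these doesn't require a third-party tool. A minimal sketch of the audit as pure functions over the response shapes boto3 returns from `describe_addresses` and `describe_volumes` - the sample data below is fabricated, and in practice you would feed in the live API responses:

```python
def unattached_eips(addresses: list[dict]) -> list[dict]:
    """Elastic IPs not associated with any instance."""
    return [a for a in addresses
            if "AssociationId" not in a and "InstanceId" not in a]

def unattached_volumes(volumes: list[dict]) -> list[dict]:
    """EBS volumes in the 'available' state, i.e. attached to nothing."""
    return [v for v in volumes if v.get("State") == "available"]

# Fabricated sample data in the shape boto3's describe_* calls return:
addresses = [
    {"PublicIp": "203.0.113.10", "AllocationId": "eipalloc-1"},  # idle
    {"PublicIp": "203.0.113.11", "AllocationId": "eipalloc-2",
     "AssociationId": "eipassoc-1", "InstanceId": "i-abc"},      # in use
]
volumes = [
    {"VolumeId": "vol-1", "State": "available", "Size": 500},    # idle
    {"VolumeId": "vol-2", "State": "in-use", "Size": 100},
]
print(len(unattached_eips(addresses)), len(unattached_volumes(volumes)))
```

Keeping the filters separate from the API calls means the audit logic is testable without AWS credentials.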
The pattern is universal: infrastructure accumulates. Without a policy for cleaning up experimental resources, the bill grows with the team size.
3. Misconfigured Auto Scaling
Their application layer used EC2 Auto Scaling with a minimum of 2 instances. The maximum was set to 100. Scale-up policy: add 10 instances when CPU > 70%. Scale-down policy: remove 2 instances when CPU < 30%, with a 20-minute cooldown.
A feature launch caused a traffic spike. CPU hit 80%, so Auto Scaling added 10 instances, bringing the group to 12. CPU dropped to 35%. Twenty minutes later, Auto Scaling removed 2; CPU stayed at 35%. Twenty minutes after that, it removed 2 more, and the cascade continued until the group was back at its baseline of 2. Total time: 100 minutes.
During the next spike 6 hours later, they were back to 2 instances and scaled up to 20 again. This happened repeatedly. The scale-down was too aggressive - they repeatedly went to minimum capacity, then scaled up hard when traffic returned.
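The cascade arithmetic can be replayed in a few lines, using the numbers from the incident description (2 instances removed per 20-minute cooldown):

```python
def minutes_to_drain(current: int, minimum: int,
                     step: int = 2, cooldown_min: int = 20) -> int:
    """Minutes for step scaling to walk capacity down to the minimum."""
    minutes = 0
    while current > minimum:
        current = max(minimum, current - step)  # one scale-down event
        minutes += cooldown_min                 # then wait out the cooldown
    return minutes

print(minutes_to_drain(12, 2))  # 12 -> 10 -> 8 -> 6 -> 4 -> 2, i.e. 100
```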
The fix is obvious in hindsight: a higher minimum (5 instead of 2), smaller scale-up steps, and target tracking instead of step scaling, so capacity follows load rather than cascading to the floor and scaling up hard again. The cost during the misconfigured period: about $8,000 of overprovisioning during the rapid scale-up/scale-down cycles.
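A sketch of what replacing the step policies looks like. The request shape matches boto3's `autoscaling.put_scaling_policy`; the group name and the 50% target are illustrative choices, not values from the incident:

```python
policy_request = {
    "AutoScalingGroupName": "app-asg",  # hypothetical ASG name
    "PolicyName": "cpu-target-tracking",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        # Hold average CPU near 50%: Auto Scaling adds and removes
        # capacity to track the target instead of stepping to the floor.
        "TargetValue": 50.0,
    },
}
# In a real script:
#   boto3.client("autoscaling").put_scaling_policy(**policy_request)
print(policy_request["PolicyType"])
```

With target tracking, a single policy handles both directions, which removes the scale-up/scale-down asymmetry that caused the thrashing.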
4. NAT Gateway
This one hits almost everyone eventually. NAT Gateways allow instances in private subnets to access the internet. They charge $0.045/hour per gateway plus $0.045/GB of data processed.
The startup had three NAT Gateways (one per AZ) and their application was pulling software updates, making API calls to third-party services, and streaming logs - all through the NAT Gateway.
Processing charge alone: 85,000 GB * $0.045/GB = $3,825/month. Plus hourly charges: 3 gateways * $0.045/hour * 730 hours = $98.55/month.
The fix: send traffic that doesn’t need to leave AWS (S3, DynamoDB, SQS, etc.) through VPC endpoints instead of the NAT Gateway. VPC endpoints for S3 are free. For other AWS services, interface endpoints cost $0.01/hour but eliminate NAT processing charges. For their workload, this would have reduced NAT costs by roughly 60%.
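The savings estimate is straightforward: every GB routed through a VPC endpoint stops incurring the NAT processing charge. A quick model, where the 60% AWS-bound share is the estimate from the text:

```python
NAT_PER_GB = 0.045  # $/GB NAT Gateway processing charge

def nat_savings(total_gb: float, aws_bound_share: float) -> float:
    """Monthly NAT processing charge avoided by moving AWS-bound
    traffic (S3, DynamoDB, SQS, ...) onto VPC endpoints."""
    return round(total_gb * aws_bound_share * NAT_PER_GB, 2)

print(nat_savings(85_000, 0.60))  # 60% of 85 TB avoided
```

For this workload, that's roughly $2,300/month back - before subtracting the small interface-endpoint hourly charges for non-S3 services.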
The AWS Cost Monitoring Stack (That You Need Before This Happens)
| Tool | Cost | What It Does |
|---|---|---|
| AWS Cost Explorer | Free | Historical spending by service, region, tag |
| AWS Budgets | Free (first 2 alerts) | Alert when spending exceeds threshold |
| AWS Cost Anomaly Detection | Free detection + SNS | ML-based anomaly alerts |
| Infracost | Free OSS | Shows cost impact of Terraform changes in PR |
| CloudHealth / Apptio | $100-500/mo | Advanced allocation and optimization |
Cost Anomaly Detection is the one that would have caught the data transfer spike. It alerts when spending on any service deviates from its learned baseline. Enable it - setup takes 10 minutes, and it has caught runaway spend early for many teams I've worked with.
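Enabling it can also be scripted. A hedged sketch of the request shapes for boto3's Cost Explorer client (`ce`): the monitor watches per-service spend, and the subscriber address is a placeholder.

```python
monitor = {
    "MonitorName": "per-service-spend",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE",  # one baseline per AWS service
}
subscription = {
    "SubscriptionName": "cost-anomaly-alerts",
    "Frequency": "IMMEDIATE",  # alternatives: DAILY, WEEKLY digests
    "Subscribers": [{"Type": "EMAIL", "Address": "oncall@example.com"}],
    # MonitorArnList is filled in from the create_anomaly_monitor response.
}
# In a real script:
#   ce = boto3.client("ce")
#   arn = ce.create_anomaly_monitor(AnomalyMonitor=monitor)["MonitorArn"]
#   ce.create_anomaly_subscription(
#       AnomalySubscription={**subscription, "MonitorArnList": [arn]})
print(monitor["MonitorDimension"])
```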
The Quick Audit Checklist
Run this monthly:
- Unattached Elastic IPs (any IP not associated with a running instance costs $0.005/hr)
- Unattached EBS volumes
- Old EBS snapshots (set a lifecycle policy)
- Stopped EC2 instances (you still pay for EBS)
- Unused load balancers
- Old CloudWatch log groups with no retention policy
- RDS instances in stopped state (AWS restarts them after 7 days if not addressed)
- NAT Gateway data volume vs. VPC endpoint alternative
Most teams save 15-25% of their AWS bill on the first pass through this checklist.
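Two more checks from the list, again written as pure filters over the boto3 response shapes (`describe_instances` instances and `describe_log_groups` log groups) so they run without credentials; the sample data is fabricated:

```python
def stopped_instances(instances: list[dict]) -> list[str]:
    """Instance IDs in the 'stopped' state (their EBS volumes still bill)."""
    return [i["InstanceId"] for i in instances
            if i["State"]["Name"] == "stopped"]

def unbounded_log_groups(groups: list[dict]) -> list[str]:
    """Log groups with no retention policy - logs are kept forever."""
    return [g["logGroupName"] for g in groups
            if "retentionInDays" not in g]

instances = [
    {"InstanceId": "i-1", "State": {"Name": "running"}},
    {"InstanceId": "i-2", "State": {"Name": "stopped"}},
]
groups = [
    {"logGroupName": "/app/api", "retentionInDays": 30},
    {"logGroupName": "/app/debug"},  # retention never set
]
print(stopped_instances(instances), unbounded_log_groups(groups))
```

Wire filters like these into a scheduled job that posts to Slack or email, and the monthly audit mostly runs itself.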
Bottom Line
The $47,000 bill came from four fixable problems: missing CloudFront cache headers, unattached resources, misconfigured auto scaling, and NAT Gateway processing fees. None required architectural changes - they were configuration errors and cleanup tasks. Set up Cost Anomaly Detection before you need it, tag every resource with team and project, and audit for idle resources monthly. The 2-hour monthly audit saves more money than any infrastructure optimization.