A healthcare SaaS company came to us with a simple problem: their AWS bill had grown from $12K/month to $38K/month in 18 months, but their user base had only doubled. Costs were scaling superlinearly when, with any economy of scale, they should have been sublinear.
Their VP of Engineering put it bluntly: "We have 50,000 users and we're spending $38K/month. Our competitor has 200,000 users and spends less than us. What are we doing wrong?"
After a 2-week audit and 6 weeks of implementation, we brought their bill down to $18,200/month — a 52% reduction, saving $237,600/year. Here's every single thing we changed.
The Audit
We started by categorizing spend by AWS service:
| Service | Monthly Cost | % of Total |
|---|---|---|
| EC2 (EKS nodes) | $16,400 | 43% |
| RDS (PostgreSQL) | $7,200 | 19% |
| ElastiCache (Redis) | $3,100 | 8% |
| S3 + CloudFront | $2,800 | 7% |
| NAT Gateway | $2,600 | 7% |
| Data Transfer | $2,400 | 6% |
| EBS Volumes | $1,800 | 5% |
| Other | $1,700 | 5% |
| Total | $38,000 | 100% |
Every single line item had optimization potential. Let's go through them.
1. EC2/EKS: Right-Size + Spot + Karpenter ($16,400 → $6,800)
This was the biggest win. Their EKS cluster was running on m5.2xlarge On-Demand instances because "that's what the AWS Quick Start guide suggested."
Changes:
- Replaced Cluster Autoscaler with Karpenter
- Added 15 instance types to the allowed list
- Moved 75% of workloads to Spot instances
- Right-sized every pod based on 2 weeks of VPA data
- Added HPA to all stateless services
We wrote about the Karpenter setup in detail in our Karpenter + Spot + Scale-to-Zero post.
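The shape of that setup looks roughly like the NodePool below (Karpenter v1 API). The name, instance list, and limits here are illustrative, not the client's actual config — they allowed 15 instance types, we show a few:

```yaml
# Illustrative Karpenter NodePool: Spot-first capacity across several
# instance types, with consolidation enabled to pack nodes tightly.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge", "m5a.xlarge", "m6i.large", "m6i.xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "200"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```

The wider the instance-type list, the more Spot pools Karpenter can draw from, which is what keeps interruption rates manageable at 75% Spot.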
Savings: $9,600/month (58%)
2. RDS: Reserved Instances + Read Replicas ($7,200 → $3,600)
They were running a db.r6g.2xlarge PostgreSQL RDS instance — On-Demand, Multi-AZ. The database was at 15% CPU utilization on average.
Changes:
- Downsized to db.r6g.xlarge (CPU only hit 40% during peak on the smaller instance)
- Purchased a 1-year All Upfront Reserved Instance (42% discount)
- Added a read replica for analytics queries that were hammering the primary
- Moved nightly batch jobs to hit the replica instead of primary
```sql
-- Before: analytics queries on primary
SELECT date_trunc('day', created_at), count(*)
FROM patient_records
WHERE created_at > now() - interval '90 days'
GROUP BY 1;

-- After: same query routed to read replica via connection string
-- analytics_db_url = postgres://replica-endpoint:5432/healthdb
```
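On the application side, the routing can be as simple as picking a DSN per workload. This sketch uses hypothetical environment-variable names and workload labels, not the client's actual code:

```python
import os

# Hypothetical endpoints -- the real hostnames are specific to the client's VPC.
PRIMARY_DSN = os.environ.get("DATABASE_URL", "postgres://primary-endpoint:5432/healthdb")
REPLICA_DSN = os.environ.get("ANALYTICS_DB_URL", "postgres://replica-endpoint:5432/healthdb")

# Read-only workloads that tolerate a slightly lagging replica.
READ_ONLY_WORKLOADS = {"analytics", "reporting", "nightly_batch"}

def dsn_for(workload: str) -> str:
    """Route read-only workloads to the replica; writes stay on the primary."""
    return REPLICA_DSN if workload in READ_ONLY_WORKLOADS else PRIMARY_DSN
```

The key design point is that replica routing is decided by workload, not per query — batch jobs and dashboards get the replica DSN at startup and never touch the primary.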
Savings: $3,600/month (50%)
3. ElastiCache: Right-Size + Reserved ($3,100 → $1,400)
Running cache.r6g.xlarge with 3% memory utilization. They were caching session data for 50K users — that fits in a cache.r6g.large with room to spare.
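A back-of-envelope check supports that sizing claim. The session size and overhead factor below are our assumptions, not measured values:

```python
# Do 50K sessions fit comfortably in a cache.r6g.large (13.07 GiB of memory)?
SESSIONS = 50_000
AVG_SESSION_BYTES = 8 * 1024   # assumption: ~8 KB serialized session
OVERHEAD_FACTOR = 1.5          # rough allowance for per-key overhead + fragmentation
R6G_LARGE_GIB = 13.07          # ElastiCache cache.r6g.large memory

needed_gib = SESSIONS * AVG_SESSION_BYTES * OVERHEAD_FACTOR / 2**30
print(f"~{needed_gib:.2f} GiB needed vs {R6G_LARGE_GIB} GiB available")
```

Even with generous assumptions, session data for 50K users needs well under 1 GiB — more than an order of magnitude of headroom.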
Changes:
- Downsized to cache.r6g.large
- Purchased a 1-year Reserved Instance
- Implemented TTL on all cache keys (they had 2M keys with no expiry)
Savings: $1,700/month (55%)
4. NAT Gateway: The Silent Budget Killer ($2,600 → $800)
This one surprised everyone. NAT Gateway charges $0.045/GB for data processing — and their pods were pulling Docker images through NAT on every deploy.
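To see where $2,600/month of NAT spend can come from, here's a rough reconstruction. The $0.045/GB processing rate is AWS's us-east-1 price; the hourly rate and the three-gateway count (one per AZ) are our assumptions:

```python
# Rough reconstruction of a $2,600/month NAT Gateway bill.
HOURLY_RATE = 0.045     # $/hour per gateway (us-east-1, assumption)
DATA_RATE = 0.045       # $/GB processed
GATEWAYS = 3            # assumption: one per AZ
HOURS_PER_MONTH = 730

hourly_cost = GATEWAYS * HOURLY_RATE * HOURS_PER_MONTH
implied_gb = (2600 - hourly_cost) / DATA_RATE
print(f"hourly: ${hourly_cost:.0f}/mo, implied processing: ~{implied_gb/1000:.0f} TB/mo")
```

The hourly charges are pocket change; nearly the whole bill is data processing — roughly 55 TB/month of traffic that mostly didn't need to cross a NAT at all.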
Changes:
- Configured ECR VPC endpoints (no more NAT for image pulls)
- Added S3 VPC endpoint (logs and backups were going through NAT)
- Configured STS and CloudWatch VPC endpoints
- Moved non-essential traffic to instances with public IPs
```bash
# VPC endpoints we added -- image pulls need both ecr.api and ecr.dkr,
# and the S3 Gateway endpoint attaches to route tables instead of subnets
aws ec2 create-vpc-endpoint --vpc-id vpc-xxx \
    --service-name com.amazonaws.us-east-1.ecr.api \
    --vpc-endpoint-type Interface --subnet-ids subnet-xxx --security-group-ids sg-xxx

aws ec2 create-vpc-endpoint --vpc-id vpc-xxx \
    --service-name com.amazonaws.us-east-1.ecr.dkr \
    --vpc-endpoint-type Interface --subnet-ids subnet-xxx --security-group-ids sg-xxx

aws ec2 create-vpc-endpoint --vpc-id vpc-xxx \
    --service-name com.amazonaws.us-east-1.s3 \
    --vpc-endpoint-type Gateway --route-table-ids rtb-xxx
```
Savings: $1,800/month (69%)
NAT Gateway costs are one of the most overlooked line items in AWS bills. Every company we audit is overpaying for NAT.
5. S3 + CloudFront: Lifecycle Policies + Compression ($2,800 → $1,600)
They were storing every version of every file forever. Medical document uploads from 3 years ago were still in S3 Standard.
Changes:
- S3 Intelligent-Tiering for all buckets (auto-moves cold data to cheaper tiers)
- Lifecycle policy: move to Glacier after 1 year for compliance archives
- Enabled CloudFront compression (Brotli) — reduced bandwidth 40%
- Configured proper cache headers — CDN hit ratio went from 60% to 94%
```json
{
  "Rules": [
    {
      "ID": "ArchiveOldDocuments",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [
        {"Days": 90, "StorageClass": "STANDARD_IA"},
        {"Days": 365, "StorageClass": "GLACIER"}
      ]
    }
  ]
}
```
Savings: $1,200/month (43%)
6. Data Transfer: Keep Traffic Inside the VPC ($2,400 → $1,200)
Cross-AZ data transfer charges were eating them alive. Services in us-east-1a were talking to services in us-east-1c, paying $0.01/GB each way.
Changes:
- Configured topology-aware routing in Kubernetes (prefer same-AZ)
- Moved chatty services into the same AZ
- Compressed inter-service payloads (gRPC with protobuf instead of JSON)
```yaml
# Topology-aware routing: prefer same-AZ endpoints when capacity allows
apiVersion: v1
kind: Service
metadata:
  name: user-service
  annotations:
    service.kubernetes.io/topology-mode: Auto
```
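Their payload fix was gRPC with protobuf, but the effect of compact encodings is easy to demonstrate with a stand-in — here, zlib over a repetitive JSON payload:

```python
import json
import zlib

# Illustrative payload: 1,000 small records, the kind of chatty
# inter-service response that racks up cross-AZ transfer charges.
records = [{"user_id": i, "status": "active", "plan": "pro"} for i in range(1000)]
payload = json.dumps(records).encode()
compressed = zlib.compress(payload)
ratio = len(compressed) / len(payload)
print(f"{len(payload)} -> {len(compressed)} bytes ({ratio:.0%})")
```

At $0.01/GB each way, shrinking inter-service payloads by half or more compounds across every hop.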
Savings: $1,200/month (50%)
7. EBS Volumes: Delete Orphans + Change Types ($1,800 → $800)
22 unattached EBS volumes sitting there doing nothing. PersistentVolumes from deleted pods that nobody cleaned up. Classic.
Changes:
- Deleted 22 orphaned EBS volumes (saved $400/month immediately)
- Changed GP2 volumes to GP3 (20% cheaper, better performance)
- Reduced snapshot frequency from hourly to daily for non-critical volumes
```bash
# Find orphaned volumes
aws ec2 describe-volumes \
    --filters Name=status,Values=available \
    --query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
    --output table
```
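The "20% cheaper" figure falls straight out of the per-GB rates (us-east-1 pricing at the time of writing; the fleet size below is illustrative, not the client's):

```python
# gp2 vs gp3 base storage pricing, us-east-1: $0.10 vs $0.08 per GB-month.
GP2_RATE = 0.10
GP3_RATE = 0.08

saving_pct = (GP2_RATE - GP3_RATE) / GP2_RATE
total_gib = 5000  # illustrative fleet size
monthly_saving = total_gib * (GP2_RATE - GP3_RATE)
print(f"{saving_pct:.0%} cheaper; ${monthly_saving:.0f}/month on {total_gib} GiB")
```

gp3 also decouples IOPS and throughput from volume size, so the "better performance" half of the claim comes free at the baseline 3,000 IOPS.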
Savings: $1,000/month (56%)
The Final Scorecard
| Service | Before | After | Savings | % |
|---|---|---|---|---|
| EC2/EKS | $16,400 | $6,800 | $9,600 | 58% |
| RDS | $7,200 | $3,600 | $3,600 | 50% |
| ElastiCache | $3,100 | $1,400 | $1,700 | 55% |
| NAT Gateway | $2,600 | $800 | $1,800 | 69% |
| S3/CloudFront | $2,800 | $1,600 | $1,200 | 43% |
| Data Transfer | $2,400 | $1,200 | $1,200 | 50% |
| EBS | $1,800 | $800 | $1,000 | 56% |
| Other | $1,700 | $2,000 | -$300 | -18% |
| Total | $38,000 | $18,200 | $19,800 | 52% |
"Other" went up slightly because we added monitoring tools (Kubecost, custom exporters) that have a small compute cost. Worth every penny.
Annual savings: $237,600
The entire project — audit, implementation, testing, documentation — took 8 weeks and cost them a fraction of one month's savings.
The Most Important Change
The technical optimizations were important, but the cultural change mattered more. We installed our FinOps dashboard on day one of the project, so the team could see costs in real-time from the start.
By week 3, engineers were coming to us with optimization ideas we hadn't thought of. One developer noticed their service was making 10x more S3 API calls than necessary due to a missing cache layer. Another found a cron job that was spinning up a large instance for 2 minutes every hour.
When you make costs visible, engineers optimize naturally. They just need the data.
AWS bill growing faster than your user base? That's normal — and fixable. Get a free infrastructure assessment and we'll show you exactly where the waste is.