Why your Aurora Postgres bill doubled in 6 months — and it isn't traffic

Aurora bills don't double because of traffic. They double because of I/O patterns nobody is watching. Here's the autopsy, with the fixes a senior DBA would actually ship.

TL;DR

Aurora bills grow fastest from I/O cost, not compute. A single unindexed query scanning a 200M-row table at 50 RPS will quietly add four figures a month.
The three silent killers: sequential scans on cold buffer pages, idle replicas billing instance-hours and read I/O for work nobody needs, and dev/stage clusters left on r6g.4xlarge because nobody owns lifecycle.
You don't need a bigger instance. You need someone (or something) watching the workload, every minute, and rewriting the top 10 offenders.

The autopsy

Last quarter I sat with a Series B team whose Aurora Postgres bill went from $11k to $24k in six months. Traffic? Up 18%. Bill? Up 118%.

We pulled pg_stat_statements ordered by shared_blks_read and immediately found the usual suspects:

SELECT query, calls, total_exec_time, mean_exec_time, rows, shared_blks_read
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 20;

The top offender was a single endpoint:

SELECT u.id, u.email, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id AND o.created_at > now() - interval '30 days'
WHERE u.tenant_id = $1
GROUP BY u.id;

Looks innocent. EXPLAIN (ANALYZE, BUFFERS) told a different story:

HashAggregate  (cost=412901.32..412901.42 rows=10 width=44)
  ->  Hash Right Join  (cost=8431.10..401204.55 rows=2339355 width=40)
        Hash Cond: (o.user_id = u.id)
        Buffers: shared hit=1842 read=287431
        ->  Seq Scan on orders o  ...
              Filter: (created_at > now() - '30 days'::interval)
              Rows Removed by Filter: 198212443

287,431 buffer reads per call. At 40 calls/min that is about 11.5M reads per minute, or roughly 690M reads/hour if every call runs cold, all landing on Aurora's storage layer. Aurora charges per I/O. Do the math.
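To make "do the math" concrete, here is a minimal sketch of the arithmetic. It treats every call as reading the full 287,431 blocks seen in the EXPLAIN output, so it is a cold-cache upper bound rather than a prediction; in practice you would plug in the average from pg_stat_statements (shared_blks_read divided by calls). The price per million I/O requests is an assumed placeholder, not a quoted rate, so substitute the figure for your region and storage configuration.

```python
# Back-of-the-envelope read I/O estimate for a single query.
# ASSUMPTION: 0.20 USD per million I/O requests is a placeholder rate;
# look up the actual rate for your region and storage configuration.
ASSUMED_USD_PER_MILLION_IOS = 0.20
HOURS_PER_MONTH = 730  # average month length in hours


def reads_per_hour(reads_per_call: int, calls_per_min: float) -> float:
    """Storage reads per hour if every call reads `reads_per_call` blocks."""
    return reads_per_call * calls_per_min * 60


def monthly_read_io_cost(
    reads_per_call: int,
    calls_per_min: float,
    usd_per_million_ios: float = ASSUMED_USD_PER_MILLION_IOS,
) -> float:
    """Monthly read I/O cost in USD under the assumed per-million rate."""
    monthly_reads = reads_per_hour(reads_per_call, calls_per_min) * HOURS_PER_MONTH
    return monthly_reads / 1_000_000 * usd_per_million_ios


if __name__ == "__main__":
    # Figures from the EXPLAIN output and call rate quoted in the text.
    hourly = reads_per_hour(287_431, 40)
    print(f"{hourly:,.0f} reads/hour")
    print(f"${monthly_read_io_cost(287_431, 40):,.2f}/month (cold-cache upper bound)")
```

Swapping in the measured per-call average and the real call rate turns this from a worst case into a usable estimate for that one query.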

The three silent killers

1. Sequential scans on cold pages

This is the one above. Buffer cache hit ratios look fine in aggregate, but a handful of queries do all the cold reads, and Aurora bills you for every one of them. A partial index on a rolling 30-day window is not an option: Postgres rejects now() in an index predicate because the function is not IMMUTABLE. A composite index, CREATE INDEX CONCURRENTLY ON orders (user_id, created_at), serves the same lookup and drops the query to 412 buffer reads. That is more than a 99% reduction in read I/O for that endpoint.

2. Idle replicas you forgot about

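That read replica spun up for the analytics migration in March? It's still there. Aurora replicas share the cluster's storage volume, so the replica adds no extra copy of the data, but it is not free: you pay for its instance around the clock, plus read I/O for every page it pulls from storage.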

Check aurora_replica_status() and the actual read traffic. If a replica is doing <50 QPS, kill it or downsize the cluster.

3. Stage clusters running production sizes

Walk into any 50-engineer company and you'll find a staging-aurora-cluster on db.r6g.4xlarge that handles 3 QPS during business hours. ~$1,100/month for the instance, plus storage I/O on three nodes. Stop it outside work hours: Aurora can stop a cluster for up to seven days at a time, and a stopped cluster bills for storage but not for instance hours. Most teams forget you can do this.
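A rough sketch of what a schedule is worth, using the ~$1,100/month instance figure above. The 10-hours-a-day, 5-days-a-week window is an assumption for illustration, and the calculation covers instance cost only; storage is still billed while the cluster is stopped.

```python
HOURS_PER_WEEK = 24 * 7


def scheduled_monthly_cost(
    always_on_monthly_usd: float,
    hours_on_per_day: float,
    days_on_per_week: int,
) -> float:
    """Instance cost if the cluster only runs during the given weekly window."""
    if not 0 <= hours_on_per_day <= 24:
        raise ValueError("hours_on_per_day must be between 0 and 24")
    if not 0 <= days_on_per_week <= 7:
        raise ValueError("days_on_per_week must be between 0 and 7")
    on_hours = hours_on_per_day * days_on_per_week
    return always_on_monthly_usd * on_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    always_on = 1_100.0  # instance figure quoted in the text
    # ASSUMPTION: staging is needed 10 hours a day, Monday to Friday.
    scheduled = scheduled_monthly_cost(always_on, hours_on_per_day=10, days_on_per_week=5)
    print(f"scheduled: ${scheduled:,.2f}/month")
    print(f"saved:     ${always_on - scheduled:,.2f}/month")
```

A cluster that is on for 50 of the week's 168 hours pays for about 30% of the always-on instance cost, which is why this is usually the cheapest fix on the list.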

Why DBAs don't catch this

Not for lack of skill. The interface is wrong. pg_stat_statements, aurora_stat_db, CloudWatch metrics, and the slow query log live in three different places, none of which surface cost attribution per query. AWS's billing dashboard tells you total I/O cost but not which query caused it.

The gap between "I see this is expensive" and "I know which line of application code to change" is where the money disappears.
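One crude way to close that gap by hand is to split the read I/O line of the bill across queries in proportion to their shared_blks_read. The sketch below is an assumed simplification, not how any particular tool does it: it ignores write I/O, and it only makes sense if the pg_stat_statements counters cover the same period as the bill (snapshot them at the start and end and use the difference). The query labels and numbers in the usage section are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class StatementStats(NamedTuple):
    query: str             # query text or a label for it
    shared_blks_read: int  # blocks read from storage during the billing period


def attribute_read_io_cost(
    stats: Iterable[StatementStats], read_io_bill_usd: float
) -> Dict[str, float]:
    """Split a read I/O bill across queries by their share of blocks read."""
    rows = list(stats)
    total_reads = sum(row.shared_blks_read for row in rows)
    if total_reads == 0:
        return {row.query: 0.0 for row in rows}
    return {
        row.query: read_io_bill_usd * row.shared_blks_read / total_reads
        for row in rows
    }


if __name__ == "__main__":
    # HYPOTHETICAL figures for illustration only.
    sample = [
        StatementStats("orders-per-user report", 9_000_000),
        StatementStats("login lookup", 600_000),
        StatementStats("everything else", 400_000),
    ]
    costs = attribute_read_io_cost(sample, read_io_bill_usd=5_000.0)
    for query, usd in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{query:<26} ${usd:>9,.2f}")
```

Even this rough split is usually enough to show that a handful of statements account for most of the read I/O spend.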

What DeepSQL does about this

DeepSQL connects to your Aurora cluster read-only and runs continuously. It maps every active query to its share of I/O and compute cost, so you know — in dollars — which query is eating the bill. It auto-generates the EXPLAIN ANALYZE, proposes the rewrite or the index, and runs a synthetic benchmark against a sampled workload before opening a PR. For idle replicas and oversized stage clusters, it surfaces utilization heatmaps and a one-click downsize plan with the expected savings. On the teams running it today, the typical first-month reduction is ~38% off the managed-database bill — entirely from changes you could have made yourself, if anyone had the time to watch the database 24/7.