The Query That Ate the Warehouse
And How We Starved It. 41 minutes → 10 minutes. 75% cheaper. Zero application changes.
The problem
It started with a number on a cost dashboard. A single query. 41 minutes of runtime. Over a terabyte of data scanned. Nearly $30 in compute credits (Snowflake's billing unit — each credit costs real money). Per. Run.
Not per day. Per run. And these things were running dozens of times a day.
"The first step to solving a problem is recognising there is one." — Anonymous, but accurate
The query was a profile report — a large analytical query that pulls audience data from multiple independent data sources and assembles it into a single result. Think of it as a puzzle with 13 pieces. The problem wasn't the pieces. The problem was how they were being assembled: one at a time, in sequence, waiting for each previous piece before picking up the next.
The query was built as a series of UNION ALL branches (a SQL pattern that stacks results from multiple queries end to end) — one branch per data source. Thirteen data sources. Thirteen branches. Running back to back.
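A hypothetical miniature of that shape, with invented branch names, makes the key property visible: the branches share no state, so splitting the monolith back apart is trivial.

```python
# A miniature of the original query's shape: independent branches
# stacked with UNION ALL. Source names are invented for illustration.
branches = [
    f"SELECT audience_id, metric FROM source_{i} WHERE profile_id = :pid"
    for i in range(1, 14)  # thirteen data sources, thirteen branches
]
monolith = "\nUNION ALL\n".join(branches)

# Because UNION ALL branches don't depend on each other, the monolith
# decomposes cleanly back into standalone queries.
independent = monolith.split("\nUNION ALL\n")
print(len(independent))  # 13
```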
Why Snowflake couldn't fix this on its own
Snowflake has a genuinely impressive query optimizer. Skipping data it doesn't need, pushing filters down early, eliminating irrelevant chunks before they're read — it does a lot of clever work behind the scenes. But there are things it fundamentally will not do.
It will not rewrite your query structure.
If you hand it a sequential UNION ALL, it runs it sequentially. It's not going to look at 13 completely independent branches and decide to run them in parallel. That's not its job. Its job is to execute the query you gave it as efficiently as possible — not to question whether the query itself is the problem.
"Garbage in, garbage out applies to query design too. The optimizer can polish a bad query. It cannot reinvent it."
On top of that, the query had no help for the optimizer at all. No sorted identifiers to guide partition pruning. No pre-filtering to eliminate rows early. No hints. Just raw branches scanning everything, hoping for the best. The optimizer was doing its best with one hand tied behind its back.
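As a concrete, invented illustration of what "help" can look like: sort the identifiers you're probing for and bracket them with a min/max range, so Snowflake's per-partition metadata can skip everything outside it. Table and column names here are hypothetical.

```python
# Hypothetical audience identifiers to look up in one branch.
ids = [1042, 1003, 1017]
ids.sort()  # sorted identifiers keep the probe range narrow and contiguous

# Pre-filter with a BETWEEN bracket so partition pruning can discard
# micro-partitions whose id ranges fall outside it, then narrow to the
# exact set with IN.
optimised = (
    "SELECT audience_id, metric FROM source_a "
    f"WHERE audience_id BETWEEN {ids[0]} AND {ids[-1]} "
    f"AND audience_id IN ({', '.join(map(str, ids))})"
)
print(optimised)
```

The BETWEEN clause is redundant logically, but it gives the pruning machinery a single cheap range check instead of a set membership test per partition.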
What we actually did about it
The insight, once you see it, is embarrassingly obvious: these branches don't need each other. They are completely independent. Data source A doesn't care what data source B is doing. So why are they waiting in line like it's a Monday morning coffee queue?
The approach was straightforward:
1 Split the monolith
Decompose the single giant UNION ALL query into individual branch queries. Each data source gets its own query. They stop waiting for each other.
2 Optimise each branch
With each branch isolated, apply targeted optimisations: sorted identifiers to help Snowflake's pruning, pre-filtering to eliminate rows before the heavy lifting, and structure hints so the optimizer knows exactly where to look.
3 Run everything in parallel
Fire all branches simultaneously across available compute. Instead of 13 operations waiting in sequence, 13 operations running at the same time. Total runtime becomes whatever the slowest single branch takes — not the sum of all branches.
4 Collect and aggregate
Each branch writes its results to a temporary holding table (called a transient table in Snowflake). Once all branches complete, a final aggregation step stitches everything back together. Same output. Completely different path to get there.
This logic was packaged into a reusable program stored in the database (a stored procedure) — RUN_PARALLEL_COMPOSITION — that can take any query structured this way, decompose it automatically, execute in parallel, and aggregate the results. No manual rewriting. No application changes. Feed it a query, get back an optimised result.
The numbers
| Metric | Original Query | Optimised Query |
|---|---|---|
| Execution time | 41.9 minutes | 10.5 minutes |
| Speedup | — | 4x faster |
| Data scanned | 1,100 GB | 1,117 GB |
| Credits consumed | 11.2 credits | 2.8 credits |
| Estimated cost | ~$30 per run | ~$7.50 per run |
| Cost reduction | — | 75% lower |
Same warehouse. Same data. Same output. Just a different execution model.
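The headline figures reduce to two ratios, and a quick check confirms they hold:

```python
# Figures from the comparison table above.
original_minutes, optimised_minutes = 41.9, 10.5
original_cost, optimised_cost = 30.0, 7.50

speedup = original_minutes / optimised_minutes
saving = 1 - optimised_cost / original_cost

print(f"{speedup:.1f}x faster")  # 4.0x faster
print(f"{saving:.0%} cheaper")   # 75% cheaper
```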
"We didn't scan less data. We just stopped making the data wait in line."
The part where Claude did the heavy lifting
Here's the part I enjoy telling the most.
The query analysis, the bottleneck identification, the execution plan investigation, the prototype stored procedure, the parallel execution framework — all of it was built in collaboration with Claude. I didn't spend nights manually tracing execution plans. I didn't write the SQL from scratch. I described the problem, shared the structure, and worked through it iteratively with an AI that could read execution plans, suggest optimisation techniques, and write working implementations while I validated the approach.
What the human did
- Identified that cost was anomalous and worth investigating
- Understood the business context — what the query was for and why it mattered
- Validated that the parallel approach produced correct results
- Made the architectural call on execution model
- Decided when the solution was good enough to ship
What Claude did
- Analysed the query structure and identified the sequential bottleneck
- Proposed and tested optimisation techniques for each branch
- Built the RUN_PARALLEL_COMPOSITION stored procedure
- Ran experiments across multiple query variations
- Generated the performance comparison data
Not a single line of the stored procedure was written by hand. The entire prototype went from concept to working code in a single session. The gap between "I understand the problem" and "here is production-ready code" collapsed to almost nothing.
That's not AI replacing engineers. That's AI making engineers unreasonably effective.
What you can steal from this
- Look at your sequential UNION ALL queries. If the branches don't depend on each other, they have no business waiting for each other.
- Help Snowflake's optimizer. Sort your identifiers. Pre-filter early. Don't make it guess what you need.
- Think about parallelism at query design level — not just at the warehouse level. Scaling up the warehouse helps when many queries run at once. Parallel execution helps individual query performance. They're different tools.
- Use transient tables as checkpoints. Cheap to store (they skip Snowflake's Fail-safe retention), fast to write, and they make parallel aggregation trivial.
- Describe your problem to an AI before you start writing code. You might be surprised how fast you get to a working solution when you're not the one writing every line.
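The checkpoint pattern from the list above, sketched as generated statements. Table and branch names are invented; `CREATE TRANSIENT TABLE ... AS` is standard Snowflake syntax.

```python
# Hypothetical branch queries, one per data source.
branches = {
    "source_a": "SELECT audience_id, metric FROM source_a",
    "source_b": "SELECT audience_id, metric FROM source_b",
}

# One transient checkpoint table per branch: cheap to hold, fast to
# write, dropped once the run is done.
checkpoints = [
    f"CREATE TRANSIENT TABLE tmp_{name} AS {sql}"
    for name, sql in branches.items()
]

# The final aggregation stitches the checkpoints back together.
aggregate = "\nUNION ALL\n".join(
    f"SELECT * FROM tmp_{name}" for name in branches
)
print(aggregate)
```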
Case study by Gautam Marya.
Disclosure: This page was written with Claude (Anthropic). The analysis, the approach, and the architectural decisions are Gautam's. Claude built the implementation. That's the whole point.