The Query That Ate the Warehouse
And How We Starved It. 41 minutes → 10 minutes. 75% cheaper. Zero application changes.
The problem
It started with a number on a cost dashboard. A single query. 41 minutes of runtime. Over a terabyte of data scanned. Nearly $30 in compute credits (Snowflake's billing unit — each credit costs real money). Per. Run.
Not per day. Per run. And these things were running dozens of times a day.
"The first step to solving a problem is recognising there is one." — Anonymous, but accurate
The query was a profile report — a large analytical query that pulls audience data from multiple independent data sources and assembles it into a single result. Think of it as a puzzle with 13 pieces. The problem wasn't the pieces. The problem was how they were being assembled: one at a time, in sequence, waiting for each previous piece before picking up the next.
The query was built as a series of UNION ALL branches (a SQL pattern that stacks results from multiple queries end to end) — one branch per data source. Thirteen data sources. Thirteen branches. Running back to back.
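A hypothetical miniature of that shape, with invented branch names, makes the key property visible: the branches share no state, so splitting the monolith back apart is trivial.

```python
# A miniature of the original query's shape: independent branches
# stacked with UNION ALL. Source names are invented for illustration.
branches = [
    f"SELECT audience_id, metric FROM source_{i} WHERE profile_id = :pid"
    for i in range(1, 14)  # thirteen data sources, thirteen branches
]
monolith = "\nUNION ALL\n".join(branches)

# Because UNION ALL branches don't depend on each other, the monolith
# decomposes cleanly back into standalone queries.
independent = monolith.split("\nUNION ALL\n")
print(len(independent))  # 13
```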
Why Snowflake couldn't fix this on its own
Snowflake has a genuinely impressive query optimizer. Skipping data it doesn't need, pushing filters down early, eliminating irrelevant chunks before they're read — it does a lot of clever work behind the scenes. But there are things it fundamentally will not do.
It will not rewrite your query structure.
If you hand it a sequential UNION ALL, it runs it sequentially. It's not going to look at 13 completely independent branches and decide to run them in parallel. That's not its job. Its job is to execute the query you gave it as efficiently as possible — not to question whether the query itself is the problem.
"Garbage in, garbage out applies to query design too. The optimizer can polish a bad query. It cannot reinvent it."
On top of that, the query had no help for the optimizer at all. No sorted identifiers to guide partition pruning. No pre-filtering to eliminate rows early. No hints. Just raw branches scanning everything, hoping for the best. The optimizer was doing its best with one hand tied behind its back.
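As a concrete, invented illustration of what "help" can look like: sort the identifiers you're probing for and bracket them with a min/max range, so Snowflake's per-partition metadata can skip everything outside it. Table and column names here are hypothetical.

```python
# Hypothetical audience identifiers to look up in one branch.
ids = [1042, 1003, 1017]
ids.sort()  # sorted identifiers keep the probe range narrow and contiguous

# Pre-filter with a BETWEEN bracket so partition pruning can discard
# micro-partitions whose id ranges fall outside it, then narrow to the
# exact set with IN.
optimised = (
    "SELECT audience_id, metric FROM source_a "
    f"WHERE audience_id BETWEEN {ids[0]} AND {ids[-1]} "
    f"AND audience_id IN ({', '.join(map(str, ids))})"
)
print(optimised)
```

The BETWEEN clause is redundant logically, but it gives the pruning machinery a single cheap range check instead of a set membership test per partition.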
What we actually did about it
The insight, once you see it, is embarrassingly obvious: these branches don't need each other. They are completely independent. Data source A doesn't care what data source B is doing. So why are they waiting in line like it's a Monday morning coffee queue?
The approach was straightforward:
1 Split the monolith
Decompose the single giant UNION ALL query into individual branch queries. Each data source gets its own query. They stop waiting for each other.
2 Optimise each branch
With each branch isolated, apply targeted optimisations: sorted identifiers to help Snowflake's pruning, pre-filtering to eliminate rows before the heavy lifting, and structure hints so the optimizer knows exactly where to look.
3 Run everything in parallel
Fire all branches simultaneously across available compute. Instead of 13 operations waiting in sequence, 13 operations running at the same time. Total runtime becomes whatever the slowest single branch takes — not the sum of all branches.
4 Collect and aggregate
Each branch writes its results to a temporary holding table (called a transient table in Snowflake). Once all branches complete, a final aggregation step stitches everything back together. Same output. Completely different path to get there.
This logic was packaged into a reusable program stored in the database (a stored procedure) — RUN_PARALLEL_COMPOSITION — that can take any query structured this way, decompose it automatically, execute in parallel, and aggregate the results. No manual rewriting. No application changes. Feed it a query, get back an optimised result.
The numbers
| Metric | Original Query | Optimised Query |
|---|---|---|
| Execution time | 41.9 minutes | 10.5 minutes |
| Speedup | — | 4x faster |
| Data scanned | 1,100 GB | 1,117 GB |
| Credits consumed | 11.2 credits | 2.8 credits |
| Estimated cost | ~$30 per run | ~$7.50 per run |
| Cost reduction | — | 75% lower |
Same warehouse. Same data. Same output. Just a different execution model.
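The headline figures reduce to two ratios, and a quick check confirms they hold:

```python
# Figures from the comparison table above.
original_minutes, optimised_minutes = 41.9, 10.5
original_cost, optimised_cost = 30.0, 7.50

speedup = original_minutes / optimised_minutes
saving = 1 - optimised_cost / original_cost

print(f"{speedup:.1f}x faster")  # 4.0x faster
print(f"{saving:.0%} cheaper")   # 75% cheaper
```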
"We didn't scan less data. We just stopped making the data wait in line."
The part where Claude did the heavy lifting
Here's the part I enjoy telling the most.
The query analysis, the bottleneck identification, the execution plan investigation, the prototype stored procedure, the parallel execution framework — all of it was built in collaboration with Claude. I didn't spend nights manually tracing execution plans. I didn't write the SQL from scratch. I described the problem, shared the structure, and worked through it iteratively with an AI that could read execution plans, suggest optimisation techniques, and write working implementations while I validated the approach.
What the human did
- Identified that cost was anomalous and worth investigating
- Understood the business context — what the query was for and why it mattered
- Validated that the parallel approach produced correct results
- Made the architectural call on execution model
- Decided when the solution was good enough to ship
What Claude did
- Analysed the query structure and identified the sequential bottleneck
- Proposed and tested optimisation techniques for each branch
- Built the RUN_PARALLEL_COMPOSITION stored procedure
- Ran experiments across multiple query variations
- Generated the performance comparison data
Not a single line of the stored procedure was written by hand. The entire prototype went from concept to working code in a single session. The gap between "I understand the problem" and "here is production-ready code" collapsed to almost nothing.
That's not AI replacing engineers. That's AI making engineers unreasonably effective.
What you can steal from this
- Look at your sequential UNION ALL queries. If the branches don't depend on each other, they have no business waiting for each other.
- Help Snowflake's optimizer. Sort your identifiers. Pre-filter early. Don't make it guess what you need.
- Think about parallelism at query design level — not just at the warehouse level. Scaling up the warehouse helps when many queries run at once. Parallel execution helps individual query performance. They're different tools.
- Use transient tables as checkpoints. Cheap to store (they skip Snowflake's Fail-safe retention), fast to write, and they make parallel aggregation trivial.
- Describe your problem to an AI before you start writing code. You might be surprised how fast you get to a working solution when you're not the one writing every line.
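The checkpoint pattern from the list above, sketched as generated statements. Table and branch names are invented; `CREATE TRANSIENT TABLE ... AS` is standard Snowflake syntax.

```python
# Hypothetical branch queries, one per data source.
branches = {
    "source_a": "SELECT audience_id, metric FROM source_a",
    "source_b": "SELECT audience_id, metric FROM source_b",
}

# One transient checkpoint table per branch: cheap to hold, fast to
# write, dropped once the run is done.
checkpoints = [
    f"CREATE TRANSIENT TABLE tmp_{name} AS {sql}"
    for name, sql in branches.items()
]

# The final aggregation stitches the checkpoints back together.
aggregate = "\nUNION ALL\n".join(
    f"SELECT * FROM tmp_{name}" for name in branches
)
print(aggregate)
```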
Case study by Gautam Marya.
Disclosure: This page was written with Claude (Anthropic). The analysis, the approach, and the architectural decisions are Gautam's. Claude built the implementation. That's the whole point.