The LLM Stack Nobody Talks About: The Real Infrastructure Behind Text-to-SQL Accuracy
I recently spoke with a couple of AI startups. One pitched a tool that turns plain English into SQL and then shows live data reports. It sounded promising. Honestly, it felt suspiciously clean, so I decided to pause and check it out myself before moving forward.
What I found wasn’t really about the model itself. It was about the infrastructure around it.
The Idea vs. The Actual Thing
The concept is simple: ask “What was our revenue last quarter?” and an LLM instantly generates the correct SQL. No analyst needed, no delays, no friction.
Yeah, well, didn't happen. At least, not quite like that.
LLMs are good at parsing natural language. SQL isn’t natural language. SQL is structured, relational, and highly specific. You can’t guess your way through table joins or column names, and you can’t infer domain-specific conventions from plain English. When I tried it myself, handing the model just our raw schema in the prompt, I got something like 2 out of 60 queries correct. That's roughly 3%. Not exactly reassuring.
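To make that concrete, the naive baseline amounted to something like the sketch below: paste the raw DDL into the prompt and hope. The schema and prompt wording here are hypothetical, invented for illustration rather than taken from the actual tool.

```python
# Naive baseline: the model sees nothing but raw DDL and the question.
# Hypothetical schema, purely for illustration.
RAW_SCHEMA = """
CREATE TABLE orders (id INT, customer_id INT, total NUMERIC, created_at DATE);
CREATE TABLE customers (id INT, name TEXT, region TEXT);
"""

def build_naive_prompt(question: str) -> str:
    # No past queries, no conventions, no docs -- just schema text.
    return (
        "You are a SQL assistant. Given this schema:\n"
        f"{RAW_SCHEMA}\n"
        f"Write a SQL query that answers: {question}"
    )

print(build_naive_prompt("What was our revenue last quarter?"))
```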
The issue isn't a flaw in the model itself; the model simply lacked critical context. Getting accurate SQL isn't a matter of pointing a chatbot at a database and letting it generate queries from thin air. You need to build a careful ecosystem around the model that feeds it schema details, previous queries, and documentation before it can yield accurate results.
Why LLMs Struggle with SQL
The gap becomes clear once you look at what these models were trained on: books, articles, web text. Notice the lack of "your company's SQL schema" anywhere in there. LLMs don’t know your internal column names, foreign key complexities, or how you specifically calculate revenue. So the model ends up hallucinating columns and breaking joins, confidently guessing in the dark.
Getting good SQL out of a model requires two things:
- Schema structure (including column names, relationships, and table semantics)
- Practical query examples (real world usage patterns and conventions)
LLMs don’t have either. If you give the model just your schema definitions and no usage examples, it's essentially guessing zero-shot, and the accuracy shows it.
Retrieval is Everything (and Examples Too)
When I carefully experimented, it became obvious:
- Raw schema only: ~3% accuracy.
- Add a few static example queries: jumps to maybe 30–40%, but things break down pretty fast.
- Add retrieval, pulling in genuinely relevant queries and docs from past usage: accuracy suddenly shoots up to around 80%+.
Context wins. Every time. But you need the right kind of context. Good retrieval isn't about dumping static examples into a prompt. It's about carefully searching a vector store for genuinely relevant historical SQL queries and metadata, such as real usage patterns, column explanations, and practical join logic. More queries (high quality, actually executed, known working) means better retrieval.
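To give a feel for what that vector store actually holds, here's a minimal sketch: each entry pairs a known-working SQL query with a plain-language description and join/column notes, embedded for similarity search. The bag-of-words embedding below is a toy stand-in for a real embedding model and vector database, and the example entry is hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class ContextDoc:
    description: str   # plain-language question this query answered
    sql: str           # the known-working SQL that answered it
    notes: str = ""    # column meanings, join logic, naming conventions

def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words 'embedding' -- a stand-in for a real embedding model."""
    vec: dict[str, float] = {}
    for token in text.lower().split():
        vec[token] = vec.get(token, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# The "vector store": only queries that actually executed and were verified correct.
DOCS = [
    ContextDoc(
        description="total revenue by quarter",
        sql="SELECT date_trunc('quarter', created_at) AS q, SUM(total) FROM orders GROUP BY 1;",
        notes="revenue means SUM(orders.total); refunds are excluded upstream",
    ),
    # ... hundreds more known-good queries, schema notes, join explanations
]
INDEX = [(embed(d.description + " " + d.notes), d) for d in DOCS]

def retrieve(question: str, k: int = 3) -> list[ContextDoc]:
    q_vec = embed(question)
    ranked = sorted(INDEX, key=lambda pair: cosine(q_vec, pair[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]
```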
Here's roughly what worked best (a code sketch follows the list):
- Embed the user’s natural language question.
- Retrieve useful past questions, schema info, internal docs from a vector DB.
- Build a targeted prompt by carefully combining these retrieved fragments.
- Generate SQL with the LLM.
- Execute the SQL.
- Check if the result makes sense (aka validation).
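Stitched together, the loop looks roughly like this. Everything here is a sketch: `retrieve()` stands in for a similarity search like the one sketched earlier, `call_llm()` is whatever completion client you use, and the validation step is the thin, partly-manual part discussed in the next section.

```python
import sqlite3

def retrieve(question: str, k: int = 3) -> list[str]:
    """Similarity search over embedded past queries and notes (see the earlier sketch)."""
    raise NotImplementedError("plug in your vector store here")

def call_llm(prompt: str) -> str:
    """Placeholder for whatever chat-completion client you use."""
    raise NotImplementedError("plug in your LLM client here")

def build_prompt(question: str, context_snippets: list[str], schema_ddl: str) -> str:
    context = "\n\n".join(context_snippets)
    return (
        "Schema:\n" + schema_ddl + "\n\n"
        "Relevant past queries and notes:\n" + context + "\n\n"
        "Write one SQL query that answers: " + question
    )

def answer(question: str, conn: sqlite3.Connection, schema_ddl: str):
    # 1-2. Embed the question and pull relevant context from the vector store.
    snippets = retrieve(question)
    # 3-4. Assemble a targeted prompt and generate the SQL.
    sql = call_llm(build_prompt(question, snippets, schema_ddl))
    # 5. Execute against a read-only, sandboxed connection.
    rows = conn.execute(sql).fetchall()
    # 6. Validate: today this is often a human eyeballing the result;
    #    automated sanity checks are where this needs to go.
    return sql, rows
```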
This isn't the standard RAG you’ve probably come across, where retrieval hunts for answers hidden inside stored sentences. Here, retrieval guides the model toward concrete, historically correct usage patterns. By embedding past SQL statements, explanations of schema fields, and join logic notes (metadata, not the data itself), we show the LLM how similar queries have been handled successfully before.
Validation is Brittle and Needs Automation
One important step I want to highlight is validation. Right now many teams validate the generated SQL manually, which isn't great and doesn't scale. Just because SQL runs doesn't mean the results are sensible. We've seen plenty of smoothly executed queries quietly joining the wrong column and returning gibberish. Validation must catch that.
Ideally, validation becomes fully automated, but honestly that's not trivial. False positives are problematic, and complexity grows quickly. You can't build a product on manual validation forever, so solving automated validation remains a necessary (and tough) goal.
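As a hedged sketch of where automated validation could start: cheap sanity checks on the result set, which catch obvious nonsense without proving the SQL is right. The specific checks and thresholds below are illustrative assumptions, not something any particular team ships.

```python
def sanity_check(rows: list[tuple], expected_min_rows: int = 1) -> list[str]:
    """Cheap, automatable checks on a query result. Passing them does not
    prove the SQL is correct -- it only flags the obviously broken cases."""
    problems = []
    if len(rows) < expected_min_rows:
        problems.append("empty or near-empty result for a question that should have data")
    # A column that is entirely NULL usually means a join matched nothing.
    if rows and any(all(r[i] is None for r in rows) for i in range(len(rows[0]))):
        problems.append("a column is entirely NULL -- likely a broken join")
    # Fully duplicated rows often signal a fan-out from joining on the wrong key.
    if rows and len(rows) != len(set(rows)):
        problems.append("duplicate rows -- possible fan-out join")
    return problems
```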
What’s Actually Under the Hood
When someone says their system gets 80% accuracy, they’re not merely “using GPT.” What’s running under the surface is usually something like this:
- Retrieval services surgically finding exactly the right pieces of context
- Prompt-building systems meticulously assembling just the right inputs for the models
- Secure and monitored SQL execution engines
- Result validation layers ensuring sanity (currently often manual, but inevitably automated soon)
- Schema ingestion pipelines continually keeping contexts fresh
- Telemetry and analytics tooling tracking performance metrics such as correctness, latency, and errors.
The model doesn't create magic alone. It's sitting quietly at the center, waiting on context from all these moving parts around it.
Open Problems (and Annoying Edge Cases)
The current setup scales well to medium-complexity databases. But we're not done. A few open problems worth thinking about:
- Can accuracy reliably approach 100% without expensive fine tuning?
- How do we handle multi-step queries or ambiguous requests gracefully?
- How do we reliably automate validation without false positives?
- Should retrieval/context modules be standardized or plug-and-play?
And let's clearly acknowledge the pain points around query complexity: complex joins and large schemas with subtle foreign keys remain annoyingly hard. Retrieval alone doesn't really solve them, since the retrieved pieces often don't fit together neatly or the query demands stronger reasoning.
Ideas for Actually Improving Accuracy
I've been thinking more about how to concretely boost accuracy, and realized the earlier attempt was too abstract. Here's a rougher but clearer mental model of what makes sense here:
How to actually compare retrieval methods
A good approach would be to measure retrieval methods on full end-to-end SQL accuracy against a clean, held-out query set. Comparing retrieval quality by embedding cosine similarities alone doesn't really work. Yes, you still use those similarity scores internally to pull relevant examples in, but does retrieval A beat retrieval B because its cosine distances "look better"? You can't tell. The final SQL accuracy is what actually matters, not embedding scores in isolation.
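Concretely, a comparison harness could look something like the sketch below: hold out real questions paired with known-good SQL, run each retrieval variant through the full pipeline, and score only the end result. The `run_pipeline` hook and the exact-match scoring are assumptions about your own stack, not a specific tool.

```python
from collections import Counter
from typing import Callable
import sqlite3

def evaluate_retrieval(
    run_pipeline: Callable[[str], str],   # question -> generated SQL, using one retrieval variant
    held_out: list[tuple[str, str]],      # (question, known-good SQL) pairs kept out of the index
    conn: sqlite3.Connection,
) -> float:
    """End-to-end accuracy: did the generated SQL return the same rows as the gold SQL?"""
    correct = 0
    for question, gold_sql in held_out:
        try:
            got = conn.execute(run_pipeline(question)).fetchall()
            want = conn.execute(gold_sql).fetchall()
            correct += Counter(got) == Counter(want)   # order-insensitive exact match
        except Exception:
            pass  # SQL that fails to run counts as wrong
    return correct / len(held_out)

# evaluate_retrieval(pipeline_with_retrieval_a, held_out, conn)
# evaluate_retrieval(pipeline_with_retrieval_b, held_out, conn)
# Compare those two numbers, not the retrievers' cosine similarity scores.
```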
Prompting and fine tuning: what's actually worth trying?
Fine-tuning directly on database-specific usage might give accuracy a boost. That said, some evidence suggests retrieval-augmented prompting can even outperform fine-tuning, at least for SQL generation. It's unclear how the costs compare, and we don't have exact overhead numbers, but RAG-style methods (heavy prompting with carefully chosen examples) look promising and deserve exploration.
Either way, exploring both options, and combinations of the two, makes sense. There's clearly room to experiment more with these different context delivery strategies.
Getting serious about benchmarking
Systematic benchmarking is important, but to be fair, some folks have already set up pretty structured evaluations. They've run hundreds of trials across carefully chosen datasets, test questions, and various retrieval methods. So it's not exactly the Wild West; there's clearly systematic evaluation happening.
But even those structured setups note that validation often involves "shades of grey" and still requires manual checks. Automation might improve that situation, and it seems reasonable to think about creating clearer evaluation scripts or maybe even a dedicated benchmarking test app. Something like this would likely help teams consistently test retrieval strategies, measure their full pipeline accuracy cleanly, and reduce the manual effort currently involved.
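One small piece that could chip away at the manual checking is a result comparator that tolerates harmless differences (row order, float rounding) instead of demanding byte-identical output. The normalization and tolerance below are arbitrary illustrative assumptions, and the genuinely grey cases would still need a human.

```python
def normalize_row(row: tuple, float_places: int = 2) -> tuple:
    """Round floats so cosmetic numeric differences don't fail the comparison."""
    return tuple(
        round(value, float_places) if isinstance(value, float) else value
        for value in row
    )

def results_equivalent(got: list[tuple], want: list[tuple]) -> bool:
    """Order-insensitive comparison with light normalization.
    Deliberately forgiving; ambiguous 'shades of grey' still go to a person."""
    norm_got = sorted((normalize_row(r) for r in got), key=repr)
    norm_want = sorted((normalize_row(r) for r in want), key=repr)
    return norm_got == norm_want
```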
Conclusion
Text-to-SQL only works because of the stack around the model. The model is blind without context. The actual system involves embeddings, retrieval, prompt construction, secure execution, validation, and telemetry. That’s what makes it usable. If the demo looks effortlessly clean, that's because someone built this entire layer underneath. The model is only one part of the equation. The rest of the stack is the real reason text-to-SQL ever works.