From Tokens to Tools: Where LLMs Go Next
For a while now, progress in language models has meant one thing: stack more layers, train on more tokens, use more GPUs. The results have been impressive. We’ve seen better fluency, longer coherence, and stronger performance across benchmarks. But if you squint, you start to see the cracks. Models still hallucinate. They forget what just happened. They struggle with multistep tasks, or anything that involves real memory, planning, or control.
It’s starting to feel like the “next token prediction” framing is holding us back. It got us here. It won’t get us much further.
Models Don’t Use Tools. Humans Do.
When people work, they don’t keep everything in their head. You don’t memorize a spreadsheet; you open one. You don’t remember every function signature; your IDE handles that. Real work happens through tools: text editors, design programs, databases, search engines. These tools hold state. They let us see, edit, rewind, debug.
Most LLMs today don’t really use tools. They just output text, one token at a time, into the void. That’s fine for writing a poem. It breaks down when you need to coordinate steps, track variables, or build anything with internal structure.
A more promising approach: instead of directly generating final text, the model operates by issuing structured commands to external tools such as APIs, browsers, file systems, simulators, and robots, and receives structured responses in return. This lets the model work with real, high-resolution state instead of forcing everything through token-by-token output.
This creates a loop. The model sends a command, sees what happens, sends another. Think REPL, not monologue. The task becomes less about fluent generation and more about making decisions in a tight feedback loop. That’s a much closer match to how humans get real work done.
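Here’s a minimal sketch of what that loop could look like in Python. Everything in it is illustrative: `call_model` stands in for whatever LLM you’re driving, and the `TOOLS` registry stands in for real APIs, browsers, and file systems.

```python
import json

# Toy tool registry. In a real system these would be APIs, browsers,
# file systems, simulators, robots, etc.
TOOLS = {
    "add": lambda args: {"result": args["a"] + args["b"]},
    "read_file": lambda args: {"result": open(args["path"]).read()},
}

def execute_tool(command):
    """Dispatch a structured command to a tool and return a structured response."""
    tool = TOOLS.get(command.get("tool"))
    if tool is None:
        return {"error": f"unknown tool: {command.get('tool')}"}
    try:
        return tool(command.get("args", {}))
    except Exception as exc:
        return {"error": str(exc)}

def tool_loop(call_model, task, max_steps=50):
    """REPL-style loop: the model issues commands, the environment answers.

    `call_model` is a hypothetical stand-in for an LLM that maps the
    interaction history to the next structured command."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        command = call_model(history)            # e.g. {"tool": "add", "args": {"a": 2, "b": 3}}
        history.append({"role": "model", "content": json.dumps(command)})
        if command.get("tool") == "finish":      # the model decides it's done
            return command.get("args", {}).get("answer")
        observation = execute_tool(command)      # structured, inspectable response
        history.append({"role": "tool", "content": json.dumps(observation)})
    return None                                  # step budget exhausted
```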
Tool Loops Need Memory. And Memory Is Expensive.
If the model is working across many steps, it needs to remember everything that happened: each call, each response, each correction. Suddenly, you’re asking it to hold an entire workflow in memory. Standard Transformer architectures scale poorly here due to the O(N²) cost of self-attention relative to sequence length N. Double the context, quadruple the cost. That doesn’t fly when you're looping through hundreds or thousands of steps.
To make tool-driven workflows viable, we need architectures whose compute grows linearly, O(N), with context length. Not just because it's elegant, but because it’s necessary. Otherwise, tool use becomes too expensive to scale.
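A quick back-of-the-envelope comparison makes the gap concrete. The numbers below are illustrative units (pairwise token comparisons), not real FLOPs:

```python
def attention_cost(n):
    """Full self-attention compares every token with every other token: O(N^2)."""
    return n * n

def linear_cost(n):
    """A linear-time layer touches each token a constant number of times: O(N)."""
    return n

for n in (1_000, 2_000, 4_000, 8_000):
    print(f"N={n:>5}: attention ~{attention_cost(n):>12,}  linear ~{linear_cost(n):>6,}")
# Doubling the context doubles the linear cost but quadruples the attention cost.
```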
State-space models like Mamba look a little more promising here. They're built to process long sequences more efficiently. Maybe they’re not the final answer, but I think they’re a step in the right direction. Sparse attention is another angle, because not everything needs high precision. Sometimes it's fine to scan most of the context lightly while zooming in on the key parts.
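To make the state-space idea concrete, here’s a rough NumPy sketch of the recurrence at the heart of SSM layers. It’s the textbook linear recurrence, not Mamba itself (which adds input-dependent parameters and a hardware-aware scan), but it shows why the cost stays linear: each step touches only a fixed-size hidden state, never the whole history.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # one constant-cost update per token -> O(N) overall
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

# Toy usage: a scalar input sequence and a 4-dimensional hidden state.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)               # stable state transition
B = rng.normal(size=4)
C = rng.normal(size=4)
y = ssm_scan(rng.normal(size=128), A, B, C)   # 128 outputs, one pass, no N x N matrix
```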
The goal is simple: make it cheap for models to remember what matters over long horizons.
Next-Token Training Doesn’t Teach Tool Use
If the model’s job is no longer to predict the next token, then next-token training becomes a mismatch. Tool use is structured. It's about sequences of actions, not sequences of tokens. Think: edit a line of code, rerun a script, observe the output, branch the logic.
So we need training data that reflects this structure. Logs from real user interactions: programming sessions, robot commands, spreadsheet macros, UI traces. These are the traces of people thinking through tools. They're messy, they're structured, and they’re exactly what a tool using model should learn from.
Structured outputs have another benefit: they’re easier to audit. If the model fails, you don’t have to interpret a fog of tokens. You can inspect the call, the parameters, the response. The whole process is legible, which makes debugging and safety auditing dramatically easier.
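As a concrete example, here’s what such a trace could look like as structured records. The schema is hypothetical, but it shows why this kind of data is both useful training signal and easy to audit: every step names the tool, the parameters, and the response.

```python
import json

# Hypothetical trace of a programming session: each record is one action
# taken through a tool, plus the structured response that came back.
trace = [
    {"step": 1, "tool": "editor", "action": "replace_line",
     "params": {"file": "train.py", "line": 42, "text": "lr = 3e-4"},
     "response": {"ok": True}},
    {"step": 2, "tool": "shell", "action": "run",
     "params": {"cmd": "python train.py --dry-run"},
     "response": {"exit_code": 1, "stderr": "NameError: name 'scheduler' is not defined"}},
    {"step": 3, "tool": "editor", "action": "insert_line",
     "params": {"file": "train.py", "line": 40, "text": "scheduler = None"},
     "response": {"ok": True}},
]

# Because every step is structured, failures are easy to localize:
for record in trace:
    if record["response"].get("exit_code", 0) != 0:
        print(f"step {record['step']} failed:", json.dumps(record["response"]))
```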
How to Validate Real World Performance
- Phase 1 – Compress. Take your best model. Shrink it without killing performance. This gives you something fast and cheap to iterate with.
- Phase 2 – Stress Test. Now plug in the experimental bits: new layers, tool interfaces, training data. Run it in systems that simulate deployment conditions. Look for bugs, interactions, bottlenecks.
- Phase 3 – Scale. Once you’ve seen what works, scale up. Train the big version. Deploy with confidence. You’ve already seen how the parts behave under pressure.
This loop ensures you don’t gamble large compute budgets on ideas that haven’t been tested under load.
Where This All Leads
Transformer scaling has taken us far, but it’s nearing the end of its solo act. The future will be less about raw generation and more about interaction, with memory, with state, with tools, with systems.
We’re moving toward agentic models. Ones that plan. Ones that revise. Ones that operate more like collaborators than autocomplete engines.
The future isn’t just about bigger models. It’s about models that can work. That can think over time. That can act through tools. We’re not done scaling. We’re just starting to scale the right things.
If you're only scaling parameters but not changing structure, you're already behind.