Engineering · April 17, 2026 · 2 min read

Making 4B models feel like 400B on your laptop

Three moves that matter: expert iteration, selective generation, and tight tool-use shaping. How we get to 94.9%.

By Yiming Beckmann

The first question we get from engineers is always the same: "How does a 4B model compete with frontier?"

The honest answer is that it competes narrowly. On broad-world knowledge, long-horizon planning, or novel reasoning, a 4B is not a trillion-parameter system and it would be dishonest to claim otherwise. But on the task class we actually care about — personal agency on your own data, on your own machine — a carefully trained 4B gets you 94.9% of the way, and the remaining 5.1% is mostly about knowing when to abstain.

This post walks through the three recipe decisions that mattered most.

1. Expert iteration

The single largest gain we measured came from expert iteration. For 11 generations, we had the model generate task trajectories, scored them with a 70B critic, selected the top decile, and added those to the training set. The key insight was that the critic does not need to be capable of solving the task itself — it only needs to be able to recognize a good solution when it sees one. Recognition is easier than generation, and a 70B recognizer can reliably surface the best outputs of a 4B generator.
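To make the loop concrete, here is a minimal sketch of one generation in Python. The model, critic, and task objects and their methods are hypothetical stand-ins, not our actual training stack.

```python
# One generation of expert iteration, as a sketch. model.sample_trajectory()
# and critic.score() are hypothetical helpers standing in for the real pipeline.

def expert_iteration_step(model, critic, tasks, samples_per_task=8, keep_frac=0.1):
    scored = []
    for task in tasks:
        for _ in range(samples_per_task):
            trajectory = model.sample_trajectory(task)   # 4B generator proposes
            score = critic.score(task, trajectory)       # 70B critic only ranks
            scored.append((score, task, trajectory))

    # Recognition is easier than generation: the critic never solves the task,
    # it only surfaces the top decile of what the 4B produced.
    scored.sort(key=lambda item: item[0], reverse=True)
    kept = scored[: max(1, int(len(scored) * keep_frac))]

    # These pairs become supervised examples for the next generation.
    return [(task, trajectory) for _, task, trajectory in kept]
```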

Naive supervised fine-tuning plateaus around 86% on our benchmark. After 11 generations of expert iteration the same model reaches 94.9% — an 8.7pp lift from compute we spent on selection, not on scale.

2. Selective generation

The second move was teaching the model when not to try. We trained a small head on top of the base that produces a calibrated confidence band. Below threshold, the agent either asks a clarifying question or hands off. Above threshold, it commits.
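The dispatch itself is a few lines. This sketch assumes a hypothetical confidence_head() that returns a calibrated score; the 0.8 cutoff is illustrative, not our production threshold.

```python
# Selective generation: commit only above a calibrated confidence threshold.
# confidence_head() and the 0.8 cutoff are illustrative placeholders.

def act(agent, request, threshold=0.8):
    plan = agent.propose(request)
    confidence = agent.confidence_head(request, plan)    # calibrated, in [0, 1]

    if confidence >= threshold:
        return agent.execute(plan)                        # commit
    if agent.can_clarify(request):
        return agent.ask_clarifying_question(request)     # abstain by asking
    return agent.hand_off(request)                        # abstain by escalating
```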

This is counterintuitive. "Abstain more" sounds like "do less work." In practice, users strongly prefer an agent that completes 94.9% of tasks and fails cleanly on the rest to one that completes 99% but fails silently on what it misses. Wrong answers erode trust faster than slow ones.

3. Tight tool-use shaping

The third move was explicit training on tool-call structure. Our JSON function-calling reliability — the rate at which the model produces syntactically valid tool invocations with the right argument shape — rose from 94.3% to 99.1% after a focused SFT pass on synthetic tool-call traces.
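For concreteness, this is the kind of check we mean by "syntactically valid with the right argument shape." The search_files tool and its schema here are invented examples, not our actual tool set.

```python
import json

# A made-up tool schema; "search_files" and its arguments are illustrative.
TOOL_SCHEMAS = {
    "search_files": {"query": str, "max_results": int},
}

def is_valid_call(raw: str) -> bool:
    """Valid = parseable JSON, known tool, required args present, right types."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    required = TOOL_SCHEMAS.get(call.get("tool"))
    if required is None:
        return False
    args = call.get("arguments", {})
    return all(
        name in args and isinstance(args[name], expected)
        for name, expected in required.items()
    )

print(is_valid_call(
    '{"tool": "search_files", "arguments": {"query": "tax PDF", "max_results": 5}}'
))  # True
```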

99.1% sounds like a small improvement. It is not. The failure mode of an agent is compounding: a single invalid tool call cascades into a confused plan, which cascades into wasted steps, which cascades into abandoned tasks. Raising reliability from "usually works" to "nearly always works" is what turns the product from "fun demo" into "part of my day."

Takeaway

Parameter count is a lagging indicator. What matters in 2026 is how you spend compute, what you teach the model to abstain on, and how reliable the seams between model and tool are. Do those three things well and your 4B feels like frontier, with none of the cloud tax.