Against Scale
Expert Iteration Beats Frontier Compute on Bounded Agent Tasks
Abstract
The dominant thesis of the last five years is that capability scales with compute, parameters, and data — a thesis that justifies an industry of billion-dollar training runs. On bounded task distributions — the kind a personal agent actually faces — this thesis is wrong. Expert iteration on a 4B base, run on a single MacBook over a weekend, matches frontier cloud models that are 50–400× its size. We ran the experiment. Three cycles, ~3 hours each, $0 in cloud compute, 15.3% → 94.9% on a real-world browser agent benchmark. This paper makes the position explicit: for bounded agency, scale is not the moat. Trajectories are.
The scaling orthodoxy
The scaling narrative rests on three justifications: emergence (capabilities appear only at scale), sample efficiency (larger models need fewer examples per skill), and generality (one huge model solves everything). Each is defensible in the open-ended regime. None of them survive the reframe from "capabilities in principle" to "capabilities this user actually needs this week."
The experiment
Setup: a single Apple M5 Max MacBook. Base model: Gemma-4 4B. Cloud compute budget: $0. Benchmark: 59 real end-to-end browser tasks drawn from consented user traces (not synthetic) plus a WebArena-lite subset of 35 tasks for external comparability.
Expert iteration method
Each cycle is a ReST-style loop:

1. Roll out: the current model attempts all 59 tasks; trajectories are recorded.
2. Filter wins: only trajectories that succeeded are kept. No human labels and no reward model: success is defined by the task's own terminal condition.
3. Fine-tune: SFT on the wins, mixed with baseline data at 3:1 for tool use and 2:1 for knowledge. LoRA rank 16, α=32, 800 iterations, LR 1e-5. ~2–3 hours per cycle.
4. Fuse + evaluate: the LoRA is fused into the base; if metrics improve, the fused model becomes the next base.
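The four steps above can be sketched as a short Python skeleton. This is an illustrative sketch, not the authors' harness: the `model` callable, the task dicts with a `check` terminal condition, and the `finetune`/`evaluate` hooks are hypothetical stand-ins, and the mix ratio is interpreted here as wins-to-baseline.

```python
import random

def rollout(model, tasks):
    """Step 1: attempt every task once; record (task, trajectory, success)."""
    results = []
    for task in tasks:
        traj = model(task)                       # one agent run -> trajectory
        results.append((task, traj, task["check"](traj)))
    return results

def filter_wins(results):
    """Step 2: keep only successful trajectories. No labels, no reward
    model: success is the task's own terminal condition via `check`."""
    return [traj for (_, traj, ok) in results if ok]

def build_sft_mix(wins, baseline, ratio=3):
    """Step 3 (data side): mix wins with baseline data, read here as
    ratio:1 wins-to-baseline (3:1 for tool use, 2:1 for knowledge)."""
    n_base = min(len(baseline), max(1, len(wins) // ratio))
    return list(wins) + random.sample(baseline, n_base)

def run_cycle(model, tasks, baseline, finetune, evaluate):
    """One full cycle: rollout, filter, fine-tune on the mix, and (step 4)
    keep the candidate only if the evaluation metric improves."""
    wins = filter_wins(rollout(model, tasks))
    candidate = finetune(model, build_sft_mix(wins, baseline))
    return candidate if evaluate(candidate) > evaluate(model) else model
```

In a real run, `finetune` would wrap a LoRA SFT job (rank 16, α=32 per the recipe above) followed by fusing the adapter into the base, and `evaluate` would replay the benchmark.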
Results progression
Base: 15.3%. Cycle 1 (3 h): 64.4%. Cycle 2 (6 h total): 86.4%. Cycle 3 (9 h total): 94.9%. The final model matches or beats every public frontier model we tested, including 175B-class and 70B-class systems, on the same benchmark, despite having only 4B parameters and being trained on a laptop.
Parameters × performance
Our final model has roughly 450× fewer parameters than GPT-4-class systems yet matches their 93% on this benchmark. Against 70B-class open-weight systems at roughly 72%, it wins outright. On open-ended generation or long-horizon research the picture remains different; here the task distribution is bounded, and that changes the economics.
Against scale — three claims
- Narrow: small model + expert iteration dominates frontier + prompting on bounded agent tasks.
- Actionable: the recipe (4B base, own scoring function, a laptop) is within reach of any practitioner; no billion-dollar run is required.
- Strong: the marginal return on scale for bounded agency is effectively zero.

The moat is not parameters. The moat is the trajectory pipeline the user or operator controls.
What this does not claim
We are not claiming scale is never useful. Open-ended generation, long-horizon reasoning, and tasks that require extrapolation beyond the user's corpus still benefit meaningfully from scale. The claim is narrower: on the distribution of tasks a personal agent actually encounters, scale is a lagging indicator.
What this implies
The interesting economic asset stops being the frontier training run and starts being the tight loop between a real user, a scoring function, and a small base running on that user's own hardware. That is an asset any operator can build — and it compounds on every laptop it runs on.
Results
Expert-iteration cycles on a single Apple M5 Max MacBook, Gemma-4 4B.
| Cycle | E2E success (59 tasks) | WebArena-lite success (35 tasks) | Wall-clock | Cloud cost |
|---|---|---|---|---|
| Base (Gemma-4 4B) | 15.3% | — | 0 h | $0 |
| Cycle 1 | 64.4% | 71.4% | ~3 h | $0 |
| Cycle 2 | 86.4% | 85.7% | ~6 h | $0 |
| Cycle 3 | 94.9% | 91.4% | ~9 h | $0 |
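As a sanity check on the table, every reported rate round-trips to a whole number of task wins. The win counts below are back-derived from the percentages; they are our inference, not figures stated in the paper.

```python
# Back-derived win counts: each reported rate should equal
# round(100 * wins / total, 1) for an integer number of wins.
e2e = {15.3: 9, 64.4: 38, 86.4: 51, 94.9: 56}   # wins out of 59 tasks
wa  = {71.4: 25, 85.7: 30, 91.4: 32}            # wins out of 35 tasks

for pct, wins in e2e.items():
    assert round(100 * wins / 59, 1) == pct
for pct, wins in wa.items():
    assert round(100 * wins / 35, 1) == pct
```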
Cite
@misc{mingllm2026againstscale,
  title={Against Scale: Expert Iteration Beats Frontier Compute on Bounded Agent Tasks},
  author={MingLLM Research},
  year={2026},
  url={https://mingllm.com/papers/against-scale}
}