Position · 2026 · MingLLM Research

Against Scale

Expert Iteration Beats Frontier Compute on Bounded Agent Tasks

Abstract

The dominant thesis of the last five years is that capability scales with compute, parameters, and data — a thesis that justifies an industry of billion-dollar training runs. On bounded task distributions — the kind a personal agent actually faces — this thesis is wrong. Expert iteration on a 4B base, run on a single MacBook over a weekend, matches frontier cloud models that are 50–400× its size. We ran the experiment. Three cycles, ~3 hours each, $0 in cloud compute, 15.3% → 94.9% on a real-world browser agent benchmark. This paper makes the position explicit: for bounded agency, scale is not the moat. Trajectories are.

94.9%: final E2E score (59 tasks)
15.3%: starting score
$0: cloud cost
450×: fewer parameters vs GPT-4 class

The scaling orthodoxy

The scaling narrative rests on three justifications: emergence (capabilities appear only at scale), sample efficiency (larger models need fewer examples per skill), and generality (one huge model solves everything). Each is defensible in the open-ended regime. None of them survives the reframe from "capabilities in principle" to "capabilities this user actually needs this week."

The experiment

Setup: a single Apple M5 Max MacBook. Base model: Gemma-4 4B. Cloud compute budget: $0. Benchmark: 59 real end-to-end browser tasks drawn from consented user traces (not synthetic) plus a WebArena-lite subset of 35 tasks for external comparability.

Expert iteration method

Each cycle is a ReST-style loop. (1) Roll out: the current model attempts all 59 tasks; trajectories are recorded. (2) Filter wins: only trajectories that succeeded are kept — no human labels, no reward model, success is defined by the task's own terminal condition. (3) Fine-tune: SFT on wins, mixed with baseline at 3:1 for tool use and 2:1 for knowledge. LoRA rank 16, α=32, 800 iterations, LR 1e-5. ~2–3 hours per cycle. (4) Fuse + evaluate: the LoRA is fused into the base; if metrics improve, it becomes the next base.
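The win-filter and data-mixing steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: trajectories are assumed to be dicts carrying a boolean "success" set by the task's own terminal condition, and the function names (`filter_wins`, `mix_sft_data`) are hypothetical. The ratios follow the text: wins mixed with baseline data at 3:1 for tool use and 2:1 for knowledge.

```python
import random

def filter_wins(trajectories):
    # (2) Keep only trajectories that reached the task's own terminal
    # success condition -- no human labels, no learned reward model.
    return [t for t in trajectories if t["success"]]

def mix_sft_data(wins, tool_baseline, knowledge_baseline, seed=0):
    # (3) Build the SFT set: for every 3 win examples, mix in 1 tool-use
    # baseline example; for every 2 wins, 1 knowledge baseline example.
    rng = random.Random(seed)
    n = len(wins)
    batch = list(wins)
    batch += rng.sample(tool_baseline, min(len(tool_baseline), n // 3))
    batch += rng.sample(knowledge_baseline, min(len(knowledge_baseline), n // 2))
    rng.shuffle(batch)
    return batch

# Toy example: 12 trajectories, half successful.
trajs = [{"id": i, "success": i % 2 == 0} for i in range(12)]
wins = filter_wins(trajs)  # 6 successful trajectories survive the filter
sft = mix_sft_data(
    wins,
    tool_baseline=[{"id": f"t{i}"} for i in range(10)],
    knowledge_baseline=[{"id": f"k{i}"} for i in range(10)],
)
print(len(wins), len(sft))  # 6 wins -> 6 + 2 tool + 3 knowledge = 11
```

The fine-tune itself (LoRA rank 16, α=32, 800 iterations, LR 1e-5) and the fuse-and-gate step would sit downstream of this mixing; the sketch only covers the data side of the loop.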

Results progression

Base: 15.3%. Cycle 1 (3h): 64.4%. Cycle 2 (6h total): 86.4%. Cycle 3 (9h total): 94.9%. The final model matches or beats every public frontier model we tested — including 175B-class and 70B-class systems — on the same benchmark, despite being 4B parameters trained on a laptop.

Parameters × performance

Our final model sits at ~450× fewer parameters than GPT-4 class and matches its performance on this benchmark (94.9% vs 93%). Against 70B-class open-weights systems at roughly 72%, it is an outright win. On open-ended generation or long-horizon research the picture remains different; here the task distribution is bounded, and that changes the economics.

Against scale — three claims

Narrow: small model + expert iteration dominates frontier + prompting on bounded agent tasks. Actionable: the recipe (4B base, own scoring function, a laptop) is within reach of any practitioner; no billion-dollar run is required. Strong: the marginal return on scale for bounded agency is effectively zero. The moat is not parameters. The moat is the trajectory pipeline the user or operator controls.

What this does not claim

We are not claiming scale is never useful. Open-ended generation, long-horizon reasoning, and tasks that require extrapolation beyond the user's corpus still benefit meaningfully from scale. The claim is narrower: on the distribution of tasks a personal agent actually encounters, scale is a lagging indicator.

What this implies

The interesting economic asset stops being the frontier training run and starts being the tight loop between a real user, a scoring function, and a small base running on that user's own hardware. That is an asset any operator can build — and it compounds on every laptop it runs on.

Results

Expert-iteration cycles on a single Apple M5 Max MacBook, Gemma-4 4B.

Cycle               E2E (59)   WebArena-lite (35)   Wall-clock   Cloud $
Base (Gemma-4 4B)   15.3%      n/a                  0 h          $0
Cycle 1             64.4%      71.4%                ~3 h         $0
Cycle 2             86.4%      85.7%                ~6 h         $0
Cycle 3             94.9%      91.4%                ~9 h         $0

Cite

@misc{mingllm2026againstscale,
  title={Against Scale: Expert Iteration Beats Frontier Compute on Bounded Agent Tasks},
  author={MingLLM Research},
  year={2026},
  url={https://mingllm.com/papers/against-scale}
}