One Model, Many Surfaces
A Unified Base for Voice, Web, and Code Agents
Abstract
We argue that the right factorization for a personal agent is one base model, per-surface adapters — not one agent per product. A single 4B base, `saturday-4b`, drives all three MingLLM surfaces: voice (Jarvis), web (Tensor), and code (Tensor Code). Surface specialization comes from (a) small LoRA adapters trained per surface, (b) a routing head that dispatches to the right adapter based on the user's current surface and the nature of the request, and (c) a shared memory backbone (see The Orb Knows). This keeps capability accretive: a skill learned in one surface lifts the others, memory is common, and inference cost stays fixed at 4B.
Why one base
Three benefits fall out naturally from the unified factorization. First, shared context: if the agent speaks with you and then helps you browse, the conversation doesn't restart; it's the same model with the same memory picking up a different adapter. Second, a single training loop: improvements to the base propagate to every surface for free instead of being forked three ways. Third, predictable inference cost: one 4B forward pass regardless of which surface you're on.
Architecture
saturday-4b is a 4.0B-parameter transformer trained for general planning, tool use, and language. Each surface owns a small LoRA adapter: ~26M params for voice (turn-taking, disfluency, terse output), ~28M for web (DOM and accessibility-tree understanding, form filling), ~32M for code (repo context, diffs, test-running). A 6M-parameter router head selects the active adapter from the current surface plus a shallow read of the user's request.
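The base-plus-adapter factorization can be sketched as follows. This is a toy illustration, not the production system: the shapes, the LoRA rank, and the keyword-based routing heuristic are all assumptions standing in for the real 4B transformer, the ~26-32M adapters, and the 6M learned router head.

```python
import numpy as np

D_MODEL = 64   # toy hidden size standing in for the 4B base's width
RANK = 8       # illustrative LoRA rank; the real adapter ranks are not stated

rng = np.random.default_rng(0)

# Frozen base weight, shared by every surface.
W_base = rng.normal(size=(D_MODEL, D_MODEL))

# One low-rank (A @ B) delta per surface adapter.
adapters = {
    s: (rng.normal(size=(D_MODEL, RANK)) * 0.01,
        rng.normal(size=(RANK, D_MODEL)) * 0.01)
    for s in ("voice", "web", "code")
}

def route(surface: str, request: str) -> str:
    """Toy router: trust the active surface unless the request names another.

    The real router is a learned 6M-parameter head; this keyword check just
    mimics 'current surface plus a shallow read of the request'.
    """
    for s in ("voice", "web", "code"):
        if s in request.lower():
            return s
    return surface

def forward(x: np.ndarray, surface: str, request: str) -> np.ndarray:
    """One pass: frozen base weight plus the routed adapter's low-rank delta."""
    a, b = adapters[route(surface, request)]
    return x @ (W_base + a @ b)

y = forward(np.ones(D_MODEL), surface="web", request="fill in this form")
```

The key property the sketch shows: the base weight is computed into every surface's forward pass unchanged, so any improvement to `W_base` lifts all three adapters at once.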
Per-surface results
On internal benchmarks, voice turn-taking F1 rises from 0.62 with the base alone to 0.89 with the voice adapter. End-to-end web browser task success goes from 15.3% to 94.9% with the web adapter, and SWE-lite code success from 22.5% to 67.5% with the code adapter. Each adapter is small enough to ship alongside the base and swap in under 100 ms on device.
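A sub-100 ms swap is plausible because switching surfaces never reloads the base weights; it only backs out one small delta and merges in another. A minimal sketch of that bookkeeping, with assumed names and toy shapes:

```python
import numpy as np

D = 64  # toy width; the real base is 4B parameters and stays resident
rng = np.random.default_rng(1)

W_base = rng.normal(size=(D, D))
deltas = {s: rng.normal(size=(D, D)) * 1e-3 for s in ("voice", "web", "code")}

class AdapterHost:
    """Keeps the base resident; swapping only touches the merged delta."""

    def __init__(self, w_base: np.ndarray, deltas: dict):
        self.w_base = w_base
        self.deltas = deltas
        self.w = w_base.copy()   # effective weight used by the forward pass
        self.active = None

    def swap(self, surface: str) -> None:
        if self.active is not None:
            self.w -= self.deltas[self.active]   # back out the old adapter
        self.w += self.deltas[surface]           # merge the new one
        self.active = surface

host = AdapterHost(W_base, deltas)
host.swap("web")
host.swap("code")   # O(adapter size), not O(base size)
```

The cost of `swap` scales with the adapter (tens of millions of parameters), not the base (billions), which is what keeps the switch fast on device.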
Cross-surface transfer
The most interesting observation is that skills learned in one surface transfer to the others. Training the web adapter on planning-heavy browsing tasks lifted voice turn-taking F1 by 7.3 points with no voice-side retraining. The unified base is the shared bus across which capability accrues.
Why not a larger single model?
A 27B dense model does not fit on the user's laptop at comfortable latency. A 27B MoE with ~4B active parameters does — that is our target for the next base (Gemma 4 27B MoE). Critically, the MoE pattern composes with this approach: the unified-base + per-surface-adapter factorization is orthogonal to dense vs sparse, and gets better as the base gets more capable.
Limitations
The router can misroute on genuinely ambiguous inputs — e.g., a natural-language request issued while the user is visibly in a code editor. We are evaluating multi-adapter mixing (weighted composition rather than hard selection) for cases where the user's surface is under-specified or the request straddles two surfaces.
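Weighted composition can be sketched as a convex combination of the adapter deltas under the router's probabilities, rather than an argmax over them. Everything below is an illustrative assumption (toy shapes, a plain softmax router), not the mixing scheme under evaluation:

```python
import numpy as np

D = 64
rng = np.random.default_rng(2)
W_base = rng.normal(size=(D, D))
surfaces = ("voice", "web", "code")
deltas = {s: rng.normal(size=(D, D)) * 1e-3 for s in surfaces}

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mixed_weight(router_logits: np.ndarray) -> np.ndarray:
    """Soft selection: blend adapter deltas by router probability.

    With one confident logit this degenerates to hard selection; with an
    ambiguous request the effective weight sits between two adapters.
    """
    p = softmax(router_logits)
    mix = sum(p_i * deltas[s] for p_i, s in zip(p, surfaces))
    return W_base + mix

# Ambiguous request: router splits its mass between web and code.
W_soft = mixed_weight(np.array([0.1, 1.0, 1.0]))
# Confident request: effectively just the voice adapter.
W_hard = mixed_weight(np.array([100.0, -100.0, -100.0]))
```

Hard selection falls out as the limiting case, so the mixing scheme strictly generalizes the current router.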
Results
Base vs. surface adapter, evaluated per surface.
| Surface | Benchmark | Base | With adapter |
|---|---|---|---|
| Voice · Jarvis | Turn-taking F1 (internal) | 0.62 | 0.89 |
| Web · Tensor | E2E browser tasks (59) | 15.3% | 94.9% |
| Web · Tensor | WebArena-lite (35) | — | 91.4% |
| Code · Tensor Code | SWE-lite (40 tasks) | 22.5% | 67.5% |
Cite
@misc{mingllm2026surfaces,
  title={One Model, Many Surfaces: A Unified Base for Voice, Web, and Code Agents},
  author={MingLLM Research},
  year={2026},
  url={https://mingllm.com/papers/surfaces}
}