One Model, Many Surfaces
A Unified Base for Voice, Web, and Code Agents
Abstract
We argue that the right factorization for a personal agent is one base model, per-surface adapters — not one agent per product. A single 4B base, `saturday-4b`, drives all three MingLLM surfaces: voice (Jarvis), web (Tensor), and code (Tensor Code). Surface specialization comes from (a) small LoRA adapters trained per surface, (b) a routing head that dispatches to the right adapter based on the user's current surface and the nature of the request, and (c) a shared memory backbone (see The Orb Knows). This keeps capability accretive: a skill learned in one surface lifts the others, memory is common, and inference cost stays fixed at 4B.
Why one base
Three benefits fall out naturally from the unified factorization. First, shared context: if the agent speaks with you and then helps you browse, the conversation doesn't restart; it's the same model with the same memory picking up a different adapter. Second, a single training loop: improvements to the base propagate to every surface for free instead of being forked three ways. Third, predictable inference cost: one 4B forward pass regardless of which surface you're on.
Architecture
saturday-4b is a 4.0B-parameter transformer trained for general planning, tool use, and language. Each surface owns a small LoRA adapter: ~26M params for voice (turn-taking, disfluency, terse output), ~28M for web (DOM and accessibility-tree understanding, form filling), ~32M for code (repo context, diffs, test-running). A 6M-parameter router head selects the active adapter from the current surface plus a shallow read of the user's request.
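The base-plus-adapter factorization can be sketched as follows. This is a toy illustration, not the production system: the shapes, the LoRA rank, and the keyword-based routing heuristic are all assumptions standing in for the real 4B transformer, the ~26-32M adapters, and the 6M learned router head.

```python
import numpy as np

D_MODEL = 64   # toy hidden size standing in for the 4B base's width
RANK = 8       # illustrative LoRA rank; the real adapter ranks are not stated

rng = np.random.default_rng(0)

# Frozen base weight, shared by every surface.
W_base = rng.normal(size=(D_MODEL, D_MODEL))

# One low-rank (A @ B) delta per surface adapter.
adapters = {
    s: (rng.normal(size=(D_MODEL, RANK)) * 0.01,
        rng.normal(size=(RANK, D_MODEL)) * 0.01)
    for s in ("voice", "web", "code")
}

def route(surface: str, request: str) -> str:
    """Toy router: trust the active surface unless the request names another.

    The real router is a learned 6M-parameter head; this keyword check just
    mimics 'current surface plus a shallow read of the request'.
    """
    for s in ("voice", "web", "code"):
        if s in request.lower():
            return s
    return surface

def forward(x: np.ndarray, surface: str, request: str) -> np.ndarray:
    """One pass: frozen base weight plus the routed adapter's low-rank delta."""
    a, b = adapters[route(surface, request)]
    return x @ (W_base + a @ b)

y = forward(np.ones(D_MODEL), surface="web", request="fill in this form")
```

The key property the sketch shows: the base weight is computed into every surface's forward pass unchanged, so any improvement to `W_base` lifts all three adapters at once.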
Per-surface results
On internal benchmarks, voice turn-taking F1 rises from 0.62 with the base alone to 0.89 with the voice adapter. End-to-end web browser task success goes from 15.3% to 94.9% with the web adapter, and SWE-lite code success from 22.5% to 67.5% with the code adapter. Each adapter is small enough to ship alongside the base and swap in under 100 ms on device.
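A sub-100 ms swap is plausible because switching surfaces never reloads the base weights; it only backs out one small delta and merges in another. A minimal sketch of that bookkeeping, with assumed names and toy shapes:

```python
import numpy as np

D = 64  # toy width; the real base is 4B parameters and stays resident
rng = np.random.default_rng(1)

W_base = rng.normal(size=(D, D))
deltas = {s: rng.normal(size=(D, D)) * 1e-3 for s in ("voice", "web", "code")}

class AdapterHost:
    """Keeps the base resident; swapping only touches the merged delta."""

    def __init__(self, w_base: np.ndarray, deltas: dict):
        self.w_base = w_base
        self.deltas = deltas
        self.w = w_base.copy()   # effective weight used by the forward pass
        self.active = None

    def swap(self, surface: str) -> None:
        if self.active is not None:
            self.w -= self.deltas[self.active]   # back out the old adapter
        self.w += self.deltas[surface]           # merge the new one
        self.active = surface

host = AdapterHost(W_base, deltas)
host.swap("web")
host.swap("code")   # O(adapter size), not O(base size)
```

The cost of `swap` scales with the adapter (tens of millions of parameters), not the base (billions), which is what keeps the switch fast on device.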
Cross-surface transfer
The most interesting observation is that skills learned in one surface transfer to the others. Training the web adapter on planning-heavy browsing tasks lifted voice turn-taking F1 by 7.3 points with no voice-side retraining. The unified base is the shared bus across which capability accrues.
Why not a larger single model?
A 27B dense model does not fit on the user's laptop at comfortable latency. A 27B MoE with ~4B active parameters does — that is our target for the next base (Gemma 4 27B MoE). Critically, the MoE pattern composes with this approach: the unified-base + per-surface-adapter factorization is orthogonal to dense vs sparse, and gets better as the base gets more capable.
Limitations
The router can misroute on genuinely ambiguous inputs — e.g., a natural-language request issued while the user is visibly in a code editor. We are evaluating multi-adapter mixing (weighted composition rather than hard selection) for cases where the user's surface is under-specified or the request straddles two surfaces.
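Weighted composition can be sketched as a convex combination of the adapter deltas under the router's probabilities, rather than an argmax over them. Everything below is an illustrative assumption (toy shapes, a plain softmax router), not the mixing scheme under evaluation:

```python
import numpy as np

D = 64
rng = np.random.default_rng(2)
W_base = rng.normal(size=(D, D))
surfaces = ("voice", "web", "code")
deltas = {s: rng.normal(size=(D, D)) * 1e-3 for s in surfaces}

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mixed_weight(router_logits: np.ndarray) -> np.ndarray:
    """Soft selection: blend adapter deltas by router probability.

    With one confident logit this degenerates to hard selection; with an
    ambiguous request the effective weight sits between two adapters.
    """
    p = softmax(router_logits)
    mix = sum(p_i * deltas[s] for p_i, s in zip(p, surfaces))
    return W_base + mix

# Ambiguous request: router splits its mass between web and code.
W_soft = mixed_weight(np.array([0.1, 1.0, 1.0]))
# Confident request: effectively just the voice adapter.
W_hard = mixed_weight(np.array([100.0, -100.0, -100.0]))
```

Hard selection falls out as the limiting case, so the mixing scheme strictly generalizes the current router.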
Results
Base vs. surface adapter, evaluated per surface.
| Surface | Benchmark | Base | With adapter |
|---|---|---|---|
| Voice · Jarvis | Turn-taking F1 (internal) | 0.62 | 0.89 |
| Web · Tensor | E2E browser tasks (59) | 15.3% | 94.9% |
| Web · Tensor | WebArena-lite (35) | — | 91.4% |
| Code · Tensor Code | SWE-lite (40 tasks) | 22.5% | 67.5% |
Cite
@misc{mingllm2026surfaces,
  title={One Model, Many Surfaces: A Unified Base for Voice, Web, and Code Agents},
  author={MingLLM Research},
  year={2026},
  url={https://mingllm.com/papers/surfaces}
}