A voice-first macOS assistant that actually acts. Talk to your machine; it drives your apps, reads your data, writes your code, and finishes the work. Current baseline: jarvis:saturday-4b. Next target: Gemma 4 27B MoE.
Reads your Apple Calendar, Mail, Notes, and Files — with permission
Drives your browser (via Tensor) and terminal (via Tensor Code)
Runs tasks in the background — you go do other things
A Chrome extension that doesn't just read your tabs — it uses them. Install it once, open a side panel, tell it what you want done. It opens the pages, fills the forms, reads across tabs, and finishes the flow.
Native side panel — not a popup, not a toolbar
Zero-shot form filling and multi-tab research
Bundled services daemon — one install, everything wired up
Your browsing never leaves the machine unless you say so
Three papers on what it takes to build a local agent that actually works — memory, surfaces, and the loop that trains it on your laptop.
Paper I · Memory
The Orb Knows
MingLLM Research · 2026
A persistent, self-consolidating memory system for local agents. Writes while the agent acts; compresses and links on idle — the way sleep consolidates a day.
Paper II · Architecture
One Model, Many Surfaces
MingLLM Research · 2026
A single base model drives voice (Jarvis), web (Tensor), and code (Tensor Code). One mind, three hands — specialization by small adapters, not separate models.
Paper III · Position
Against Scale
MingLLM Research · 2026
A position paper. On bounded agent tasks, expert iteration on a 4B model — run on one laptop over a weekend — matches frontier cloud models. The moat is trajectories, not parameters.
04
Results
The numbers
E2E Task Pass Rate (was 15%)
End-to-end browser task pass rate. The pre-trained Gemma-4 base achieved 15.3%; after iterative self-improvement, jarvis-gemma-v2-FINAL reached 94.9% on 59 real-world tasks.
WebArena-lite
Performance on the WebArena-lite benchmark (35 tasks). Measures the ability to navigate and complete multi-step web tasks in a controlled environment.
Best RL Cycle Train
Training pass rate of the best reinforcement-learning cycle (cycle 3). Demonstrates consistent improvement through expert-iteration loops.
Held-out Pass
Held-out test-set pass rate (10 tasks never seen during training). Shows the model generalizes beyond its training distribution.
Selective Risk (α=0.05)
Selective risk at the α=0.05 level from conformal prediction: when the model does choose to answer, it is wrong only 6.53% of the time.
Parameters — Fits on a Mac
Total parameter count of the base model (Gemma-4 4B). Small enough to run on consumer Apple Silicon hardware with the MLX framework.
Model Size vs Performance
GPT-4 (1.8T) · 96% · Cloud
Claude 3.5 (175B) · 93% · Cloud
Jarvis ★ LOCAL · 95% · 4B params
Llama 3.1 8B · 72% · 8B
Gemma 4 Base · 15% · 4B
End-to-end browser task pass rate · Lower parameter count = runs on your Mac
How it works
Hear. Plan. Act. Remember.
Every request you make runs the same four steps. The loop closes back on itself — each action sharpens the next.
01
Hear
Voice · text · screen
Jarvis captures what you said — or what's on your screen. Transcription and intent parsing happen on-device.
02
Plan
saturday-4b · router
The base model decomposes the request and the router picks the right surface: Jarvis, Tensor, or Tensor Code.
03
Act
Voice · web · code
The chosen surface executes — drives apps, clicks through the browser, edits and runs code. You see the work as it happens.
04
Remember
The orb · consolidates on idle
Events become episodes; episodes become facts. The next request starts with everything the last one learned.
06
FAQ
Questions, answered
What MingLLM is, what ships, and what's next.
What is MingLLM?
MingLLM builds local AI agents that run on your own hardware. Three products share one intelligence: Jarvis (voice-first macOS assistant), Tensor (Chrome extension that uses the web for you), and Tensor Code (a coding companion CLI).
What is Tensor?
A Chrome extension. You install it once, open a side panel, tell it what you need done on the web. It reads, clicks, fills, synthesizes — multi-tab flows without babysitting. Your browsing stays on your machine.
What hardware do I need?
Jarvis: Apple Silicon (M1+), macOS 14+, 16GB RAM. Tensor: any Chromium-based browser (Chrome, Edge, Brave, Arc). Tensor Code runs anywhere a terminal runs — your keys, your models.
How does self-training work?
Five stages: (1) Interview — the agent asks you about a new capability, (2) Synth — it generates candidate training data, (3) Critic — a stronger model filters for quality, (4) MLX — a LoRA is trained on-device, (5) Hot-swap — the new adapter loads without a restart. Think of it as an orb that keeps getting sharper the more you use it.
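Strung together, the five stages are a single pipeline. The sketch below is illustrative only: every callable and the `hot_swap` method are stand-ins, not the shipping Jarvis API.

```python
def learn_capability(agent, capability, ask, synth, critic, train_lora):
    """The five self-training stages as one pass. All callables are
    caller-supplied stand-ins (assumptions), not MingLLM's actual API."""
    answers = ask(capability)                    # 1. Interview the user
    candidates = synth(answers)                  # 2. Synth candidate training data
    kept = [c for c in candidates if critic(c)]  # 3. Critic filters for quality
    adapter = train_lora(kept)                   # 4. LoRA trained on-device
    agent.hot_swap(adapter)                      # 5. Hot-swap, no restart
    return adapter
```

The key property is that only stage 3's filter touches quality; the rest is plumbing, which is why the loop can run unattended.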
What's the current Jarvis baseline?
jarvis:saturday-4b — declared the production baseline 2026-04-12. It doesn't get replaced unless the challenger beats it on real agentic flows. The next training target is Gemma 4 27B MoE.
Is my data private?
On-device by default. Jarvis memory is a local SQLite database. Tensor keeps your browsing on your machine. Cloud calls are opt-in and visible — every outbound request is logged in the side panel.
What's conformal selective generation?
A framework that gives the model a mathematically grounded "I don't know." Sample multiple drafts, measure agreement, abstain when uncertain — with calibrated coverage guarantees. Pairs naturally with the agent loop so Jarvis stops before doing something it isn't confident about.
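The abstain rule fits in a few lines. Below is a toy sketch under simplifying assumptions: exact-match agreement between drafts instead of semantic clustering, and a fixed threshold `tau` standing in for the conformally calibrated cutoff.

```python
from collections import Counter

def selective_generate(sample, prompt, k=8, tau=0.6):
    """Answer only when enough of k sampled drafts agree; abstain otherwise.
    `sample` is a stand-in for the model call; in the real method `tau`
    would be calibrated on held-out data so risk stays below alpha."""
    drafts = [sample(prompt) for _ in range(k)]
    answer, votes = Counter(drafts).most_common(1)[0]
    return answer if votes / k >= tau else None  # None = "I don't know"
```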
How do I get access?
Private beta now. Drop your email below. Access rolls out in waves — developers, researchers, and design engineers go first.
07
Voices
Early reactions
Notes from private-beta users. Anonymized.
Told Jarvis "clear my inbox, draft replies to the important ones, book the dentist." Came back to coffee. It had done it all — replies ready for review, appointment on the calendar.
Beta user — Research engineer
Tensor finished an entire vendor-onboarding flow — 23 pages of forms, dropdowns, file uploads — without a single error. Never left my machine. It doesn't feel like an extension anymore, it feels like the browser itself is smart.
Beta user — Ops lead
A 4B model that hits 95% on real browser tasks, trained on a laptop, running on a laptop. The expert-iteration loop is the first thing I've seen that actually compounds.
Beta user — ML researcher
Tensor Code is a claude-code-shaped thing that's mine. It has my context, my keys, my rules, and it ships without asking for permission every three seconds.
Beta user — Design engineer
08
Compare
The family
Three agents. One intelligence. Pick the surface you need.
|  | Jarvis | Tensor | Tensor Code |
| Surface | Voice · macOS | Chrome extension | CLI / terminal |
| Voice input | ✓ | — | — |
| Browser automation | via Tensor | ✓ | — |
| Code authoring | via Tensor Code | — | ✓ |
| Terminal execution | ✓ | — | ✓ |
| Calendar / Mail / Notes | ✓ | — | — |
| Background tasks | ✓ | ✓ | ✓ |
| Self-training loop | ✓ | — | ✓ |
| Persistent memory | ✓ | ✓ | ✓ |
| Data leaves device | Opt-in | Opt-in | Opt-in |
| Where it runs | Apple Silicon | Any Chromium browser | Any Unix shell |
| Model | saturday-4b | Local + remote | Your choice |
| Status | Private beta | Private beta | Beta |
09
Milestones
Roadmap
From research to production — our path to building local AI that matters.
MLX Framework
Gemma 4B
Apple Silicon
Conformal Prediction
Expert Iteration
WebKit Automation
SwiftUI
Chrome Extension API
Q1 2026 — Shipped
Foundation research
Conformal selective generation (CSG) and agentic fine-tuning research. Jarvis went from 15% → 95% on end-to-end browser tasks with a single fine-tuned 4B model; the training pipeline is published.
Shipped
Q2 2026 — Now
Jarvis + Tensor in private beta
Voice-first macOS agent (Jarvis) and Chrome extension (Tensor) both in private beta. Self-training loop live on Jarvis. Tensor Code CLI shipped for developers.
Beta
Q3 2026 — Next
Public launch · bigger brain
Jarvis and Tensor open to anyone. Jarvis upgrades to a larger MoE base for harder multi-step tasks. Tensor gets a research mode that spans 20+ tabs.
Upcoming
Q4 2026 — Horizon
Unified agent fabric
One intelligence across voice, browser, and terminal. Shared context, shared memory, shared tools. Optional Mingeta for Ray-Ban Display — suggestions routed through Telegram to bypass Meta's HUD SDK gate.
Vision
Flexibility
Bring your own model.
Gemma, Llama, Qwen, Phi, Mistral — anything that runs in MLX. Swap the base, the fine-tune, even the provider. Your hardware, your models, your rules.
Careers
Join Us
We're building the future of local AI. Small team, massive ambition. If you're passionate about making AI run everywhere, we want to hear from you.
MingLLM Research · 2026 · Persistent memory for local agents
Abstract
Agents without memory are amnesiac tool-callers; agents with memory are colleagues. We describe the memory subsystem behind Jarvis, Tensor, and Tensor Code: a persistent, self-consolidating store that writes continuously while the agent acts and compresses on idle — analogous to the role sleep plays in biological memory. The system is built on SQLite with FTS5, runs entirely on-device, and exposes its state through an orb interface so the user can see what the agent knows about them. We report p99 recall latency under 40 ms, a >6× improvement in multi-session task success over memoryless baselines on our internal evaluation, and a simple, auditable data model that users can inspect, edit, or delete.
1. Motivation
The current generation of "chat" assistants is stateless by default. Context windows are stuffed on each turn; anything beyond the window is forgotten. This is fine for one-shot answers and catastrophic for agency. A useful agent has to know what you asked yesterday, who "my sister" refers to, which project directory you meant, and whether it already tried a failed approach last Tuesday.
We set three design goals:
Local-first. Memory never leaves the device without explicit opt-in.
Continuous, not batch. The agent writes as it acts, not in a separate ingest job.
Legible. The user can see the memory — the orb visualizes density, growth, and topic clusters.
FTS5 provides BM25 ranking over all three tables (events, episodes, and facts) without a separate index service. Embedding retrieval is a secondary path computed on demand, not at write time — a deliberate choice to keep the write path cheap and the system easy to inspect.
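A minimal sketch of that write/recall path, assuming three single-text-column tables mirrored into one FTS5 index. Table and column names here are illustrative, not the shipping schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE events  (id INTEGER PRIMARY KEY, text TEXT);
  CREATE TABLE episodes(id INTEGER PRIMARY KEY, text TEXT);
  CREATE TABLE facts   (id INTEGER PRIMARY KEY, text TEXT);
  -- One FTS5 index ranks across all three sources with BM25.
  CREATE VIRTUAL TABLE memory_fts USING fts5(text, source UNINDEXED);
""")

def remember(source, text):
    # Cheap write path: store the row, then mirror it into the shared index.
    con.execute(f"INSERT INTO {source}(text) VALUES (?)", (text,))
    con.execute("INSERT INTO memory_fts(text, source) VALUES (?, ?)",
                (text, source))

def recall(query, k=5):
    # bm25() is FTS5's built-in rank; lower scores are better matches.
    return con.execute(
        "SELECT source, text FROM memory_fts WHERE memory_fts MATCH ? "
        "ORDER BY bm25(memory_fts) LIMIT ?", (query, k)).fetchall()
```

Keeping one FTS table with a `source` column is what lets a single query rank across event, episode, and fact rows at once.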
3. Idle consolidation
When the agent has been idle for more than 90 seconds, a consolidation pass runs:
Summarize the last N events into a single episode row; drop events older than a configurable TTL.
Link the episode back to any related prior episodes via shared entities (people, files, URLs).
Promote any repeated-across-episodes assertion into the facts table.
Forget facts that contradict newer observations; keep the old one as an episode reference, not a live fact.
The consolidator is itself the agent's own model — no separate dedicated model — run at low priority. Users can trigger it manually (/sleep in Jarvis) or inspect its output.
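The consolidation steps above can be sketched over plain in-memory rows. Everything here is a simplification: shared entities stand in for model-extracted assertions, string joins stand in for the model-written summaries, and the dict shapes are invented for the sketch.

```python
from collections import Counter

def consolidate(events, episodes, facts, min_repeats=2):
    """One idle consolidation pass (a toy stand-in; the real consolidator
    is the agent's own model run at low priority)."""
    # 1. Summarize recent events into a single episode row.
    episode = {
        "summary": "; ".join(e["text"] for e in events),
        "entities": sorted({ent for e in events for ent in e["entities"]}),
    }
    # 2. Link the episode to prior episodes that share an entity.
    episode["links"] = [i for i, prior in enumerate(episodes)
                        if set(prior["entities"]) & set(episode["entities"])]
    episodes.append(episode)
    events.clear()  # consolidated events are dropped (TTL handling elided)
    # 3. Promote entities seen in enough episodes into live facts.
    seen = Counter(ent for ep in episodes for ent in ep["entities"])
    for ent, n in seen.items():
        if n >= min_repeats and ent not in facts:
            facts.append(ent)
    return episode
```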
4. Results
Cross-session task success — 10-day window
Memory growth — facts learned over 50 days
| Metric | Memoryless baseline | Orb memory |
| p99 recall latency | — | <40 ms |
| Cross-session task success (10-day window) | 11.2% | 72.4% |
| "Who did I mention last time?" recall | 0% | 94% |
| Storage (50 days heavy use) | — | ~180 MB |
5. Visualizing knowing
The orb interface isn't decorative. Its density encodes fact count, its temperature encodes recency, and its "breathing" corresponds to active consolidation. When the agent sleeps, the orb dims; when it wakes with new facts, it flickers. This gives the user a first-person view into whether the agent has seen them enough to be useful yet — a question that has historically been answered with "trust us."
6. Limitations
Multi-user shared memory (e.g., a household) is out of scope for this version. Fact contradictions currently require manual review if they span more than two episodes. Embedding-based recall is on-demand only; the sub-40 ms latency figures hold for BM25 queries, not ANN searches.
@misc{mingllm2026orb,
title={The Orb Knows: Persistent, Self-Consolidating Memory for Local Agents},
author={MingLLM Research},
year={2026},
url={https://mingllm.com/papers/orb}
}
Paper · Architecture
One Model, Many Surfaces
MingLLM Research · 2026 · A unified base for voice, web, and code agents
Abstract
We argue that the right factorization for a personal agent is one base model, per-surface adapters — not one agent per product. A single 4B base, saturday-4b, drives all three MingLLM surfaces: voice (Jarvis), web (Tensor), and code (Tensor Code). Surface specialization comes from (a) small LoRA adapters trained per surface, (b) a routing head that dispatches to the right adapter based on the user's current surface and the nature of the request, and (c) a shared memory backbone (see The Orb Knows). This keeps capability accretive: a skill learned in one surface lifts the others, memory is common, and inference cost stays fixed at 4B.
1. Why one base
The naïve approach is one model per product: one LLM for voice, another for browsing, a third for coding. We tried this. The result is three separate context silos, three separate training pipelines, and three models that have to re-learn the same user every Monday. The unified-base approach fixes all three:
Shared context. A fact learned by Jarvis is immediately available to Tensor.
Single training loop. Improvements to the base lift all surfaces; improvements to an adapter are cheap and targeted.
Predictable inference cost. The user pays for 4B params, not 12B spread across products.
Each adapter is trained via expert iteration on surface-specific trajectories (see Against Scale for the method). Training is fully independent per surface, so work on the Web adapter doesn't block Voice.
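To make the dispatch concrete, here is a deliberately tiny stand-in for the routing head: keyword rules in place of the learned classifier, with adapter names invented for the sketch.

```python
ADAPTERS = {"voice": "jarvis-lora", "web": "tensor-lora", "code": "tensor-code-lora"}

def route(request, current_surface):
    """Pick a per-surface adapter for a request. These keyword rules are a
    toy stand-in for the trained routing head described in the paper."""
    text = request.lower()
    if any(k in text for k in ("terminal", "repo", "test", "build", "rebase")):
        return ADAPTERS["code"]
    if any(k in text for k in ("browser", "tab", "form", "website")):
        return ADAPTERS["web"]
    # Ambiguous requests fall back to the surface the user is already on.
    return ADAPTERS.get(current_surface, ADAPTERS["voice"])
```

The real router conditions on the user's current surface and the request jointly; the fallback in the last line mirrors that bias toward the active surface.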
3. Per-surface results
Base model vs adapter — per surface
Architecture — one base, three adapters, one router
| Surface | Benchmark | Base (no adapter) | With adapter |
| Voice · Jarvis | Turn-taking F1 (internal) | 0.62 | 0.89 |
| Web · Tensor | E2E browser tasks (59) | 15.3% | 94.9% |
| Web · Tensor | WebArena-lite (35) | — | 91.4% |
| Code · Tensor Code | SWE-lite (internal, 40 tasks) | 22.5% | 67.5% |
4. Cross-surface transfer
Because the base is shared, we see positive transfer: a capability learned in one surface shows up partially in the others without explicit training. For example, training the Web adapter on multi-step planning raised the Voice adapter's task-completion rate by 7.3 points despite no Voice retraining. This is the point — the surfaces are three hands on the same mind.
5. Why not a larger single model?
A 27B dense model would solve the three-surfaces problem by brute force but wouldn't fit a laptop's RAM budget at acceptable inference latency. MoE (our target, Gemma 4 27B MoE with 3B active) sidesteps this — it's the natural next base. The router-and-adapter pattern composes with MoE rather than competing with it.
6. Limitations
The router head is currently trained supervised on labeled surfaces and can misroute on ambiguous inputs ("can you show me this in my calendar and book it?"). We're evaluating a latent, multi-adapter-mix approach where both Voice and Calendar adapters can be active simultaneously.
@misc{mingllm2026surfaces,
title={One Model, Many Surfaces: A Unified Base for Voice, Web, and Code Agents},
author={MingLLM Research},
year={2026},
url={https://mingllm.com/papers/surfaces}
}
Paper · Position
Against Scale
MingLLM Research · 2026 · A position paper on expert iteration, bounded task distributions, and the end of the compute-moat thesis
Abstract
The dominant thesis of the last five years is that capability scales with compute, parameters, and data — a thesis that justifies an industry of billion-dollar training runs. On bounded task distributions — the kind a personal agent actually faces — this thesis is wrong. Expert iteration on a 4B base, run on a single MacBook over a weekend, matches frontier cloud models that are 50–400× its size. We ran the experiment. Three cycles, ~3 hours each, $0 in cloud compute, 15.3% → 94.9% on a real-world browser agent benchmark. This paper makes the position explicit: for bounded agency, scale is not the moat. Trajectories are.
1. The scaling orthodoxy
The argument for scale has three legs: (i) emergence — capabilities appear only above a parameter threshold; (ii) sample efficiency — bigger models learn from less; (iii) generality — bigger models transfer to more tasks. Each is true in the limit and misleading in practice.
For a specific agent facing a specific task distribution — book my flights, drive my browser, ship my code — the relevant question is not "how many capabilities can this model have in principle?" but "how many of the capabilities this user needs does it have reliably?" We show that this narrower question is answerable with loops, not scale.
2. The experiment
Setup. Base: Gemma-4 4B. Hardware: single Apple M5 Max MacBook, 64 GB unified memory. Cloud compute: $0. Training framework: MLX. Task distribution: 59 real end-to-end browser tasks drawn from production traffic.
Method. Expert iteration (ReST-style). Each cycle:
Roll out. The current model attempts each of the 59 tasks; traces are recorded.
Filter wins. Only trajectories that successfully complete the task are kept. No human labels.
Fine-tune. SFT on wins mixed with baseline data (3:1 for tool use, 2:1 for knowledge) — a LoRA of rank 16 and α=32, 800 iterations, LR 1e-5 (AdamW). ~2–3 hours per cycle.
Fuse + evaluate. LoRA fused into base weights. Fused model evaluated on held-out and WebArena-lite; if metrics improve, the fused model becomes next cycle's base.
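The cycle above can be written as one loop. This is a sketch under stated assumptions: `attempt`, `passed`, `finetune`, and `evaluate` are caller-supplied placeholders for the rollout, the win check, the LoRA fine-tune plus fuse, and the held-out evaluation, not MingLLM's actual API.

```python
def expert_iteration(model, tasks, attempt, passed, finetune, evaluate, cycles=3):
    """ReST-style expert iteration as described above; every callable is
    a stand-in supplied by the caller (an assumption for this sketch)."""
    best = model
    for _ in range(cycles):
        traces = [attempt(best, t) for t in tasks]   # 1. Roll out
        wins = [tr for tr in traces if passed(tr)]   # 2. Filter wins (no human labels)
        if not wins:
            break                                    # nothing to learn from yet
        candidate = finetune(best, wins)             # 3. Fine-tune on wins, then fuse
        if evaluate(candidate) > evaluate(best):     # 4. Promote only on improvement
            best = candidate
    return best
```

Note the gate in step 4: a cycle that regresses on held-out evaluation never becomes the next base, which is what makes the loop safe to run unattended.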
3. Results
E2E task pass rate by cycle · 4B base on one MacBook
Parameters × performance · saturday-4b vs frontier class
| Cycle | E2E (59) | WebArena-lite (35) | Cumulative wall-clock | Cumulative cloud $ |
| Base (Gemma-4 4B) | 15.3% | — | 0 h | $0 |
| Cycle 1 | 64.4% | 71.4% | ~3 h | $0 |
| Cycle 2 | 86.4% | 85.7% | ~6 h | $0 |
| Cycle 3 | 94.9% | 91.4% | ~9 h | $0 |
Cycle 3 matches or beats the reported scores of every public frontier model we can compare against on the same benchmark class, with 50–400× fewer parameters and roughly nine hours of laptop time.
4. Against scale
Three claims, in increasing order of contentiousness:
Narrow. For bounded task distributions, a small model plus expert iteration dominates a frontier model plus prompting on both capability and cost.
Actionable. The ingredients required — a 4B base, a scoring function, and a laptop — are within reach of anyone who wants an agent.
Strong. The marginal return on scale for agentic tasks, conditional on access to the user's task distribution, is approximately zero. The moat is trajectories, not parameters.
We expect disagreement on the strong claim. We note only that the moat argument is empirically testable, and that our test fails to find it.
5. What this doesn't claim
We do not claim scale is never useful. For genuinely open-ended generation, open-world reasoning, or extremely long horizons, scale helps. We claim that personal agency — the kind MingLLM builds — is not one of those regimes.
6. What this implies
If the moat for personal agency is the task distribution and not the model, the economics of the industry inverts. The interesting asset is no longer a trillion-parameter checkpoint; it is a tight loop between a real user, a scoring function, and a 4B model on that user's laptop. This is the thesis MingLLM is built on.
Step 1: Roll out
The model generates trajectories by attempting real production tasks — navigating websites, filling forms, answering queries. Each rollout is a complete execution trace from initial state to final action, recorded with full observation-action pairs.
We use Gemma-4 4B with 30+ browser tools (tcb_smart_click, tcb_deep_inspect, tcb_type, tcb_navigate, etc.) to interact with real web environments. Each rollout produces a trajectory of 5–20 steps depending on task complexity.
Batch size per cycle: 59 unique E2E tasks, with multiple rollout attempts per task to ensure diversity.
Step 2: Filter Wins
Each trajectory is scored against ground-truth task completion. Winning trajectories — those that successfully complete the full task — are selected for training. Failed trajectories are discarded or kept at a low ratio for negative sampling.
The filtering step is critical: it creates a high-quality training signal without any human annotation. The model learns exclusively from its own successful demonstrations.
Anti-forgetting mix: winning trajectories are mixed with baseline (general capability) data at a 3:1 ratio for tool-use and 2:1 for knowledge preservation.
Step 3: Fine-tune
Supervised fine-tuning on the filtered winning trajectories using LoRA (rank 16, alpha 32). Training runs for 800 iterations with learning rate 1e-5 (AdamW) on the MLX framework, entirely on Apple M5 Max hardware.
After fine-tuning, the LoRA adapter is fused into the base model weights. This fused model becomes the starting point for the next iteration cycle, progressively accumulating agentic capabilities.
The entire fine-tuning cycle takes approximately 2-3 hours on consumer Apple Silicon hardware, making it practical for iterative development.
Step 4: Evaluate
The updated model is evaluated on held-out test tasks (never seen during training), WebArena-lite benchmark (35 tasks), and an 80-test multi-turn assessment suite.
Evaluation gates: a model must pass a minimum threshold on held-out tasks before being promoted as the new production model. This prevents regression from any single training cycle.
If evaluation metrics improve, the cycle repeats from Step 1 with the new model. If metrics plateau or regress, the cycle terminates and the best checkpoint is selected.
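The promotion rule reduces to a predicate. A sketch, with an invented 0.80 absolute gate standing in for the real threshold:

```python
def promote(candidate_score, production_score, gate=0.80):
    """Evaluation gate from the text: a new checkpoint must clear a minimum
    held-out threshold AND beat the current production model. The default
    gate value is an illustrative assumption, not the production setting."""
    return candidate_score >= gate and candidate_score > production_score
```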
Jarvis · voice
A voice on your desk that actually does things.
You talk. It listens, thinks, and drives the apps on your machine. No wake-word theater. No "I can't help with that." It finishes the job.
listening
What it sounds like
"clear my inbox and draft replies to the important ones"
"book the dentist for next Tuesday afternoon"
"find the cheapest flight to Tokyo next Friday"
"kill the dev server, I want to rebase"
What it can reach
Calendar
Your day, rearranged by asking.
Mail
Reads, triages, drafts in your voice.
Files
Finds, opens, edits — without you clicking.
Browser
Hands the task to Tensor and comes back.
Terminal
Runs shell, ships code via Tensor Code.
Background
Long jobs happen. You go live your life.
How it gets sharper
Jarvis watches how you work. When it notices a gap, it asks. What it learns becomes part of it — no restart, no cloud round-trip.
interview
synth
critic
train
live
Runs where
Apple Silicon. On-device by default. The cloud is optional, and when it's on, you see it.
Tensor · web
A browser that browses for you.
A Chrome extension you install once. Ask it to do the thing — find it, click through it, fill the form, read the tabs, finish the flow. You don't watch. You check back.
vendor-onboarding.example.com
tensor
filling page 3 of 7
What to hand it
"fill this onboarding flow with my info"
"research the top 3 CRMs under $50/seat, summarize"
"apply to every role on this careers page that matches my resume"
"book whichever flight is cheapest before 6pm"
Why it works
Zero-shot forms
Any site. No per-site config.
Multi-tab research
Reads across tabs, synthesizes, cites.
Side panel
Watches alongside; never blocks you.
Local
Your browsing stays on your machine.
Install where
Chrome. Edge. Brave. Arc. Anything Chromium. One extension, a plasma dot in your toolbar.
Tensor Code · terminal
A coder in your terminal that actually ships.
You type what you want. It reads the repo, plans, edits, runs the tests, fixes what it breaks, and tells you what it did. Proactive by default — it doesn't idle.
~/project · tensor
$ tensor "fix the failing tests"
→ reading 14 files…
→ running npm test
✓ 18 passing · 3 failing
→ patching src/auth/session.ts
→ re-running tests
✓ all 21 passing · done in 9.2s
$
What to ask
"fix the failing tests"
"wire a Stripe checkout into the /pricing page"
"migrate this to React 19, keep everything green"
"the staging build is broken — look at the error, fix it, push"
Proactive by default
After the first instruction, it keeps moving. Reads, tries, fails, fixes, ships. You steer when you want to; you don't have to babysit.
Gets to know your repo
Every session adds to a local memory of how your codebase works — patterns it's seen, conventions it's respected, what the tests actually check. The next session starts where this one left off.
Install where
Anywhere a terminal runs. macOS, Linux, WSL. Your keys, your models, your rules.
We use minimal cookies for analytics. Your data never leaves your device. Privacy Policy