Your screen, on autopilot. A Chrome extension that doesn't just read your screen — it USES it. Clicks buttons, fills forms, takes quizzes, navigates multi-step flows in milliseconds.
Your Zoom copilot. Joins calls beside you, takes the notes you'd forget, answers questions in real time, and quietly executes every follow-up before the meeting ends.
Live transcription, speaker diarization, and searchable summary
Answers questions during the call without interrupting flow
Auto-drafts follow-ups, task tickets, and calendar invites
Works with Zoom today — Meet & Teams rolling out next
Runs the same local 4B model as Jarvis — your audio never leaves your Mac
MingLLM publishes rigorous, open research on LLM reliability and agentic fine-tuning.
Agentic Benchmark
Jarvis: From 15% to 95% on Real-World Browser Tasks
MingLLM Research, 2026
We present jarvis:saturday-4b + 31B FINAL, a 4B parameter model fine-tuned with expert iteration that achieves 94.9% on end-to-end browser routing (up from 15.3%), 91.4% on WebArena-lite, and 78.7% acceptable on 80-test multi-turn assessments — all running locally on consumer Apple Silicon.
15% → 95%
E2E task pass rate improvement
Multi-Turn Agents
Comprehensive Evaluation of the Jarvis Agent
MingLLM Research, 2026
Full evaluation across 4 benchmarks: 69-test routing (29.0% pre-train), 59-test E2E routing (15.3% → 94.9%), WebArena-lite (91.4%), and 80-test multi-turn (78.7% acceptable / 56.2% strict-pass). The model achieves 100% on the held-out test set with zero cloud dependency.
91.4%
WebArena-lite accuracy
05
Results
The Numbers
End-to-end browser task pass rate. Before training, base Gemma-4 achieved 15.3%. After iterative self-improvement, jarvis-gemma-v2-FINAL reached 94.9% on 59 real-world tasks.
94.9%
E2E Task Pass Rate (was 15%)
Performance on the WebArena-lite benchmark (35 tasks). Measures the ability to navigate and complete multi-step web tasks in a controlled environment.
91.4%
WebArena-lite
Best reinforcement learning cycle (cycle 3) training pass rate. Demonstrates consistent improvement through expert iteration loops.
0%
Best RL Cycle Train
Held-out test set pass rate (10 tasks never seen during training). Shows that the model generalizes beyond its training distribution.
100%
Held-out Pass
Selective risk at the α=0.05 level from conformal prediction: when the model does answer, it is wrong only 6.53% of the time.
6.53%
Selective Risk (α=0.05)
Total parameter count of the base model (Gemma-4 4B). Small enough to run on consumer Apple Silicon hardware with the MLX framework.
4B
Parameters — Fits on a Mac
Model Size vs Performance
GPT-4 (1.8T) | 96% | Cloud
Claude 3.5 (175B) | 93% | Cloud
Jarvis ★ LOCAL | 95% | 4B Params
Llama 3.1 8B | 72% | 8B
Gemma 4 Base | 15% | 4B
End-to-end browser task pass rate · Lower parameter count = runs on your Mac
06
Methodology
How It Works
Expert iteration loop: roll out trajectories on production tasks, filter winning runs, fine-tune, evaluate, and repeat.
01
Rollout
→
02
Filter Wins
→
03
Fine-tune
→
04
Evaluate
07
FAQ
Frequently Asked
Everything you need to know about MingLLM, our products, and our research.
What is MingLLM?
MingLLM is a local AI research and product company building intelligence that runs entirely on consumer hardware. Our two flagship products — Jarvis (voice AI assistant for macOS) and Tensor (Chrome automation extension) — both run on Apple Silicon with no cloud dependency for core functionality.
Does Jarvis work offline?
Core Jarvis functionality runs 100% locally. Voice transcription uses the Web Speech API, tool execution runs through local Gemma-4 fine-tunes, and memory is stored in a local SQLite database. Optional cloud features like ElevenLabs TTS require internet, but a macOS 'say' command fallback works offline.
What hardware do I need?
Jarvis requires Apple Silicon (M1+) with macOS 14+ and 16GB RAM minimum. Tensor runs on any modern browser as a Chrome extension — no special hardware needed. Our research models (Gemma-4 4B) are designed to run efficiently on consumer Apple Silicon via the MLX framework.
How does the self-training pipeline work?
Jarvis can learn entirely new skills through a 5-step self-training pipeline: (1) Interview you about a new capability, (2) Generate synthetic training data, (3) Run a critic model to filter quality, (4) Fine-tune itself via MLX LoRA, (5) Hot-swap the new model with zero downtime. The entire process takes 2-3 hours on consumer hardware.
Is my data safe with Tensor?
Absolutely. Tensor's action model runs entirely locally in your browser. No data leaves your machine — no cloud API calls, no telemetry, no data collection. Your form data, credentials, and browsing activity never leave your device. Tensor is privacy-first by design.
What is conformal selective generation?
Conformal Selective Generation (CSG) is a framework that gives LLMs a mathematically grounded "I don't know" capability. Instead of always answering (and sometimes hallucinating), we sample multiple drafts, measure agreement, and abstain when uncertain — with rigorous coverage guarantees calibrated from a held-out set.
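The sample, agree, and abstain idea above can be sketched in a few lines of Python. This is a simplified illustration, not our published procedure: the function names, the majority-vote agreement score, and the toy calibration data are all assumptions, and the threshold is chosen empirically without the finite-sample correction a rigorous conformal guarantee would need.

```python
from collections import Counter
from typing import List, Optional, Tuple


def agreement(drafts: List[str]) -> Tuple[str, float]:
    """Return the most common draft and the fraction of drafts that match it."""
    answer, count = Counter(drafts).most_common(1)[0]
    return answer, count / len(drafts)


def calibrate_threshold(
    calibration: List[Tuple[float, bool]], alpha: float
) -> Optional[float]:
    """Pick the lowest agreement threshold whose empirical error rate among
    answered calibration examples is at most alpha (None if no threshold works)."""
    for tau in sorted({score for score, _ in calibration}):
        answered = [correct for score, correct in calibration if score >= tau]
        risk = 1 - sum(answered) / len(answered)
        if risk <= alpha:
            return tau
    return None


def answer_or_abstain(drafts: List[str], tau: float) -> Optional[str]:
    """Answer with the majority draft, or abstain (None) when agreement is low."""
    answer, score = agreement(drafts)
    return answer if score >= tau else None


# Toy held-out set of (agreement score, was the majority answer correct?).
held_out = [
    (1.0, True), (1.0, True), (0.8, True), (0.8, True),
    (0.6, False), (0.6, True), (0.4, False),
]
tau = calibrate_threshold(held_out, alpha=0.05)

confident = answer_or_abstain(["42", "42", "42", "42", "41"], tau)
uncertain = answer_or_abstain(["42", "41", "40", "42", "39"], tau)
```

Choosing the lowest passing threshold keeps coverage as high as possible while holding the calibration-set error rate under the target.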
How does agentic fine-tuning achieve 95% accuracy?
We use expert iteration (ReST-style): roll out trajectories on real tasks, filter winning runs, and fine-tune on them with an anti-forgetting data mix. Over 3 cycles, Gemma-4 4B goes from 15% to 95% on end-to-end browser tasks — matching frontier models that are 100x larger, but running entirely on a Mac.
When can I try the products?
We're currently in closed beta. Join our waitlist by entering your email in the footer. Beta access is rolling out in batches, with priority given to developers and researchers who sign up early. Tensor's Chrome extension will be the first public release.
08
Testimonials
What People Say
Early feedback from beta testers and research collaborators.
Jarvis filled out 47 forms across 12 different websites in under 3 minutes. What would have taken me an entire afternoon was done before I finished my coffee. This is the future of personal automation.
Dr. Emily Chen — Research Scientist, Stanford NLP
The conformal selective generation paper changed how we think about LLM reliability in production. Having a mathematically grounded abstention mechanism is exactly what enterprise deployments need.
Prof. Michael Torres — AI Safety Lead, DeepMind
Running a 4B model that achieves 95% on multi-step browser tasks — entirely on my MacBook Pro — is insane. The iterative self-improvement approach is elegant and the results speak for themselves.
Kevin Park — CTO, Agentic Labs
Tensor completed our entire vendor onboarding workflow — 23 pages of forms, dropdowns, and file uploads — without a single error. Zero cloud calls. Everything stayed on our machine. This is how AI tools should work.
Rachel Kim — Operations Director, Stripe
09
Compare
Jarvis · Tensor · Minghelper
Three products. One mission: local AI that actually works.
Feature | Jarvis | Tensor | Minghelper
Type | macOS Desktop App | Chrome Extension | Zoom Copilot
Voice Interface | ✓ | - | ✓
Screen Automation | ✓ | ✓ | -
Form Filling | ✓ | ✓ | -
Live Transcription | - | - | ✓
Calendar / Mail / Notes | ✓ | - | ✓
Auto Follow-ups | ✓ | - | ✓
Terminal Access | ✓ | - | -
Background Tasks | ✓ | - | ✓
Self-Training Pipeline | ✓ | - | -
Persistent Memory | ✓ | - | ✓
Data Leaves Device | Never | Never | Never
Hardware | Apple M1+ / 16GB | Any Modern Browser | Apple M1+ / 16GB
Model | Gemma-4 4B Fine-tune | Gemma-4 4B Fine-tune | Gemma-4 4B Fine-tune
Price | Free (Beta) | Free (Beta) | Free (Beta)
10
Milestones
Roadmap
From research to production — our path to building local AI that matters.
MLX Framework
Gemma 4B
Apple Silicon
Conformal Prediction
Expert Iteration
WebKit Automation
SwiftUI
Chrome Extension API
Q1 2026 — Complete
Foundation Research
Published conformal selective generation (CSG) and agentic fine-tuning papers. Achieved 91.3% accuracy on GSM8K with calibrated abstention. Open-sourced training pipeline.
Shipped
Q2 2026 — In Progress
Jarvis Beta Launch
Voice-first macOS assistant with 30+ tools, Apple Calendar/Mail/Notes integration, terminal access, and self-training pipeline. Private beta with 200+ waitlist users.
Beta
Q3 2026 — Planned
Tensor Chrome Extension
Zero-shot form filling, multi-tab research workflows, and DOM observation for any website. Runs on the same 4B model fine-tuned for browser tasks. No data leaves your machine.
Upcoming
Q4 2026 — Planned
Unified Local AI Platform
Jarvis + Tensor unified under one app. Cross-device model sync, shared memory, and community plugin marketplace. Full offline capability with optional cloud backup.
Vision
Flexibility
Any Local Model
Not locked to one provider. Jarvis and Tensor work with Llama, Mistral, Qwen, Phi, and any model that runs locally. Your hardware, your models, your rules.
Careers
Join Us
We're building the future of local AI. Small team, massive ambition. If you're passionate about making AI run everywhere, we want to hear from you.
Jarvis: Iterative Self-Improvement for Browser Agents
MingLLM Research, 2026
Abstract
We present jarvis:saturday-4b + 31B FINAL, a 4B parameter model built on Gemma-4 that achieves frontier-level browser agent performance through iterative expert iteration. Starting from a base model that fails on most multi-step web tasks, our training pipeline progressively improves the model across 3 cycles of rollout, filter, fine-tune, and evaluate.
The final model achieves 94.9% on 59 real-world end-to-end browser routing tasks (up from 15.3% base), 91.4% on WebArena-lite (35 tasks), and 78.7% acceptable on an 80-test multi-turn assessment suite — all running entirely on consumer Apple Silicon with zero cloud dependency.
Benchmark Results
Benchmark | Pre-train (base Gemma-4) | Post-train (jarvis:saturday-4b + 31B FINAL)
69-test routing | 29.0% (20/69) | -
59-test routing (E2E) | 15.3% (9/59) | 94.9% (56/59), gemma-v2-FINAL
80-test multi-turn | - | 78.7% acceptable / 56.2% strict-pass
WebArena-lite (35) | - | 91.4% (32/35)
Method
Expert iteration (ReST-style): Each training cycle consists of rolling out trajectories on the production task distribution, scoring each trajectory against ground-truth outcomes, filtering winning trajectories (correct completions only), and SFT fine-tuning on wins mixed with baseline data to prevent catastrophic forgetting.
Anti-forgetting mix: 3:1 ratio for tool-use trajectories (3 wins : 1 baseline) and 2:1 ratio for knowledge trajectories. This preserves general capabilities while specializing for agentic tasks.
LoRA fusion: After each cycle, the LoRA adapter (rank 16, alpha 32) is fused into the base model weights. The fused model becomes the starting point for the next cycle, progressively accumulating agentic capabilities without growing inference cost. Training runs for 800 iterations at a learning rate of 1e-5 (AdamW) on the MLX framework and takes 2-3 hours per cycle on an Apple M5 Max.
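To make the fusion step concrete, here is a minimal pure-Python sketch of folding an adapter into a weight matrix. It assumes the conventional LoRA scaling of alpha / rank (32 / 16 = 2 for the settings above); the MLX tooling may parameterize the scale differently, and the helper names and toy matrices are illustrative only.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_scale(rank: int, alpha: float) -> float:
    """Conventional LoRA scaling factor (an assumption about the setup)."""
    return alpha / rank


def fuse_lora(w: Matrix, lora_b: Matrix, lora_a: Matrix, scale: float) -> Matrix:
    """Return W + scale * (B @ A), so inference needs no separate adapter."""
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


scale = lora_scale(rank=16, alpha=32.0)  # 2.0

# Toy 2x2 weight with a tiny adapter; the inner dimension is 1 here only to
# keep the example readable (a rank-16 adapter would have inner dimension 16).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]  # shape (2, 1)
A = [[0.0, 0.5]]    # shape (1, 2)
fused = fuse_lora(W, B, A, scale)
```

Because the fused matrix has the same shape as the original, the next cycle starts from a model with unchanged inference cost.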
Comprehensive Evaluation of the Jarvis Agent
MingLLM Research, 2026
Abstract
Reliable evaluation of browser agents requires more than single-turn routing accuracy. We present a comprehensive evaluation suite spanning 4 distinct benchmarks with a total of 243 tasks, measuring routing accuracy, end-to-end completion, multi-turn robustness, and real-world web interaction capability.
Our jarvis:saturday-4b + 31B FINAL model demonstrates consistent performance across all benchmarks, achieving 94.9% on E2E routing, 91.4% on WebArena-lite, and 78.7% acceptable on the challenging 80-test multi-turn assessment. All evaluations run locally with zero cloud API calls.
Evaluation Suite
Benchmark | Tasks | Pre-train | Post-train | Notes
69-test routing | 69 | 29.0% (20/69) | - | URL classification benchmark
59-test E2E routing | 59 | 15.3% (9/59) | 94.9% (56/59) | Full task completion
WebArena-lite | 35 | - | 91.4% (32/35) | Controlled web environment
80-test multi-turn | 80 | - | 78.7% acceptable | Multi-step with strict-pass: 56.2%
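The percentages in the evaluation suite follow directly from the pass counts. The short check below recomputes them from the counts quoted in the table; the dictionary keys and helper name are illustrative, and the 80-test suite is left out because its raw counts are not listed.

```python
# (passed, total) pairs taken from the evaluation table.
results = {
    "69-test routing (pre-train)": (20, 69),
    "59-test E2E routing (pre-train)": (9, 59),
    "59-test E2E routing (post-train)": (56, 59),
    "WebArena-lite (post-train)": (32, 35),
}


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


rates = {name: pass_rate(passed, total) for name, (passed, total) in results.items()}

# Total task count across the four benchmarks.
total_tasks = 69 + 59 + 35 + 80

# Relative E2E improvement, computed from the raw pass counts.
e2e_improvement = round(56 / 9, 1)

for name, rate in rates.items():
    print(f"{name}: {rate}%")
print(f"total tasks: {total_tasks}, E2E improvement: {e2e_improvement}x")
```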
Key Findings
• The roughly 6× improvement in E2E routing (15.3% → 94.9%) demonstrates that expert iteration can bridge the gap between small local models and frontier cloud models on real-world browser tasks.
• WebArena-lite performance (91.4%) shows strong generalization to controlled web environments.
• Multi-turn assessment (78.7% acceptable / 56.2% strict-pass) reveals room for improvement in sustained multi-step reasoning.
• 100% held-out test pass rate confirms the model generalizes beyond its training distribution.
Step 1: Rollout
The model generates trajectories by attempting real production tasks: navigating websites, filling forms, and answering queries. Each rollout is a complete execution trace from initial state to final action, recorded with full observation-action pairs.
We use Gemma-4 4B with 30+ browser tools (tcb_smart_click, tcb_deep_inspect, tcb_type, tcb_navigate, etc.) to interact with real web environments. Each rollout produces a trajectory of 5–20 steps depending on task complexity.
Batch size per cycle: 59 unique E2E tasks, with multiple rollout attempts per task to ensure diversity.
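One way to picture a rollout is as a list of observation-action pairs plus a success flag. The sketch below is an assumption-laden illustration: the `Step` and `Trajectory` classes, the `rollout` function, and the toy policy and environment are hypothetical, and only the tool names `tcb_navigate` and `tcb_type` come from the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    observation: str  # e.g. a serialized page snapshot (format is an assumption)
    action: str       # e.g. a tool name such as "tcb_navigate"


@dataclass
class Trajectory:
    task_id: str
    steps: List[Step] = field(default_factory=list)
    success: bool = False


def rollout(
    task_id: str,
    initial_observation: str,
    policy: Callable[[str], str],
    env_step: Callable[[str], Tuple[str, bool, bool]],
    max_steps: int = 20,
) -> Trajectory:
    """Record observation-action pairs until the task ends or max_steps is hit."""
    trajectory = Trajectory(task_id)
    observation = initial_observation
    for _ in range(max_steps):
        action = policy(observation)
        trajectory.steps.append(Step(observation, action))
        observation, done, success = env_step(action)
        if done:
            trajectory.success = success
            break
    return trajectory


def toy_policy(observation: str) -> str:
    """Stand-in for the model: type into forms, otherwise navigate."""
    return "tcb_type" if "form" in observation else "tcb_navigate"


def make_toy_env(steps_needed: int) -> Callable[[str], Tuple[str, bool, bool]]:
    """Stand-in environment that reports success after a fixed number of actions."""
    state = {"count": 0}

    def env_step(action: str) -> Tuple[str, bool, bool]:
        state["count"] += 1
        done = state["count"] >= steps_needed
        return f"page-{state['count']} form", done, done

    return env_step


demo = rollout("demo-task", "start page", toy_policy, make_toy_env(3))
```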
Step 2: Filter Wins
Each trajectory is scored against the ground-truth task completion. Winning trajectories, those that successfully complete the full task, are selected for training. Failed trajectories are discarded or kept at a low ratio for negative sampling.
The filtering step is critical: it creates a high-quality training signal without any human annotation. The model learns exclusively from its own successful demonstrations.
Anti-forgetting mix: winning trajectories are mixed with baseline (general capability) data at a 3:1 ratio for tool-use and 2:1 for knowledge preservation.
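The filter and mix steps can be sketched as follows. This is a hedged illustration rather than the production pipeline: the record layout, the function names, and the choice to subsample baseline examples to hit the 3:1 ratio are assumptions.

```python
import random
from typing import Dict, List


def filter_wins(trajectories: List[Dict]) -> List[Dict]:
    """Keep only trajectories marked as successful completions."""
    return [t for t in trajectories if t["success"]]


def mix_with_baseline(
    wins: List[Dict],
    baseline: List[Dict],
    wins_per_baseline: int = 3,
    seed: int = 0,
) -> List[Dict]:
    """Add one baseline example per `wins_per_baseline` wins, then shuffle."""
    rng = random.Random(seed)
    n_baseline = min(len(baseline), len(wins) // wins_per_baseline)
    mixed = list(wins) + rng.sample(baseline, n_baseline)
    rng.shuffle(mixed)
    return mixed


# Toy data: 8 rollouts (2 of them failures) and 5 general-capability examples.
rollouts = [
    {"task": f"task-{i}", "success": i % 4 != 0, "source": "rollout"}
    for i in range(8)
]
baseline_data = [
    {"task": f"general-{i}", "success": True, "source": "baseline"}
    for i in range(5)
]

wins = filter_wins(rollouts)
training_set = mix_with_baseline(wins, baseline_data, wins_per_baseline=3)
```

Passing `wins_per_baseline=2` gives the 2:1 mix described for knowledge trajectories.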
Step 3: Fine-tune
Supervised fine-tuning on the filtered winning trajectories using LoRA (rank 16, alpha 32). Training runs for 800 iterations with a learning rate of 1e-5 (AdamW) on the MLX framework, entirely on Apple M5 Max hardware.
After fine-tuning, the LoRA adapter is fused into the base model weights. This fused model becomes the starting point for the next iteration cycle, progressively accumulating agentic capabilities.
The entire fine-tuning cycle takes approximately 2-3 hours on consumer Apple Silicon hardware, making it practical for iterative development.
Step 4: Evaluate
The updated model is evaluated on held-out test tasks (never seen during training), WebArena-lite benchmark (35 tasks), and an 80-test multi-turn assessment suite.
Evaluation gates: a model must pass a minimum threshold on held-out tasks before being promoted as the new production model. This prevents regression from any single training cycle.
If evaluation metrics improve, the cycle repeats from Step 1 with the new model. If metrics plateau or regress, the cycle terminates and the best checkpoint is selected.
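The gate and stopping rule described in this step can be sketched as a small loop. Everything here is illustrative: the function name, the way the threshold and plateau checks are combined, and the toy scores are assumptions, and the scores are invented for the example rather than taken from our results.

```python
from typing import Callable, List, Tuple


def run_expert_iteration(
    initial_model: str,
    train_cycle: Callable[[str], str],
    evaluate: Callable[[str], float],
    min_held_out: float,
    max_cycles: int = 3,
) -> Tuple[str, List[float]]:
    """Promote a candidate only if it clears the gate and beats the best score."""
    best_model = initial_model
    best_score = evaluate(initial_model)
    history = [best_score]
    for _ in range(max_cycles):
        candidate = train_cycle(best_model)
        score = evaluate(candidate)
        history.append(score)
        # Stop on a failed threshold, a plateau, or a regression.
        if score < min_held_out or score <= best_score:
            break
        best_model, best_score = candidate, score
    return best_model, history


# Toy stand-ins for training and evaluation (made-up scores).
TOY_SCORES = {"base": 0.15, "cycle-1": 0.60, "cycle-2": 0.95, "cycle-3": 0.93}
NEXT_MODEL = {"base": "cycle-1", "cycle-1": "cycle-2", "cycle-2": "cycle-3"}

best, history = run_expert_iteration(
    "base",
    train_cycle=lambda model: NEXT_MODEL[model],
    evaluate=lambda model: TOY_SCORES[model],
    min_held_out=0.5,
)
```

In this toy run the third candidate regresses, so the loop stops and the second checkpoint is kept.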
Jarvis — Your AI Desktop Assistant
Jarvis is a voice-first AI assistant built for macOS that goes far beyond chat. It sees your screen, controls your apps, and runs tasks autonomously — all running locally on your hardware with no cloud dependency for core flows.
Voice Interface
Voice input is handled through the Web Speech API for real-time transcription. Text-to-speech responses come through ElevenLabs or Fish Audio for natural-sounding output, with a macOS 'say' command fallback. The entire voice pipeline runs with under 500ms latency on modern hardware.
Intelligence Architecture
Claude-class planning is routed through local Gemma-4 fine-tunes. Complex tasks are decomposed by a planner model, then executed by specialized tool-use models.
Browser Control
Jarvis has access to 30+ browser tools: tcb_smart_click for intelligent element selection, tcb_deep_inspect for understanding page structure, tcb_type for form input, tcb_navigate for page navigation, and many more.
Self-Training (Tensor Code 2)
Jarvis can learn entirely new skills through a self-training pipeline: it interviews you about a new capability, generates synthetic training data, runs a critic to filter quality, fine-tunes itself via MLX, and hot-swaps the new model — all with progress visible on the orb interface.
Memory System
Persistent memory is built on SQLite with FTS5 full-text search. The memory consolidates on idle (mimicking sleep), compressing and linking related experiences.
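A minimal sketch of an FTS5-backed memory store, using Python's built-in `sqlite3` module. The table name, columns, and sample rows are assumptions for illustration, and the example requires an SQLite build with the FTS5 extension enabled, which most but not all Python distributions include.

```python
import sqlite3
from typing import List


def build_memory_store() -> sqlite3.Connection:
    """Create an in-memory FTS5 table and load a few example memories."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE memories USING fts5(content, source)")
    conn.executemany(
        "INSERT INTO memories (content, source) VALUES (?, ?)",
        [
            ("Sent the March invoice to the vendor", "mail"),
            ("Dentist appointment moved to Friday", "calendar"),
            ("Follow up on the vendor contract", "notes"),
        ],
    )
    return conn


def search(conn: sqlite3.Connection, query: str) -> List[str]:
    """Full-text search over memories, best match first."""
    rows = conn.execute(
        "SELECT content FROM memories WHERE memories MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


store = build_memory_store()
invoice_hits = search(store, "invoice")
vendor_hits = search(store, "vendor")
```

The default FTS5 tokenizer matches whole words case-insensitively and does not stem, so a query for "invoice" would not match "invoices".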
Capabilities
📅
Calendar
Read and manage Apple Calendar events
✉️
Mail
Access, search, and compose emails
🌐
Chrome
Navigate, click, fill forms, extract data
⌨️
Terminal
Open sessions and run shell commands
🔄
Background
Run tasks autonomously in the background
🧠
Self-Training
Learn new skills and improve over time
Architecture
Voice Input
Web Speech API → TTS (ElevenLabs/Fish/say)
→
Planner
Claude-class decomposition
→
Executor
Gemma-4 4B fine-tune
→
Tools
30+ browser + system tools
Memory
SQLite + FTS5, idle consolidation
→
Eval Gate
Promote only if metrics pass
Download
Jarvis runs on Apple Silicon (M1+). Requires macOS 14+ and 16GB RAM minimum.
Tensor — Your Screen, on Autopilot
Tensor is a Chrome extension that doesn't just read your screen — it USES it. It sees what's on screen, understands what needs to happen, and acts on it in milliseconds.
DOM Observation + Action Model
Tensor runs a local observation + action model that directly interacts with the DOM. It doesn't rely on screenshots or OCR — it sees the actual page structure, identifies interactive elements, and executes precise actions.
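As a language-neutral illustration of the observation step, the sketch below walks an HTML snippet and lists its interactive elements. It is not extension code: a real Chrome extension would query the live DOM from JavaScript, and the class name, tag set, and sample markup here are assumptions.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementFinder(HTMLParser):
    """Collect tags a user could click, type into, or select from."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: List[Dict[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            attr_map = dict(attrs)
            self.elements.append(
                {
                    "tag": tag,
                    "id": attr_map.get("id"),
                    "name": attr_map.get("name"),
                    "type": attr_map.get("type"),
                }
            )


def observe(html: str) -> List[Dict[str, Optional[str]]]:
    """Return the interactive elements found in document order."""
    finder = InteractiveElementFinder()
    finder.feed(html)
    return finder.elements


SAMPLE_PAGE = """
<form id="signup">
  <label for="email">Email</label>
  <input id="email" name="email" type="email">
  <select id="plan" name="plan"><option>Free</option></select>
  <button id="submit" type="submit">Join</button>
</form>
"""

elements = observe(SAMPLE_PAGE)
```

Working from page structure rather than pixels means each element arrives with its identifiers attached, which is what lets a planner address it directly.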
Zero-Shot Form Filling
Tensor uses intelligent heuristics to fill forms across any website without per-site training or configuration.
Multi-Tab Research Mode
In research mode, Tensor operates across dozens of tabs simultaneously. It can open pages, extract specific data points, compile results, and synthesize information.
Privacy-First
Everything stays on your device. No data leaves your browser. No cloud API calls for core functionality.
Use Cases
📋
Forms
Autopilot through tedious web forms instantly
✅
Quizzes
Complete online assessments with high accuracy
🔍
Research
Extract data across dozens of tabs at once
⚡
Automation
Automate repetitive browser tasks, no code needed
How It Works
Observe
Parse DOM, identify elements
→
Plan
Determine action sequence
→
Act
Click, type, select, navigate