Company · March 4, 2026 · 2 min read

Why we are local-first

Cloud-first is a choice disguised as gravity. Here is what we are choosing instead.

By Yiming Beckmann

Every other agent product you use runs in the cloud. Some are clear about it; most leave it implicit. The architecture they ship reveals a choice: user data is gathered, transported, processed by a stranger, and returned. It is called "AI" but it is mostly "telecommunications."

We are building the opposite. Everything MingLLM does — inference, memory, training — runs on your device. Cloud is an opt-in escape hatch, not the primary path. Here is why.

Trust surface and capability surface should match

When an agent has access to your mail, your documents, your calendar, your screen, your microphone — the set of things it could know is enormous. The set of things the vendor can see should not be the same set.

If your agent runs in the cloud, the vendor sees everything the agent sees. You are trusting that the vendor will not look, will not retain, will not leak. That trust is not a technical guarantee; it is a promise in a terms-of-service document.

If your agent runs on your device, the vendor sees nothing. The trust surface shrinks to match the capability surface.

Latency is a product

Voice agents live or die on latency. The round-trip from your microphone to a cloud model and back is dominated by queue time and network, not by inference. Our median voice-to-first-action latency is 410ms, versus 2.4s for a comparable cloud agent. That 2-second difference is the difference between "this feels like an assistant" and "this feels like an API call."
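
To make the comparison concrete, here is a toy decomposition of the round-trip. Only the 2.4s and 410ms totals above are measured; the per-stage split is an assumption for illustration, not a benchmark.

```python
# Illustrative decomposition of a cloud voice round-trip.
# The per-stage split below is hypothetical; only the 2.4s and 410ms
# totals come from the measurements cited in this post.
cloud_trip = {
    "network (up + down)": 0.4,   # assumed share, seconds
    "queue / scheduling":  1.5,   # assumed share, seconds
    "inference":           0.5,   # assumed share, seconds
}
local_trip = {"inference": 0.41}  # median voice-to-first-action, on device

cloud_total = sum(cloud_trip.values())
local_total = sum(local_trip.values())
print(f"cloud total: {cloud_total:.2f}s")                 # 2.40s
print(f"local total: {local_total:.2f}s")                 # 0.41s
print(f"difference:  {cloud_total - local_total:.2f}s")   # ~2s
```

Even if cloud inference itself were instant, the queue and network terms alone would keep the round-trip well above what a local model delivers end to end.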

Costs are visible

At ~200 queries/day per user, a cloud agent costs the vendor real money. That cost gets passed to the user as a subscription, or extracted indirectly through data. Local inference has a marginal cost of zero to the vendor. The business model stops requiring you to be the product.
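
As a rough illustration of that arithmetic (the per-query cloud price below is an assumption for the sketch, not a figure from anyone's books):

```python
# Back-of-the-envelope vendor cost comparison, illustrative only.
QUERIES_PER_DAY = 200          # usage level cited above
CLOUD_COST_PER_QUERY = 0.002   # hypothetical: $0.002/query in API + infra cost
DAYS_PER_MONTH = 30

cloud_monthly = QUERIES_PER_DAY * CLOUD_COST_PER_QUERY * DAYS_PER_MONTH
local_monthly = 0.0            # vendor-side marginal cost of on-device inference

print(f"Cloud vendor cost per user/month: ${cloud_monthly:.2f}")  # $12.00
print(f"Local vendor cost per user/month: ${local_monthly:.2f}")  # $0.00
```

Whatever the exact per-query price, the vendor-side number scales with usage in the cloud and stays at zero on device.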

What we give up

We are not doctrinaire. There are tasks that a 4B local model cannot do and a trillion-parameter cloud model can: novel multi-day research, extensive web synthesis, expert-level domain reasoning. For those, MingLLM offers an opt-in cloud handoff — explicit, per-task, with the user approving the data that leaves the device.
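
A minimal sketch of what that gate looks like in practice. The names and fields here are hypothetical, not MingLLM's actual API; the point is the shape: show the outgoing data, ask once, send only on an explicit yes.

```python
# Sketch of an explicit, per-task cloud handoff gate (hypothetical names).
from dataclasses import dataclass

@dataclass
class HandoffRequest:
    task: str             # e.g. "extensive web synthesis"
    payload_preview: str  # exactly what would leave the device

def request_cloud_handoff(req: HandoffRequest) -> bool:
    """Show the user the outgoing data and require explicit approval."""
    print(f"Task: {req.task}")
    print(f"Data that would leave this device:\n{req.payload_preview}")
    answer = input("Send to cloud for this task only? [y/N] ")
    return answer.strip().lower() == "y"

req = HandoffRequest(
    task="extensive web synthesis",
    payload_preview="3 documents, 1 calendar range (shown in full before sending)",
)
if request_cloud_handoff(req):
    ...  # hand exactly the approved payload to the cloud model
# otherwise nothing leaves the device; the task stays local or is declined
```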

Local-first is a default, not a dogma. The only dogma is that you should know which mode you are in and choose it yourself.