The AI model landscape changes every month. A new frontier model drops, benchmarks are shattered, prices shift, and the model that was best yesterday is second-best today. If your AI tool is locked to a single provider, you are locked to their timeline, their pricing, and their capabilities. When a better model emerges somewhere else, you cannot use it without switching tools entirely.
Tensor was designed from day one to be model-agnostic. We support six AI providers — OpenAI, Anthropic, Google Gemini, DeepSeek, OpenRouter, and Ollama — through a unified provider abstraction layer that lets you switch models with a single click. This post explains how we built that abstraction, why model agnosticism matters, and how we optimize performance across fundamentally different APIs.
The Provider Abstraction Layer
At Tensor's core is a module we call the Provider Abstraction Layer (PAL). PAL defines a unified interface that all AI interactions flow through, regardless of which provider or model is handling the request. The interface is simple:
interface ProviderAdapter {
  // Core chat completion
  chat(messages: Message[], options: ChatOptions): AsyncStream<Chunk>

  // Capability detection
  supports(capability: Capability): boolean

  // Token counting
  countTokens(text: string): number

  // Model metadata
  getModels(): ModelInfo[]
  getDefaultModel(): string

  // Provider-specific configuration
  configure(config: ProviderConfig): void
  validate(): Promise<ValidationResult>
}
Each provider implements this interface through an adapter. The OpenAI adapter translates PAL calls into OpenAI API requests. The Anthropic adapter does the same for Claude's Messages API. The Ollama adapter communicates with a local Ollama instance. From Tensor's perspective, all providers look the same. The complexity of each provider's unique API format, authentication scheme, and response structure is encapsulated within the adapter.
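To make the adapter idea concrete, here is a minimal sketch of what one adapter might look like. The types, model list, and token heuristic are illustrative, not Tensor's actual code:

```typescript
// Illustrative types standing in for PAL's real definitions.
type Message = { role: "system" | "user" | "assistant"; content: string };
type Capability = "vision" | "functionCalling" | "extendedThinking";
type ModelInfo = { id: string; contextWindow: number };

// A hypothetical OpenAI-style adapter implementing the non-streaming
// parts of the ProviderAdapter interface.
class OpenAIStyleAdapter {
  private caps = new Set<Capability>(["vision", "functionCalling"]);

  supports(capability: Capability): boolean {
    return this.caps.has(capability);
  }

  getModels(): ModelInfo[] {
    return [
      { id: "gpt-4o", contextWindow: 128_000 },
      { id: "gpt-4o-mini", contextWindow: 128_000 },
    ];
  }

  getDefaultModel(): string {
    return this.getModels()[0].id;
  }

  // Rough heuristic: ~4 characters per token for English text. A real
  // adapter would use the provider's tokenizer.
  countTokens(text: string): number {
    return Math.ceil(text.length / 4);
  }
}
```

Because every adapter presents this same surface, the rest of Tensor can call `supports()` or `countTokens()` without knowing which provider is behind it.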
The Six Providers
OpenAI. The most widely used provider. Tensor supports GPT-4o, GPT-4o-mini, and all current Chat Completions models. The OpenAI adapter handles function calling, structured outputs, and vision inputs. It is the most feature-complete adapter because OpenAI's API has the broadest feature set.
Anthropic. Claude models excel at long-context tasks, nuanced reasoning, and careful instruction following. The Anthropic adapter handles Claude's unique message format (which separates system messages from the message array), supports the extended thinking feature for complex reasoning tasks, and implements Anthropic's prompt caching through cache control headers.
Google Gemini. Gemini models offer competitive performance with generous free tiers. The Gemini adapter translates PAL calls into the Generative Language API format, handles Gemini's unique safety settings, and supports Gemini's native multi-modal capabilities for image and document understanding.
DeepSeek. DeepSeek has emerged as a strong contender, particularly for coding tasks. Their models offer excellent performance at lower price points. The DeepSeek adapter is relatively straightforward because DeepSeek's API closely follows the OpenAI format, but we implement specific optimizations for their unique capabilities like fill-in-the-middle completion.
OpenRouter. OpenRouter is a meta-provider that gives access to dozens of models from multiple providers through a single API. The OpenRouter adapter dynamically fetches the available model list, handles per-model pricing display, and supports OpenRouter's unique features like fallback routing and provider preferences.
Ollama. Ollama runs open-source models locally on your machine. The Ollama adapter communicates with the local Ollama server over HTTP, auto-detects available models, and handles the streaming response format. This is the only provider that keeps your data entirely on your machine — not even the AI provider sees your prompts.
Why Model Agnosticism Matters
There are three compelling reasons to use a model-agnostic AI tool:
Best model for the job. No single model is best at everything. GPT-4o excels at code generation and function calling. Claude is exceptional at nuanced writing and long-context analysis. Gemini shines at multi-modal tasks. DeepSeek is remarkably strong at coding for its price. By supporting all of them, Tensor lets you pick the right model for each task. Write an email with Claude, debug code with GPT-4o, analyze a document with Gemini, and run a quick local query with Ollama — all within the same extension.
Cost optimization. Different providers charge different rates, and the pricing landscape shifts constantly. A task that costs $0.03 with GPT-4o might cost $0.01 with DeepSeek or $0.00 with Ollama. Tensor shows you the estimated cost per query for each provider, so you can make informed decisions about where your money goes. For high-volume tasks like persistent agents that run every 30 minutes, the cost difference between providers can be substantial over time.
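The per-query cost estimate boils down to simple arithmetic over each provider's token rates. A sketch, with placeholder prices (USD per million tokens) that are illustrative only:

```typescript
// Hypothetical pricing table. Real rates change frequently; Tensor would
// load these from provider metadata, not hard-code them.
type Pricing = { inputPerMTok: number; outputPerMTok: number };

const examplePricing: Record<string, Pricing> = {
  "gpt-4o": { inputPerMTok: 2.5, outputPerMTok: 10 },
  "deepseek-chat": { inputPerMTok: 0.27, outputPerMTok: 1.1 },
  "ollama-local": { inputPerMTok: 0, outputPerMTok: 0 }, // local = free
};

// Estimated USD cost for one query.
function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = examplePricing[model];
  if (!p) throw new Error(`unknown model: ${model}`);
  return (inputTokens * p.inputPerMTok + outputTokens * p.outputPerMTok) / 1_000_000;
}
```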
Resilience. Providers have outages. OpenAI has gone down. Anthropic has gone down. Google has gone down. If your tool is locked to one provider, an outage means you are locked out. Tensor can automatically fall back to a secondary provider when the primary is unavailable, so your agents and workflows keep running.
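The fallback behavior can be sketched as trying each configured provider in priority order and returning the first success. Names and shapes here are hypothetical:

```typescript
type ChatFn = (prompt: string) => Promise<string>;

// Try each provider in order; return the first successful reply.
// Only throw if every provider in the chain fails.
async function chatWithFallback(
  providers: { name: string; chat: ChatFn }[],
  prompt: string
): Promise<{ provider: string; reply: string }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, reply: await p.chat(prompt) };
    } catch (err) {
      // Record the failure and move on to the next provider.
      errors.push(`${p.name}: ${String(err)}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```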
Handling API Differences
The biggest engineering challenge in building PAL was not defining the interface — it was handling the vast differences between provider APIs. Here are some of the more interesting translation problems we solved:
Message format divergence. OpenAI uses a flat messages array with a role field. Anthropic separates the system message from the conversation messages. Gemini uses a contents array with a parts sub-array. Ollama follows the OpenAI format but with different streaming semantics. PAL normalizes all of this into a single message format internally and translates at the adapter boundary.
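As one example of that boundary translation, here is a sketch of mapping PAL's unified message list into Anthropic's shape, which takes the system prompt as a separate field. Field names are illustrative:

```typescript
type PalMessage = { role: "system" | "user" | "assistant"; content: string };

// Claude's Messages API takes the system prompt outside the messages
// array, so the adapter splits it out at the boundary.
function toAnthropicRequest(messages: PalMessage[]) {
  const system = messages
    .filter((m) => m.role === "system")
    .map((m) => m.content)
    .join("\n");
  const rest = messages.filter((m) => m.role !== "system");
  return { system, messages: rest };
}
```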
Streaming differences. Every provider streams responses differently. OpenAI sends SSE (Server-Sent Events) with data: prefixed lines. Anthropic sends SSE with typed event names. Gemini uses chunked JSON. Ollama sends newline-delimited JSON. Each adapter implements a stream parser that converts the provider's format into PAL's unified AsyncStream<Chunk> type.
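A sketch of one such parser, for Ollama-style newline-delimited JSON, normalized into a unified chunk shape (field names illustrative):

```typescript
// Unified chunk shape that all stream parsers produce.
type Chunk = { text: string; done: boolean };

// Parse a buffer of newline-delimited JSON lines into chunks.
// A real parser would handle partial lines split across network reads.
function* parseNdjson(raw: string): Generator<Chunk> {
  for (const line of raw.split("\n")) {
    if (!line.trim()) continue; // skip blank keep-alive lines
    const obj = JSON.parse(line);
    yield { text: obj.response ?? "", done: Boolean(obj.done) };
  }
}
```

The SSE-based providers get analogous parsers; only the wire-format handling differs, while the `Chunk` output stays identical.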
Capability fragmentation. Not all models support all features. Vision (image inputs) is available on GPT-4o, Claude 3.5 Sonnet, and Gemini, but not on all DeepSeek models. Function calling is supported by OpenAI, Anthropic, and Gemini with different schemas. Extended thinking is Anthropic-only. PAL's supports(capability) method lets Tensor query a provider's capabilities at runtime and gracefully degrade when a feature is not available.
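Graceful degradation might look like the following sketch: when the active provider lacks vision, image parts are replaced with a textual placeholder instead of failing the whole request. The part shape is hypothetical:

```typescript
type Part =
  | { type: "text"; text: string }
  | { type: "image"; url: string };

// If the provider cannot handle images, substitute a placeholder so the
// request still succeeds with the remaining text content.
function degradeForProvider(parts: Part[], supportsVision: boolean): Part[] {
  if (supportsVision) return parts;
  return parts.map((p) =>
    p.type === "image"
      ? { type: "text", text: "[image omitted: model lacks vision]" }
      : p
  );
}
```

In the real flow, `supportsVision` would come from the adapter's `supports("vision")` call.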
Prompt Caching
One of the most impactful optimizations we implemented is prompt caching. When you use Tensor, many of your requests share a common prefix: the system prompt, your Personal Context, and the conversation history. Sending this entire prefix with every request wastes tokens and money.
Anthropic pioneered prompt caching with their cache control feature, which lets you mark portions of the prompt as cacheable. On subsequent requests with the same prefix, the cached portion is not re-processed, resulting in significant cost and latency savings. We implemented similar caching for OpenAI using their native prompt caching and for other providers through our own client-side deduplication logic.
For Tensor specifically, prompt caching has an outsized impact because we include Personal Context in every request. A user with a 500-word Personal Context is sending roughly 700 tokens of identical prefix with every message. With caching, that prefix is processed once and reused for the duration of the session, reducing costs by 15 to 30 percent depending on conversation length.
// Simplified caching flow for Anthropic. Claude's Messages API takes the
// system prompt as a top-level `system` array rather than a message role.
const systemBlocks = [
  {
    type: "text",
    text: personalContext,
    cache_control: { type: "ephemeral" } // Cache this
  },
  {
    type: "text",
    text: systemPrompt,
    cache_control: { type: "ephemeral" } // Cache this too
  }
];

// First request: processes full prompt (~700 tokens)
// Subsequent requests: reads from cache (near-zero cost)
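For providers without native caching, the client-side deduplication mentioned above can be sketched as a session-scoped response cache: an identical request within a session is answered from the cache rather than re-sent. This is a hypothetical illustration of the idea, not Tensor's actual logic:

```typescript
// Session-scoped cache of previous responses.
const responseCache = new Map<string, string>();

// djb2-style string hash as a cache key; a real implementation would
// use a cryptographic hash to avoid collisions.
function cacheKey(model: string, prompt: string): string {
  let h = 5381;
  const s = model + "\u0000" + prompt;
  for (let i = 0; i < s.length; i++) h = ((h * 33) ^ s.charCodeAt(i)) >>> 0;
  return h.toString(16);
}

// Return a cached reply for an identical (model, prompt) pair, otherwise
// make the API call and remember the result.
async function chatDeduped(
  model: string,
  prompt: string,
  call: () => Promise<string>
): Promise<string> {
  const key = cacheKey(model, prompt);
  const hit = responseCache.get(key);
  if (hit !== undefined) return hit; // served from cache, no API cost
  const reply = await call();
  responseCache.set(key, reply);
  return reply;
}
```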
Model Routing
Beyond manual model selection, Tensor includes an experimental automatic model routing feature. When enabled, Tensor analyzes the nature of your query and routes it to the most appropriate model based on task type, cost sensitivity, and your preferences.
A quick factual question might route to a fast, cheap model like GPT-4o-mini. A complex coding task might route to Claude or GPT-4o. A query that includes images might route to Gemini. A sensitive query that should stay local might route to Ollama. The routing logic is a lightweight classifier that runs before the main AI call, adding minimal latency while potentially saving significant cost.
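A heuristic version of that classifier might look like the sketch below. The rules and model names are illustrative, not Tensor's actual routing table:

```typescript
type Query = { text: string; hasImage: boolean; localOnly: boolean };

// Route a query to a model based on simple signals, checked in priority
// order: privacy first, then modality, then task type, then cost.
function routeModel(q: Query): string {
  if (q.localOnly) return "ollama/llama3";      // sensitive: keep on-device
  if (q.hasImage) return "gemini-1.5-pro";      // multi-modal input
  if (/\b(code|debug|function|error)\b/i.test(q.text)) {
    return "claude-3-5-sonnet";                 // coding task
  }
  if (q.text.length < 80) return "gpt-4o-mini"; // short question: cheap and fast
  return "gpt-4o";                              // default for complex prose
}
```

A production router would likely use a small learned classifier rather than keyword rules, but the shape — classify, then dispatch to an adapter — is the same.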
Routing is fully transparent. Tensor always shows you which model handled your query and why it was selected. You can override the routing decision for any individual query or disable automatic routing entirely.
The Future of Multi-Provider
The AI provider landscape will continue to evolve rapidly. New providers will emerge, existing providers will release new models, and pricing will continue to shift. Tensor's architecture is designed to absorb this change. Adding a new provider means implementing one adapter — typically a few hundred lines of code — without touching any other part of the system.
We are also exploring provider composition, where a single query is processed by multiple models and the results are synthesized. Imagine asking a question and getting answers from GPT-4o, Claude, and Gemini, then having a synthesis step that combines the best parts of each response. This multi-model approach could produce better results than any single model alone.
Model agnosticism is not just a feature. It is a philosophy. We believe users should be free to choose their AI provider the same way they choose their search engine or email client. Tensor exists to make that choice frictionless.