Latency-sensitive RAG

Embed queries with a lightweight local model even though your index was built with a heavier cloud provider's model. Get sub-10ms query embeddings without sacrificing retrieval quality.

🔄 Gradual model migration

Move from one embedding provider to another incrementally. No need to re-embed millions of documents at once — translate on the fly during the transition.

🛡️ Provider resilience

Handle rate limits, outages, and provider changes without breaking your retrieval pipeline. Route through local adapters when the cloud is unavailable.

🧪 Model evaluation

Compare how different embedding providers behave on your data without committing to any single one. Sample multiple spaces from a single source model.

🔍 Quality routing

Detect when a query is too complex for your local model and intelligently route it to a stronger provider. Built-in confidence scoring tells you when to escalate.

💰 Cost optimization

Slash embedding API costs by running common queries through a free local model with an adapter, reserving cloud calls for only the hardest cases.

How teams use adapters in practice

Migrate from OpenAI to a local model without downtime

Your Pinecone index has 5M documents embedded with OpenAI text-embedding-3-small. You want to switch to a local model for cost and latency reasons, but re-embedding everything would take days and require downtime.

With an adapter, you embed new queries locally with MiniLM, translate them into OpenAI's space on the fly, and query your existing index directly. Zero downtime, zero re-indexing. Migrate at your own pace.

New query → MiniLM encode → Adapter translate → Query OpenAI index
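Roughly what that flow looks like in code, as a minimal sketch: the `EmbeddingAdapter` class, its `load()` checkpoint name, and `translate()` are hypothetical placeholders for whatever adapter API you use; the local encoding uses sentence-transformers and the lookup goes through the Pinecone client.

```python
# Sketch: query an OpenAI-embedded Pinecone index with locally encoded queries.
# EmbeddingAdapter, its load() checkpoint name, and translate() are hypothetical.
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone

from my_adapters import EmbeddingAdapter  # hypothetical adapter package

encoder = SentenceTransformer("all-MiniLM-L6-v2")            # local 384-dim model
adapter = EmbeddingAdapter.load("minilm-to-openai-3-small")  # hypothetical checkpoint
index = Pinecone(api_key="...").Index("docs")                # existing OpenAI-embedded index

def search(query: str, top_k: int = 5):
    local_vec = encoder.encode(query)           # milliseconds on CPU, no API call
    openai_vec = adapter.translate(local_vec)   # map MiniLM space -> OpenAI text-embedding-3-small space
    return index.query(vector=openai_vec.tolist(), top_k=top_k, include_metadata=True)

hits = search("how do I rotate API keys?")
```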

Hybrid routing: local-first with API fallback

You want most queries handled locally for speed and cost, but some queries are out-of-distribution for your adapter. Use score_source() to check confidence per query, and only call the API when the adapter isn't sure.

Result: 90% of queries resolve locally in under 2 ms; the remaining 10% fall back to the API. Your blended cost drops by 90% and median latency improves roughly 50×.

Query → score_source() → score ≥ 0.99: use adapter | score < 0.99: call API
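A sketch of that routing logic. score_source() is the confidence check described above, but its exact signature is assumed here (taking the local vector and returning a score in [0, 1]); the `EmbeddingAdapter` wrapper is a placeholder, and the 0.99 threshold mirrors the flow above.

```python
# Sketch: local-first embedding with API fallback based on adapter confidence.
# Assumes score_source(vec) returns a confidence in [0, 1]; the real signature may differ.
from openai import OpenAI
from sentence_transformers import SentenceTransformer

from my_adapters import EmbeddingAdapter  # hypothetical adapter package

THRESHOLD = 0.99

encoder = SentenceTransformer("all-MiniLM-L6-v2")
adapter = EmbeddingAdapter.load("minilm-to-openai-3-small")  # hypothetical checkpoint
openai_client = OpenAI()

def embed_query(query: str) -> list[float]:
    local_vec = encoder.encode(query)
    if adapter.score_source(local_vec) >= THRESHOLD:
        # Adapter is confident the query is in-distribution: translate locally (~2 ms).
        return adapter.translate(local_vec).tolist()
    # Out-of-distribution for the adapter: pay for one cloud embedding call instead.
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=query)
    return resp.data[0].embedding
```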

Multi-provider interoperability

Different teams in your org use different embedding models — one team uses BGE, another uses OpenAI, a third uses E5. With adapters, you can define a single "canonical" embedding space and translate everything into it.

Treat embedding space as a contract, not an implementation detail. Each team keeps their model; adapters handle the translation layer.
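One way to express that contract in code, as a sketch: the adapter names and the `EmbeddingAdapter` API are hypothetical, and picking OpenAI text-embedding-3-small as the canonical space is just an example of the choice an org would make once.

```python
# Sketch: translate each team's embeddings into one org-wide canonical space.
# Adapter names and the EmbeddingAdapter API are hypothetical placeholders.
from my_adapters import EmbeddingAdapter  # hypothetical adapter package

# The canonical space is the contract; here it is OpenAI text-embedding-3-small.
TO_CANONICAL = {
    "bge-base-en-v1.5": EmbeddingAdapter.load("bge-to-openai-3-small"),
    "e5-base-v2": EmbeddingAdapter.load("e5-to-openai-3-small"),
    "text-embedding-3-small": None,  # already canonical, no translation needed
}

def to_canonical(vec, source_model: str):
    adapter = TO_CANONICAL[source_model]
    return vec if adapter is None else adapter.translate(vec)

# Each team keeps embedding with its own model; everything lands in one shared space.
```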