Adapters aren't meant to fully replace native embeddings. They're meant to handle the 90% of queries where translation is reliable — and tell you about the other 10%.
Every adapter ships with a built-in confidence scorer (score_source()) that tells you, per query, whether the translation is trustworthy. The benchmarks on this page are designed to answer one question: when the scorer says "handle this locally," how often is it right?
The answer is: almost always. On SQuAD retrieval, queries that pass the confidence threshold retain 93% of native OpenAI quality. Queries that don't pass get routed to the API instead. The result is a hybrid system that's nearly as accurate as running every query through OpenAI — but 90% cheaper and 50× faster on most requests.
All benchmarks below use MiniLM (all-MiniLM-L6-v2) as the source model translating into OpenAI text-embedding-3-small space, evaluated on 10,000 SQuAD v1.1 question–answer pairs. The "large" adapter flavor is used throughout.
Recall@10 on in-distribution queries, compared to running everything through the OpenAI API natively.
When score_source() says "confident," the adapter's top-10 matches the API's 96% of the time.
Each method embeds 8,997 filtered question–answer pairs. We query with question embeddings and measure how often the true answer appears in the top K results.
SQuAD train[:10000], quality threshold ≥ 0.99, 8,997 pairs after filtering
| Method | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|
| OpenAI → OpenAI | | | |
| Adapter → OpenAI | | | |
| Adapter → Adapter | | | |
| ST base → ST base | | | |
Key takeaway: The adapter achieves 93% of native OpenAI Recall@10 while running entirely locally on MiniLM. It also outperforms the base SentenceTransformer by 69% at Recall@10, demonstrating that translation into a richer vector space produces meaningfully better retrieval even without calling the OpenAI API.
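For reference, the Recall@K measurement amounts to a nearest-neighbour lookup over the embedded pairs. Below is a minimal sketch, assuming pre-computed, L2-normalized question and answer embedding matrices; the function and variable names are illustrative, not the actual benchmark harness.

```python
import numpy as np

def recall_at_k(question_embs: np.ndarray, answer_embs: np.ndarray, k: int = 10) -> float:
    # Cosine similarity reduces to a dot product on L2-normalized vectors.
    sims = question_embs @ answer_embs.T                      # (N, N) similarity matrix
    # Indices of the k most similar answers for every question (unordered within the top k).
    top_k = np.argpartition(-sims, kth=k - 1, axis=1)[:, :k]
    # A hit means the paired answer (same row index) appears among the top k.
    hits = (top_k == np.arange(len(question_embs))[:, None]).any(axis=1)
    return float(hits.mean())
```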
Every adapter exposes a score_source() method that returns a confidence value indicating how in-distribution an input is for that particular adapter.
Confidence scores on the full 10,000 SQuAD pairs before filtering. Higher scores indicate the input is well-represented by the adapter's training data.
The adapter learns a distribution boundary during training. At inference time, it evaluates how close the input embedding lies to the region of the source model's space covered by the adapter's training data.
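A minimal usage sketch follows. Only score_source() is documented on this page; the adapter loader shown is a hypothetical placeholder for however you construct your adapter in practice.

```python
from sentence_transformers import SentenceTransformer

source_model = SentenceTransformer("all-MiniLM-L6-v2")
adapter = load_adapter("minilm-to-openai-text-embedding-3-small")  # hypothetical loader

query = "When did the Normans conquer England?"
source_emb = source_model.encode(query)

# Documented behaviour: score_source() returns a per-input confidence,
# with higher values meaning the input is more in-distribution for this adapter.
confidence = adapter.score_source(source_emb)
print(confidence)  # e.g. 0.997, i.e. safe to handle locally
```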
Performance difference when quality filtering is applied. The adapter's confidence scores reliably predict retrieval quality.
Applying a quality threshold of ≥ 0.99 removes 10% of pairs but ensures the adapter operates in its reliable regime.
Implication: Quality scoring is a reliable gatekeeper. In-distribution inputs retain 93% of native performance, while out-of-distribution inputs drop to ~53%. Use score_source() to route low-confidence queries to the API as a fallback — a hybrid strategy that combines local speed with API-grade accuracy.
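A routing sketch under those assumptions is shown below. The adapter loader and the translate() call are illustrative placeholders; score_source(), the 0.99 threshold, and the OpenAI fallback are as described above.

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
source_model = SentenceTransformer("all-MiniLM-L6-v2")
adapter = load_adapter("minilm-to-openai-text-embedding-3-small")  # hypothetical loader
CONFIDENCE_THRESHOLD = 0.99

def embed_query(text: str):
    source_emb = source_model.encode(text)
    if adapter.score_source(source_emb) >= CONFIDENCE_THRESHOLD:
        # In-distribution: translate locally, no API call.
        return adapter.translate(source_emb)  # hypothetical translation call
    # Low confidence: fall back to a native embedding from the API.
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding
```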
The key question for any hybrid system: when the adapter says it's confident, is it actually right? We measured how well score_source() separates good translations from bad ones.
Classifying each query as "route locally" (score ≥ 0.99) or "escalate to API" (score < 0.99), then measuring whether that decision was correct
| Metric | Value | Meaning |
|---|---|---|
| True positive rate | | Queries scored ≥ 0.99 that matched OpenAI's top-10 result |
| True negative rate | | Queries scored < 0.99 that would have failed locally |
| False confidence rate | 3.8% | Scored high but adapter result didn't match native — rare |
| Unnecessary escalation | 17.6% | Scored low but adapter would have been fine — costs extra, doesn't hurt quality |
Why this matters: The 3.8% false confidence rate means that for every 1,000 queries routed locally, only ~38 will produce a worse result than the API. Meanwhile, the 17.6% unnecessary escalation rate is a conservative bias — it sends extra queries to the API "just in case," which costs more but never hurts accuracy. This makes the scorer a safe default for production routing.
Lower the threshold to route more queries locally (saving cost), or raise it for maximum accuracy. The confidence scorer gives you a smooth dial between the two.
How the split between local and API routing changes as you adjust the score_source() threshold
| Threshold | % routed locally | Local Recall@10 | Effective Recall@10 |
|---|---|---|---|
| ≥ 0.999 | | | |
| ≥ 0.99 | | | |
| ≥ 0.95 | | | |
| ≥ 0.90 | | | |
| No filter | | | |
Effective Recall@10 is the blended metric: queries above the threshold use the adapter result, queries below use the OpenAI API result. At the recommended threshold of 0.99, you route 90% of traffic locally and still achieve 97% of fully-native OpenAI quality.
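The blend is a simple weighted average; the one-line helper below makes the relationship explicit (names are illustrative).

```python
def effective_recall(local_share: float, local_recall: float, api_recall: float) -> float:
    # Queries above the threshold use the adapter result, the rest use the native API result.
    return local_share * local_recall + (1.0 - local_share) * api_recall
```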
Real numbers for a system processing 1M queries/month using the hybrid routing strategy with a 0.99 confidence threshold.
Measured on a single CPU core (Intel i7-12700), no GPU, batch size 1. The adapter adds negligible overhead to the base embedding time.
Blended P50 latency: With 90% of queries completing in ~1.6ms and 10% at ~120ms, the median query latency is ~1.6ms and P99 is ~150ms. Compare this to a pure API approach where every query takes 80–200ms.
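A rough sanity check of those blended figures, treating each path as a single representative latency. Real latencies are distributions, so the P99 reflects the API tail rather than this simplification.

```python
import numpy as np

local_ms, api_ms = 1.6, 120.0                           # representative per-path latencies
latencies = np.array([local_ms] * 90 + [api_ms] * 10)   # 90% local, 10% escalated

print(np.percentile(latencies, 50))  # ~1.6 ms: the median falls inside the local mass
print(latencies.mean())              # ~13.4 ms: the mean is dominated by the API tail
```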
Estimated monthly embedding API spend with and without adapter routing, assuming a 0.99 confidence threshold (90% routed locally). Based on OpenAI text-embedding-3-small pricing at $0.02 / 1M tokens, ~15 tokens per query average.
Adapter routing at threshold ≥ 0.99 — 90% of queries handled locally at $0, 10% fall back to API
| Monthly queries | 100% API | With routing | Savings |
|---|---|---|---|
| 100K | | | |
| 1M | | | |
| 10M | | | |
| 100M | | | |
| 1B | | | |
Note on API costs: The embedding API costs above reflect only the per-token charges for OpenAI text-embedding-3-small ($0.02 / 1M tokens). Real-world costs may be higher when factoring in rate-limit-induced retries, batch processing overhead, and network egress. The savings percentage stays constant at 90% regardless of volume — routing is a linear multiplier.
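The per-token arithmetic behind the table uses only the stated assumptions ($0.02 per 1M tokens, ~15 tokens per query, 90% routed locally at threshold 0.99); as noted above, real bills can be higher once retries and overhead are included.

```python
PRICE_PER_M_TOKENS = 0.02   # USD, text-embedding-3-small
TOKENS_PER_QUERY = 15       # average, per the assumption above
LOCAL_SHARE = 0.90          # fraction of queries handled locally at threshold 0.99

def monthly_api_cost(queries: int, with_routing: bool) -> float:
    api_queries = queries * (1 - LOCAL_SHARE) if with_routing else queries
    return api_queries * TOKENS_PER_QUERY / 1_000_000 * PRICE_PER_M_TOKENS

for q in (100_000, 1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    print(f"{q:>13,}  API-only: ${monthly_api_cost(q, False):,.2f}"
          f"   with routing: ${monthly_api_cost(q, True):,.2f}")
```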
The real savings aren't just dollars. At high volumes, the bottleneck shifts from cost to throughput. OpenAI rate-limits text-embedding-3-small at 5,000 RPM on most tiers. With adapter routing, you only consume 10% of that quota — effectively giving you 10× the headroom before hitting limits. For burst workloads like batch ingestion or real-time search, this can be the difference between queuing and serving.
Cases where the adapter retrieves the correct answer but at least one other method fails — showing the value of cross-space translation.
We evaluate embedding adapters on a factual Q&A retrieval task using the Stanford Question Answering Dataset (SQuAD).
The dataset is the SQuAD train[:10000] split, sourced from Wikipedia articles. Quality filtering applies adapter.score_source() to both the question and answer embeddings; only pairs where both scores are ≥ 0.99 are retained (8,997 pairs, 90% pass rate).