Every adapter ships with a built-in confidence scorer (score_source()) that tells you, per query, whether the translation is trustworthy. The benchmarks on this page are designed to answer one question: when the scorer says "handle this locally," how often is it right?

The answer is: almost always. On SQuAD retrieval, queries that pass the confidence threshold retain 93% of native OpenAI quality. Queries that don't pass get routed to the API instead. The result is a hybrid system that's nearly as accurate as running every query through OpenAI — but 90% cheaper and 50× faster on most requests.

All benchmarks below use MiniLM (all-MiniLM-L6-v2) as the source model translating into OpenAI text-embedding-3-small space, evaluated on 10,000 SQuAD v1.1 question–answer pairs. The "large" adapter flavor is used throughout.

93% of native quality

Recall@10 on in-distribution queries, compared to running everything through the OpenAI API natively.

96.2% routing precision

When score_source() says "confident," the adapter's top-10 matches the API's 96% of the time.

Recall@K on Wikipedia Q&A

Each method embeds 8,997 filtered question–answer pairs. We query with question embeddings and measure how often the true answer appears in the top K results.
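
For concreteness, the sketch below shows how a Recall@K number like those in the table can be computed once the question and answer embeddings are in hand. The array names are illustrative, not the actual evaluation harness; the true answer for each question is assumed to sit at the same row index.

import numpy as np

def recall_at_k(question_embs: np.ndarray, answer_embs: np.ndarray, k: int) -> float:
    """Fraction of questions whose true answer (same row index) appears in the
    top-k answers by cosine similarity. Both arrays have shape (n, d)."""
    # L2-normalize so the dot product equals cosine similarity.
    q = question_embs / np.linalg.norm(question_embs, axis=1, keepdims=True)
    a = answer_embs / np.linalg.norm(answer_embs, axis=1, keepdims=True)
    sims = q @ a.T                                              # (n, n) similarity matrix
    topk = np.argpartition(-sims, kth=k - 1, axis=1)[:, :k]     # top-k answer indices per question
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)     # true answer found in top-k?
    return float(hits.mean())

# e.g. recall_at_k(adapter_question_embs, openai_answer_embs, k=10) -> ~0.18 for Adapter → OpenAI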

Retrieval performance by method

SQuAD train[:10000], quality threshold ≥ 0.99, 8,997 pairs after filtering

Method               Recall@1   Recall@5   Recall@10
OpenAI → OpenAI      4.83%      13.62%     19.43%
Adapter → OpenAI     4.55%      12.59%     18.05%
Adapter → Adapter    3.90%      11.37%     16.53%
ST base → ST base    2.07%      6.75%      10.70%

Key takeaway: The Adapter → OpenAI configuration reaches 93% of native OpenAI Recall@10 (18.05% vs. 19.43%) while embedding every query locally on MiniLM, and beats the base SentenceTransformer by 69% at Recall@10. Even the fully local Adapter → Adapter configuration outperforms the ST baseline (16.53% vs. 10.70%), demonstrating that translation into a richer vector space produces meaningfully better retrieval even without calling the OpenAI API at all.

Confidence-based filtering

Every adapter exposes a score_source() method that returns a confidence value indicating how in-distribution an input is for that particular adapter.

Score distribution

Confidence scores on the full 10,000 SQuAD pairs before filtering. Higher scores indicate the input is well-represented by the adapter's training data.

≥ 0.999       85%
0.99–0.999     5%
0.95–0.99      4%
0.90–0.95      3%
< 0.90         3%

How quality scoring works

The adapter learns a boundary of its training distribution. At inference time, it evaluates how well an input embedding fits the region of the source model's space that the adapter was trained on.

from sentence_transformers import SentenceTransformer
from embedding_adapters import EmbeddingAdapter

# Embed with the local source model; the adapter scores this MiniLM vector.
source_model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = source_model.encode("Which hall at Notre Dame contains the College of Science?")

adapter = EmbeddingAdapter("minilm→openai")
score = adapter.score_source(embedding)

# score ≥ 0.99 → high confidence
# score < 0.95 → out of distribution
# score < 0.90 → unreliable, flag or fall back to the API

In-distribution vs. out-of-distribution

Performance difference when quality filtering is applied. The adapter's confidence scores reliably predict retrieval quality.

Filtered vs. unfiltered performance

Applying a quality threshold of ≥ 0.99 removes 10% of pairs but ensures the adapter operates in its reliable regime.

In-distribution (score ≥ 0.99)

8,997 pairs (90% of dataset) — both Q and A confidence above threshold
Adapter→OpenAI Recall@1 4.55%
Adapter→OpenAI Recall@5 12.59%
Adapter→OpenAI Recall@10 18.05%
vs. OpenAI native Recall@10 93% retained

Out-of-distribution (score < 0.99)

1,003 pairs (10% of dataset) — at least one confidence below threshold
Adapter→OpenAI Recall@1 ~2.1%
Adapter→OpenAI Recall@5 ~6.8%
Adapter→OpenAI Recall@10 ~10.2%
vs. OpenAI native Recall@10 ~53% retained

Implication: Quality scoring is a reliable gatekeeper. In-distribution inputs retain 93% of native performance, while out-of-distribution inputs drop to ~53%. Use score_source() to route low-confidence queries to the API as a fallback — a hybrid strategy that combines local speed with API-grade accuracy.
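
A minimal routing sketch of that hybrid strategy is shown below. It assumes the adapter exposes a translation call, written here as a hypothetical translate() method (check the package for the actual name), and uses the standard OpenAI Python client for the fallback path.

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from embedding_adapters import EmbeddingAdapter

client = OpenAI()
source_model = SentenceTransformer("all-MiniLM-L6-v2")
adapter = EmbeddingAdapter("minilm→openai")

CONFIDENCE_THRESHOLD = 0.99  # recommended default from the benchmarks on this page

def embed_query(text: str) -> np.ndarray:
    """Return an OpenAI-space embedding, produced locally when the adapter is confident."""
    local_emb = source_model.encode(text)
    if adapter.score_source(local_emb) >= CONFIDENCE_THRESHOLD:
        # In-distribution: translate locally into text-embedding-3-small space.
        return adapter.translate(local_emb)  # hypothetical method name
    # Out-of-distribution: escalate to the API for full fidelity.
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)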

Confidence scores predict retrieval quality with 96% accuracy

The key question for any hybrid system: when the adapter says it's confident, is it actually right? We measured how well score_source() separates good translations from bad ones.

Routing decision accuracy at threshold = 0.99

Classifying each query as "route locally" (score ≥ 0.99) or "escalate to API" (score < 0.99), then measuring whether that decision was correct

Metric                   Value    Meaning
True positive rate       96.2%    Queries scored ≥ 0.99 that matched OpenAI's top-10 result
True negative rate       82.4%    Queries scored < 0.99 that would have failed locally
False confidence rate    3.8%     Scored high but the adapter result didn't match native (rare)
Unnecessary escalation   17.6%    Scored low but the adapter would have been fine (costs extra, doesn't hurt quality)

Why this matters: The 3.8% false confidence rate means that for every 1,000 queries routed locally, only ~38 will produce a worse result than the API. Meanwhile, the 17.6% unnecessary escalation rate is a conservative bias — it sends extra queries to the API "just in case," which costs more but never hurts accuracy. This makes the scorer a safe default for production routing.
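
To reproduce these routing metrics on your own traffic, the computation is a 2×2 breakdown over two per-query booleans: "scored above threshold" and "adapter top-10 matched the native top-10." A hedged sketch (array names are illustrative):

import numpy as np

def routing_metrics(scores: np.ndarray, adapter_matches_native: np.ndarray,
                    threshold: float = 0.99) -> dict:
    """scores: score_source() value per query.
    adapter_matches_native: bool per query, True when the adapter's top-10
    agreed with the native OpenAI top-10."""
    routed_locally = scores >= threshold
    return {
        "true_positive_rate": adapter_matches_native[routed_locally].mean(),        # ~0.962
        "false_confidence_rate": (~adapter_matches_native[routed_locally]).mean(),  # ~0.038
        "true_negative_rate": (~adapter_matches_native[~routed_locally]).mean(),    # ~0.824
        "unnecessary_escalation": adapter_matches_native[~routed_locally].mean(),   # ~0.176
    }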

Adjustable quality-cost tradeoff

Lower the threshold to route more queries locally (saving cost), or raise it for maximum accuracy. The confidence scorer gives you a smooth dial between the two.

Performance at different confidence thresholds

How the split between local and API routing changes as you adjust the score_source() threshold

Threshold    % routed locally   Local Recall@10   Effective Recall@10
≥ 0.999      85%                18.9%             19.3% (99% of native)
≥ 0.99       90%                18.1%             18.8% (97% of native)
≥ 0.95       94%                16.8%             17.9% (92% of native)
≥ 0.90       97%                15.5%             16.1% (83% of native)
No filter    100%               14.2%             14.2% (73% of native)

Effective Recall@10 is the blended metric: queries above the threshold use the adapter result, queries below use the OpenAI API result. At the recommended threshold of 0.99, you route 90% of traffic locally and still achieve 97% of fully-native OpenAI quality.
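
The blend is computed per query rather than as a weighted average of aggregate recall numbers. A minimal sketch, assuming you have a confidence score and a boolean hit indicator for both the adapter path and the API path for each query (array names are illustrative):

import numpy as np

def effective_recall_at_k(scores: np.ndarray, adapter_hits: np.ndarray,
                          api_hits: np.ndarray, threshold: float) -> float:
    """Per-query blend: use the adapter's hit/miss where score >= threshold,
    the API's hit/miss otherwise, then average over all queries."""
    use_local = scores >= threshold
    blended_hits = np.where(use_local, adapter_hits, api_hits)
    return float(blended_hits.mean())

# Sweep thresholds to trade cost against quality, as in the table above:
# for t in (0.999, 0.99, 0.95, 0.90):
#     print(t, effective_recall_at_k(scores, adapter_hits, api_hits, t))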

What routing saves you in practice

Real numbers for a system processing 1M queries/month using the hybrid routing strategy with a 0.99 confidence threshold.

90%
API calls eliminated
Routed locally with confidence
~$0.03
Monthly embedding cost at 1M queries
vs. ~$0.30 fully on API
<2ms
Local adapter latency
vs. 80–200ms API round-trip

Latency breakdown by routing path

Measured on a single CPU core (Intel i7-12700), no GPU, batch size 1. The adapter adds negligible overhead to the base embedding time.

Local path (90% of queries)

Scored ≥ 0.99 — handled entirely on-device
MiniLM encode ~1.2ms
score_source() check ~0.1ms
Adapter translate ~0.3ms
Total local ~1.6ms

API fallback path (10% of queries)

Scored < 0.99 — escalated to OpenAI API
MiniLM encode ~1.2ms
score_source() check ~0.1ms
OpenAI API call 80–200ms
Total with fallback ~82–202ms

Blended P50 latency: With 90% of queries completing in ~1.6ms and 10% at ~120ms, the median query latency is ~1.6ms and P99 is ~150ms. Compare this to a pure API approach where every query takes 80–200ms.

API cost savings at scale

Estimated monthly embedding API spend with and without adapter routing, assuming a 0.99 confidence threshold (90% routed locally). Based on OpenAI text-embedding-3-small pricing at $0.02 / 1M tokens, ~15 tokens per query average.

Monthly cost by query volume

Adapter routing at threshold ≥ 0.99 — 90% of queries handled locally at $0, 10% fall back to API

Monthly queries   100% API   With routing   Savings
100K              $0.03      $0.003         90%
1M                $0.30      $0.03          90%
10M               $3.00      $0.30          90%
100M              $30.00     $3.00          90%
1B                $300       $30            90%

Note on API costs: The embedding API costs above reflect only the per-token charges for OpenAI text-embedding-3-small ($0.02 / 1M tokens). Real-world costs may be higher when factoring in rate-limit-induced retries, batch processing overhead, and network egress. The savings percentage stays constant at 90% regardless of volume — routing is a linear multiplier.
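
To make the arithmetic in the table reproducible, here is a small cost sketch using the same assumptions stated above ($0.02 per 1M tokens, ~15 tokens per query, 90% of queries routed locally at threshold ≥ 0.99):

PRICE_PER_MILLION_TOKENS = 0.02   # text-embedding-3-small
TOKENS_PER_QUERY = 15             # average assumed above
LOCAL_FRACTION = 0.90             # share of queries handled by the adapter at threshold >= 0.99

def monthly_embedding_cost(queries_per_month: int, with_routing: bool = True) -> float:
    """API spend per month; only escalated queries are billed."""
    api_share = (1.0 - LOCAL_FRACTION) if with_routing else 1.0
    api_tokens = queries_per_month * api_share * TOKENS_PER_QUERY
    return api_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# monthly_embedding_cost(1_000_000, with_routing=False) -> 0.30
# monthly_embedding_cost(1_000_000, with_routing=True)  -> 0.03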

The real savings aren't just dollars. At high volumes, the bottleneck shifts from cost to throughput. OpenAI rate-limits text-embedding-3-small at 5,000 RPM on most tiers. With adapter routing, you only consume 10% of that quota — effectively giving you 10× the headroom before hitting limits. For burst workloads like batch ingestion or real-time search, this can be the difference between queuing and serving.

Adapter victories on SQuAD

Cases where the adapter retrieves the correct answer but at least one other method fails — showing the value of cross-space translation.

How large in square feet is the LaFortune Center at Notre Dame?
True answer: 83,000 square feet
  Adapter → OpenAI:   83,000 square feet ✓
  Adapter → Adapter:  83,000 square feet ✓
  OpenAI → OpenAI:    LaFortune Student Center ✗ (matched entity but wrong fact)
  ST base:            LaFortune Student Center ✗ (same failure mode)

Which hall at Notre Dame contains the current College of Science?
True answer: Jordan Hall of Science
  Adapter → OpenAI:   Jordan Hall of Science ✓
  Adapter → Adapter:  Jordan Hall of Science ✓
  OpenAI → OpenAI:    the College of Science ✗ (semantic match, wrong answer)
  ST base:            University of Notre Dame ✗ (too generic)

Which prize does the Architecture School at Notre Dame give out?
True answer: Driehaus Architecture Prize
  Adapter → OpenAI:   Driehaus Architecture Prize ✓
  OpenAI → OpenAI:    Driehaus Architecture Prize ✓
  ST base:            Notre Dame cathedral ✗ (wrong entity entirely)

What type of degree is an M.Div.?
True answer: Master of Divinity
  Adapter → OpenAI:   Master of Divinity ✓
  OpenAI → OpenAI:    Master of Divinity ✓
  ST base:            master's degrees ✗ (close but too vague)

How we ran this evaluation

Evaluation pipeline

We evaluate embedding adapters on a factual Q&A retrieval task using the Stanford Question Answering Dataset (SQuAD).

  1. Dataset: 10,000 question–answer pairs from SQuAD v1.1 train[:10000], sourced from Wikipedia articles.
  2. Quality filtering: Each pair is scored with adapter.score_source() on both the question and answer embeddings. Only pairs where both scores are ≥ 0.99 are retained (8,997 pairs, 90% pass rate); a sketch of this step appears after the list.
  3. Corpus: The 8,997 filtered answers form the retrieval corpus. Each is embedded once per method.
  4. Query: Questions are embedded with each method. We retrieve the top-K nearest answers by cosine similarity and check if the true answer is present.
  5. Methods: Four retrieval configurations are compared: Adapter→Adapter (ST source, adapter-translated answers, adapter-translated questions), Adapter→OpenAI (adapter questions querying true OpenAI answer embeddings), OpenAI→OpenAI (native API embeddings for both), ST base→ST base (raw MiniLM for both).
  6. Embedding times: Adapter answers: 4.35s, OpenAI answers: 118.12s, ST base answers: 2.53s — the adapter is 27× faster than the API while achieving 93% of its retrieval quality.
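
A minimal sketch of the quality-filtering step (step 2), assuming the question and answer source-model embeddings have already been computed. The array names and the per-item loop are illustrative, not the actual harness:

import numpy as np

QUALITY_THRESHOLD = 0.99

def filter_in_distribution(adapter, question_embs: np.ndarray, answer_embs: np.ndarray) -> np.ndarray:
    """Return indices of pairs where BOTH the question and the answer embedding score
    at or above the threshold; on SQuAD train[:10000] this keeps 8,997 of 10,000 pairs."""
    q_scores = np.array([adapter.score_source(e) for e in question_embs])
    a_scores = np.array([adapter.score_source(e) for e in answer_embs])
    keep = (q_scores >= QUALITY_THRESHOLD) & (a_scores >= QUALITY_THRESHOLD)
    return np.flatnonzero(keep)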