Benchmark Results

Adapted models evaluated on real RAG retrieval against openai/text-embedding-3-large.

97% of provider quality (zero API calls)
99% cost savings (vs provider direct)
50 ms query latency (local GPU)
18K tokens per second (single GPU)

Overview

We benchmark adapters on standard retrieval datasets — Natural Questions (factoid Q&A) and HotpotQA (multi-hop reasoning). Adapted queries search a corpus embedded with OpenAI text-embedding-3-large — simulating a real production RAG setup where your index was built with a commercial provider.

The core finding: Adapters produce embeddings that retrieve nearly as well as the original provider — at a fraction of the cost and latency. On Natural Questions, Qwen3-0.6B→TE3 achieves MRR@10 of 0.934 at quality=0 (zero provider calls) vs OpenAI's 0.960. That's 97% of the quality at 0.3% of the cost.
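The two headline ratios can be reproduced from the Natural Questions figures reported on this page. The snippet below is only a worked arithmetic check; the variable names are illustrative and not part of any API.

```python
# Figures copied from the Natural Questions results on this page.
adapter_mrr = 0.934        # Qwen3-0.6B→TE3, MRR@10 at quality=0
provider_mrr = 0.960       # openai/text-embedding-3-large, MRR@10
adapter_cost = 0.000047    # USD, query embedding cost at quality=0
provider_cost = 0.014080   # USD, same queries sent to the provider

quality_ratio = adapter_mrr / provider_mrr
cost_ratio = adapter_cost / provider_cost

print(f"quality retained: {quality_ratio:.1%}")
print(f"cost relative to provider: {cost_ratio:.2%}")
```

The quality ratio rounds to 97% and the cost ratio to roughly 0.3%, matching the claim above.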

Where adapters excel: On well-formed factual queries (NQ), adapters beat the raw MiniLM baseline at quality=0 — meaning the adapter actually improves retrieval over using MiniLM alone. On harder multi-hop queries (HotpotQA), the adapters start lower but the quality routing dial smoothly closes the gap.

Qwen3 vs MiniLM: Qwen3-0.6B→TE3 matches or exceeds MiniLM→TE3 on MRR@10 at every threshold in the sweeps below (by up to about 0.03 on HotpotQA), and reaches provider parity at lower quality thresholds. The tradeoff is speed: MiniLM runs at 18K tok/s vs Qwen3's 1.2K tok/s, about 15× faster. Choose MiniLM for throughput, Qwen3 for accuracy.
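To make the throughput tradeoff concrete, here is a rough estimate of wall-clock embedding time at the two quoted rates. The 10-million-token workload is a hypothetical example, and the calculation assumes throughput stays constant, which real batches may not.

```python
def seconds_to_embed(total_tokens: int, tokens_per_second: float) -> float:
    """Estimated time to embed a workload at a constant throughput."""
    return total_tokens / tokens_per_second

MINILM_TOK_S = 18_000   # tok/s quoted above for MiniLM
QWEN3_TOK_S = 1_200     # tok/s quoted above for Qwen3

corpus_tokens = 10_000_000  # hypothetical workload size

minilm_s = seconds_to_embed(corpus_tokens, MINILM_TOK_S)
qwen3_s = seconds_to_embed(corpus_tokens, QWEN3_TOK_S)

print(f"MiniLM: {minilm_s / 60:.1f} min, Qwen3: {qwen3_s / 60:.1f} min")
print(f"speed ratio: {qwen3_s / minilm_s:.0f}x")
```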

Understanding Quality Calibration

The quality parameter is a threshold from 0 to 100. Every text gets a confidence score from the adapter's neural quality head. Texts scoring below the threshold get re-embedded by the original provider.
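The routing rule described above can be sketched as a simple split. This is an illustration, not the service's implementation: the function name, the sample texts, and the confidence scores are all hypothetical, and it assumes scores lie in the range 0 to 100 with "below the threshold" meaning a strict comparison.

```python
from typing import List, Sequence, Tuple

def split_by_quality(
    texts: Sequence[str], scores: Sequence[float], quality: float
) -> Tuple[List[str], List[str]]:
    """Split texts into (kept local, routed to provider).

    A text is routed when its confidence score is strictly below
    the quality threshold; otherwise the adapter embedding is kept.
    """
    if len(texts) != len(scores):
        raise ValueError("texts and scores must have the same length")
    local: List[str] = []
    routed: List[str] = []
    for text, score in zip(texts, scores):
        (routed if score < quality else local).append(text)
    return local, routed

# Hypothetical confidence scores for three queries.
texts = ["who wrote hamlet", "compare the two treaties", "capital of peru"]
scores = [92.0, 41.5, 67.0]

local, routed = split_by_quality(texts, scores, quality=50)
print(local)   # texts embedded by the adapter
print(routed)  # texts re-embedded by the provider
```

With quality=0 nothing is routed, since no score is below zero, which matches the "zero provider calls" setting.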

The key insight: You want to find the lowest quality threshold where retrieval quality meets your requirements. Lower threshold = more local queries = lower cost. The calibrate endpoint (/v1/quality/calibrate) analyzes your actual data and recommends the right setting — so you never risk degraded performance below your source model.

How to read the quality dial

| Setting | What happens | Best for |
|---|---|---|
| quality=0 | Everything runs locally. Zero provider calls. Maximum speed, minimum cost. | High-throughput indexing, air-gapped systems, cost-sensitive apps |
| quality=30 | ~8–21% of queries route to provider. Quality improves noticeably on hard queries. | General-purpose search with cost control |
| quality=50 | ~17–56% routed. Significant quality improvement, still massive cost savings (99%+). | Balanced cost/quality for production RAG |
| quality=70 | ~44–82% routed. Near-provider quality on most datasets. | Quality-critical applications |
| quality=100 | Everything goes to the provider. Equivalent to calling OpenAI directly (but cheaper). | Maximum quality, cost still lower than direct |

Routed percentages are the MiniLM-to-Qwen3 range measured on Natural Questions; HotpotQA routes more queries at the same setting.

Finding your safe threshold: Run POST /v1/quality/calibrate with a sample of your actual data. The endpoint returns your quality grade (excellent/good/moderate/poor), the score distribution, and a routing preview at every threshold. It specifically tells you the lowest quality level where you won't degrade below your source model — so you get the maximum cost savings without any retrieval risk.
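A "routing preview" of the kind described here can be approximated locally if you already have confidence scores for a sample. The sketch below is an assumption about what such a preview contains, not the endpoint's actual response format; the function name and the sample scores are hypothetical.

```python
from typing import Dict, Iterable, Sequence

def routing_preview(
    scores: Sequence[float], thresholds: Iterable[int] = range(0, 101, 10)
) -> Dict[int, float]:
    """Fraction of texts that would be routed to the provider at each threshold.

    A text is counted as routed when its score is strictly below the threshold.
    """
    if not scores:
        raise ValueError("need at least one score")
    n = len(scores)
    return {q: sum(s < q for s in scores) / n for q in thresholds}

# Hypothetical confidence scores for a 10-text sample.
sample_scores = [15, 35, 55, 62, 78, 81, 88, 93, 95, 99]

preview = routing_preview(sample_scores)
for q, fraction in preview.items():
    print(f"quality={q:>3}: {fraction:.0%} routed")
```

Pairing a preview like this with retrieval metrics on the same sample is what lets you pick the lowest threshold that still meets your quality bar.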

Natural Questions

100 queries × 1,000 passages — factoid Q&A from Google search

Figure: MRR@10 vs Quality Threshold (NQ)

Adapters beat MiniLM raw (red dashed, 0.843) at quality=0 with zero provider calls. The curves show how MRR improves as you route more queries to OpenAI. Qwen3 comes within 1% of the TE3 line (0.960) at q=40 and reaches it at q=90; MiniLM comes within 1% at q=90 and reaches it at q=100.

Figure: R@1 and R@10 at Key Thresholds (NQ)

R@10 (is the answer in the top 10?) stays above 0.97 for adapters even at q=0 — nearly matching OpenAI's 0.98. R@1 (exact top hit) climbs from 0.90/0.91 at q=0 to 0.94 at q=100.

Figure: % Routed to Provider (NQ)

Shows what fraction of queries the quality head sends to OpenAI at each threshold. At q=30, MiniLM routes just 8% while Qwen3 routes 21%. Qwen3's quality head is stricter, and its MRR@10 is equal to or higher than MiniLM's at every threshold.

Figure: Cost Savings vs OpenAI Direct (NQ)

Savings stay above 98% at every quality level. Even at q=100 (everything routed), it's still cheaper than calling OpenAI directly because of our lower per-token rate.
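The savings figure is one minus the ratio of adapter cost to provider cost. The check below uses a few cost values from the Natural Questions sweep table on this page; the dictionary layout and function name are illustrative only.

```python
OPENAI_COST = 0.014080  # USD, openai/text-embedding-3-large on the NQ queries

# Query embedding cost in USD at selected thresholds, from the NQ sweep table.
nq_cost = {
    "MiniLM→TE3": {0: 0.000076, 50: 0.000100, 100: 0.000204},
    "Qwen3-0.6B→TE3": {0: 0.000047, 50: 0.000121, 100: 0.000175},
}

def savings(adapter_cost: float, provider_cost: float) -> float:
    """Fraction of the provider cost that is saved."""
    return 1.0 - adapter_cost / provider_cost

nq_savings = {
    adapter: {q: savings(cost, OPENAI_COST) for q, cost in by_q.items()}
    for adapter, by_q in nq_cost.items()
}

for adapter, by_q in nq_savings.items():
    for q, s in by_q.items():
        print(f"{adapter} q={q}: {s:.2%} saved")

# Worst case across the listed entries (MiniLM at q=100).
worst = min(s for by_q in nq_savings.values() for s in by_q.values())
```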

Figure: Cost vs Accuracy Tradeoff (NQ)

Left panel zooms into the adapter cost range ($0.00005–0.0002). Right panel shows the full scale — adapters are a tiny dot at the left, OpenAI is the diamond at $0.014. The 78× cost gap is visible at a glance.

Natural Questions — Full Quality Sweep
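
| q | MiniLM→TE3 %OAI | MiniLM→TE3 MRR@10 | MiniLM→TE3 R@1 | MiniLM→TE3 Cost | Qwen3-0.6B→TE3 %OAI | Qwen3-0.6B→TE3 MRR@10 | Qwen3-0.6B→TE3 R@1 | Qwen3-0.6B→TE3 Cost |
|---|---|---|---|---|---|---|---|---|
| 0 | 0% | 0.926 | 0.900 | $0.000076 | 0% | 0.934 | 0.910 | $0.000047 |
| 10 | 0% | 0.926 | 0.900 | $0.000076 | 2% | 0.934 | 0.910 | $0.000050 |
| 20 | 1% | 0.934 | 0.910 | $0.000077 | 10% | 0.942 | 0.920 | $0.000060 |
| 30 | 8% | 0.948 | 0.930 | $0.000087 | 21% | 0.948 | 0.930 | $0.000074 |
| 40 | 13% | 0.948 | 0.930 | $0.000093 | 38% | 0.953 | 0.930 | $0.000097 |
| 50 | 17% | 0.948 | 0.930 | $0.000100 | 56% | 0.953 | 0.930 | $0.000121 |
| 60 | 27% | 0.948 | 0.930 | $0.000114 | 71% | 0.958 | 0.940 | $0.000139 |
| 70 | 44% | 0.948 | 0.930 | $0.000135 | 82% | 0.958 | 0.940 | $0.000153 |
| 80 | 56% | 0.948 | 0.930 | $0.000149 | 90% | 0.958 | 0.940 | $0.000163 |
| 90 | 81% | 0.955 | 0.930 | $0.000180 | 98% | 0.960 | 0.940 | $0.000173 |
| 100 | 100% | 0.960 | 0.940 | $0.000204 | 100% | 0.960 | 0.940 | $0.000175 |

Baselines:
MiniLM raw: MRR=0.843 · R@1=0.810 · R@10=0.940 · $0.000930
openai/text-embedding-3-large: MRR=0.960 · R@1=0.940 · R@10=0.980 · $0.014080
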
NQ takeaway: This is where adapters shine brightest. Adapters beat the MiniLM baseline at quality=0 — the adapter doesn't just translate, it improves retrieval. Qwen3 matches TE3 at q=40, meaning you can get OpenAI-level quality while keeping 62% of queries local. For most production RAG applications on well-formed queries, quality=0 or quality=30 is the sweet spot.

HotpotQA

100 multi-hop queries × 1,190 passages — requires reasoning across multiple documents

Figure: MRR@10 vs Quality Threshold (HotpotQA)

A harder test. Adapters start below MiniLM raw (0.872) at q=0 because multi-hop queries challenge the adapter's translation fidelity. Quality routing closes the gap: Qwen3 crosses the MiniLM line at q=30 and comes within 1% of TE3 (0.917) at q=60.

Figure: R@1 and R@10 at Key Thresholds (HotpotQA)

Qwen3 achieves perfect R@10 (1.000) at every quality level — the correct answer is always in the top 10. R@1 climbs from 0.74 at q=0 to 0.86 at q=100, matching TE3.

Figure: % Routed to Provider (HotpotQA)

The quality head routes more aggressively here — it correctly identifies that multi-hop queries are harder and sends more to OpenAI. This is the intended behavior: adaptive routing based on query difficulty.

Figure: Cost Savings (HotpotQA)

Even with more routing on this harder dataset, savings stay above 98% at every level. The quality head is efficient — it routes the minimum necessary to maintain accuracy.

Figure: Cost vs Accuracy Tradeoff (HotpotQA)

The quality curve shows smooth, predictable improvement. Each step up in quality threshold buys a measurable MRR improvement at a known cost — no surprises.

HotpotQA — Full Quality Sweep
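
| q | MiniLM→TE3 %OAI | MiniLM→TE3 MRR@10 | MiniLM→TE3 R@1 | MiniLM→TE3 Cost | Qwen3-0.6B→TE3 %OAI | Qwen3-0.6B→TE3 MRR@10 | Qwen3-0.6B→TE3 R@1 | Qwen3-0.6B→TE3 Cost |
|---|---|---|---|---|---|---|---|---|
| 0 | 0% | 0.827 | 0.730 | $0.000130 | 0% | 0.835 | 0.740 | $0.000080 |
| 20 | 4% | 0.827 | 0.730 | $0.000139 | 19% | 0.857 | 0.770 | $0.000123 |
| 30 | 22% | 0.847 | 0.770 | $0.000189 | 40% | 0.875 | 0.800 | $0.000177 |
| 50 | 46% | 0.866 | 0.800 | $0.000251 | 76% | 0.898 | 0.840 | $0.000274 |
| 60 | 63% | 0.878 | 0.810 | $0.000298 | 85% | 0.911 | 0.850 | $0.000303 |
| 70 | 78% | 0.904 | 0.850 | $0.000337 | 93% | 0.911 | 0.850 | $0.000326 |
| 80 | 90% | 0.914 | 0.860 | $0.000370 | 100% | 0.917 | 0.860 | $0.000349 |
| 100 | 100% | 0.917 | 0.860 | $0.000399 | 100% | 0.917 | 0.860 | $0.000349 |

Baselines:
MiniLM raw: MRR=0.872 · R@1=0.800 · R@10=0.990
openai/text-embedding-3-large: MRR=0.917 · R@1=0.860 · R@10=1.000 · $0.020179
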
HotpotQA takeaway: This is the harder dataset — multi-hop queries require finding information across multiple documents. At q=0, adapters score below MiniLM raw. This is exactly where quality calibration matters. The calibrate endpoint would detect this and recommend a higher quality threshold (q=30 for Qwen3, q=60 for MiniLM) to ensure you never degrade below your source model. On this dataset, the quality routing dial is essential — and it works smoothly, with each step buying predictable improvement.

Key Thresholds

The quality level where each adapter hits important milestones.

| Milestone | Dataset | MiniLM→TE3 | Qwen3-0.6B→TE3 | What it means |
|---|---|---|---|---|
| Beats source model | NQ | q=0 (0%) | q=0 (0%) | Adapter improves over raw MiniLM with zero routing |
| Beats source model | HotpotQA | q=60 (63%) | q=30 (40%) | Harder dataset needs some routing to beat source |
| Matches TE3 (±1%) | NQ | q=90 (81%) | q=40 (38%) | Qwen3 hits parity with 62% of queries still local |
| Matches TE3 (±1%) | HotpotQA | q=80 (90%) | q=60 (85%) | Adapters reach parity at 98%+ savings |

Reading this table: Lower quality threshold = more queries stay local = cheaper. Qwen3 consistently reaches milestones at lower thresholds — meaning better accuracy for the same cost. MiniLM is 15× faster though, so choose based on whether you're optimizing for throughput or accuracy.
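The milestone thresholds can be re-derived from the MRR@10 columns of the two sweep tables. The sketch below assumes "beats source" means reaching at least the MiniLM raw MRR, and that "matches TE3 (±1%)" means reaching at least 99% of the provider's MRR; both readings are assumptions, and the helper name is hypothetical.

```python
from typing import Dict, Optional

# MRR@10 by quality threshold, copied from the sweep tables on this page.
SWEEPS = {
    "NQ": {
        "source_mrr": 0.843,    # MiniLM raw
        "provider_mrr": 0.960,  # openai/text-embedding-3-large
        "MiniLM→TE3": {0: 0.926, 10: 0.926, 20: 0.934, 30: 0.948, 40: 0.948,
                       50: 0.948, 60: 0.948, 70: 0.948, 80: 0.948, 90: 0.955,
                       100: 0.960},
        "Qwen3-0.6B→TE3": {0: 0.934, 10: 0.934, 20: 0.942, 30: 0.948, 40: 0.953,
                           50: 0.953, 60: 0.958, 70: 0.958, 80: 0.958, 90: 0.960,
                           100: 0.960},
    },
    "HotpotQA": {
        "source_mrr": 0.872,
        "provider_mrr": 0.917,
        "MiniLM→TE3": {0: 0.827, 20: 0.827, 30: 0.847, 50: 0.866, 60: 0.878,
                       70: 0.904, 80: 0.914, 100: 0.917},
        "Qwen3-0.6B→TE3": {0: 0.835, 20: 0.857, 30: 0.875, 50: 0.898, 60: 0.911,
                           70: 0.911, 80: 0.917, 100: 0.917},
    },
}

def lowest_threshold(curve: Dict[int, float], target: float) -> Optional[int]:
    """Smallest quality threshold whose MRR@10 reaches the target, if any."""
    for q in sorted(curve):
        if curve[q] >= target:
            return q
    return None

milestones = {}
for dataset, data in SWEEPS.items():
    for adapter in ("MiniLM→TE3", "Qwen3-0.6B→TE3"):
        curve = data[adapter]
        milestones[(dataset, adapter)] = {
            "beats_source": lowest_threshold(curve, data["source_mrr"]),
            "within_1pct_of_provider": lowest_threshold(
                curve, 0.99 * data["provider_mrr"]
            ),
        }

for key, value in milestones.items():
    print(key, value)
```

Under those two assumptions the computed thresholds agree with the table above.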

The Full Picture — Cost vs Quality

Both datasets, log-scale cost axis. Adapters clustered at the left, OpenAI isolated at 50–78× the price.

Figure: Hero chart

Methodology

Setup: Corpus documents are embedded once with OpenAI text-embedding-3-large (3072 dimensions). Queries are embedded with each adapter at every quality threshold (0–100 in steps of 10). Retrieval uses cosine similarity.
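The retrieval step can be sketched in a few lines. This is a minimal pure-Python illustration of cosine-similarity ranking, not the evaluation script; the toy 3-dimensional vectors and passage IDs are made up, whereas the real embeddings have 3072 dimensions and would normally be handled with a vector library.

```python
import math
from typing import Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

def top_k(query: Sequence[float], corpus: Dict[str, Sequence[float]], k: int = 10) -> List[str]:
    """IDs of the k corpus passages most similar to the query, best first."""
    ranked = sorted(corpus, key=lambda doc_id: cosine(query, corpus[doc_id]), reverse=True)
    return ranked[:k]

# Toy corpus: passage ID -> embedding (hypothetical values).
corpus = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.8, 0.6, 0.0],
    "p3": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]  # stands in for an adapter-produced query embedding

results = top_k(query, corpus, k=2)
print(results)
```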

Metrics: MRR@10 (mean reciprocal rank in top 10), R@k (recall at k — is the gold answer in the top k results?). Cost is the actual API cost for embedding queries only (corpus cost is one-time and excluded).
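For reference, the two metrics can be computed as follows. This sketch follows the definitions given above (R@k counts a query as a hit when any gold passage appears in the top k); the function names and the three toy queries are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Run = Tuple[Sequence[str], Set[str]]  # (ranked passage IDs, gold passage IDs)

def reciprocal_rank(ranked: Sequence[str], gold: Set[str], k: int = 10) -> float:
    """1/rank of the first gold passage within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0

def mrr_at_k(runs: List[Run], k: int = 10) -> float:
    """Mean reciprocal rank over all queries."""
    return sum(reciprocal_rank(ranked, gold, k) for ranked, gold in runs) / len(runs)

def recall_at_k(runs: List[Run], k: int) -> float:
    """Fraction of queries with at least one gold passage in the top k."""
    hits = sum(any(doc_id in gold for doc_id in ranked[:k]) for ranked, gold in runs)
    return hits / len(runs)

# Three toy queries: gold at rank 1, gold at rank 2, gold not retrieved.
runs: List[Run] = [
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"y"}),
    (["m", "n", "o"], {"q"}),
]

mrr = mrr_at_k(runs, k=10)
r1 = recall_at_k(runs, k=1)
r10 = recall_at_k(runs, k=10)
print(f"MRR@10={mrr:.3f}  R@1={r1:.3f}  R@10={r10:.3f}")
```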

Datasets: Natural Questions (Google search factoid Q&A, 100 queries, 1,000 passages) and HotpotQA (multi-hop reasoning, 100 queries, 1,190 passages from Wikipedia).

Hardware: Latency and throughput benchmarks run on an NVIDIA RTX 3060 Laptop GPU (6GB VRAM). API response times measured via FastAPI on localhost.

Reproducibility: All benchmarks use fixed random seeds. The evaluation script is available on request.
