Phase 05 - Lesson 22
Embedding Models — The 2026 Deep Dive
This lesson includes a graded coding exercise that runs in your browser, unlocked with lifetime access.
Word2Vec gave you a vector per word. Modern embedding models give you a vector per passage, cross-lingual, with sparse, dense, and multi-vector views, sized to fit your index. Pick wrong and your RAG retrieves the wrong thing.
Type: Learn Languages: Python Prerequisites: Phase 5 · 03 (Word2Vec), Phase 5 · 14 (Information Retrieval) Time: ~60 minutes
The Problem
Your RAG system retrieves the wrong passage 40% of the time. The culprit is rarely the vector database or the prompt. It is the embedding model.
Choosing an embedding in 2026 means picking across five axes:
- Dense vs sparse vs multi-vector. One vector per passage, or one per token, or a sparse weighted bag of words.
- Language coverage. Monolingual English models still win on English-only tasks. Multilingual models win when corpora are mixed.
- Context length. 512 tokens vs 8,192 vs 32,768 — and real effective capacity is often 60-70% of the advertised max.
- Dimension budget. 3,072 floats at full precision = 12 KB per vector. At 100M vectors, storage is ,300/month. Matryoshka truncation cuts this 4×.
- Open vs hosted. Open-weight means you control the stack and data. Hosted means you trade control for always-latest.
This lesson names the tradeoffs so you can pick on evidence, not on whatever was popular last quarter.
The Concept
Dense embeddings. One vector per passage (usually 384-3,072 dimensions). Cosine similarity ranks passages by semantic proximity. OpenAI
text-embedding-3-large, BGE-M3 dense mode, Voyage-3. Default choice.Sparse embeddings. SPLADE-style. A transformer predicts a weight for every vocab token, then zeros out most of them. Result is a sparse vector of size |vocab|. Captures lexical matching (like BM25) but with learned term weights. Strong on keyword-heavy queries.
Multi-vector (late interaction). ColBERTv2, Jina-ColBERT. One vector per token. Scoring with MaxSim: for each query token, find the most similar document token, sum the scores. More expensive to store and score, but wins on long queries and domain-specific corpora.
BGE-M3: all three at once. Single model outputs dense, sparse, and multi-vector representations simultaneously. Each can be queried independently; scores fuse via weighted sum. The 2026 default when you want flexibility from one checkpoint.
Matryoshka Representation Learning. Trained so the first N dimensions of the vector form a useful standalone embedding. Truncate a 1,536-dim vector to 256 dim and pay ~1% accuracy for 6× storage savings. Supported by OpenAI text-3, Cohere v4, Voyage-4, Jina v5, Gemini Embedding 2, Nomic v1.5+.
The MTEB leaderboard tells a partial story
Massive Text Embedding Benchmark — 56 tasks across 8 task types at launch (2022), expanded to 100+ tasks in MTEB v2. In early 2026, Gemini Embedding 2 tops retrieval (67.71 MTEB-R). Cohere embed-v4 leads general (65.2 MTEB). BGE-M3 leads open-weight multilingual (63.0). The leaderboard is necessary but not sufficient — always benchmark on your domain.
The three-tier pattern
Use case Pattern Fast first-pass Dense bi-encoder (BGE-M3, text-3-small) Recall boost Sparse (SPLADE, BGE-M3 sparse) + RRF fuse Precision on top-50 Multi-vector (ColBERTv2) or cross-encoder reranker Most production stacks use all three.
Build It
Step 1: baseline — dense embeddings with Sentence-BERT
from sentence_transformers import SentenceTransformer import numpy as np encoder = SentenceTransformer("BAAI/bge-small-en-v1.5") corpus = [ "The first iPhone launched in 2007.", "Apple released the iPod in 2001.", "Android is an operating system from Google.", ] emb = encoder.encode(corpus, normalize_embeddings=True) query = "When was the iPhone released?" q_emb = encoder.encode([query], normalize_embeddings=True)[0] scores = emb @ q_emb print(sorted(enumerate(scores), key=lambda x: -x[1]))normalize_embeddings=Truemakes the dot product equal cosine similarity. Always set it.Step 2: Matryoshka truncation
def truncate(vectors, dim): out = vectors[:, :dim] return out / np.linalg.norm(out, axis=1, keepdims=True) emb_256 = truncate(emb, 256) emb_128 = truncate(emb, 128)Re-normalize after truncation. Nomic v1.5, OpenAI text-3, and Voyage-4 are trained so this is lossless for the first few levels. Non-Matryoshka models (original Sentence-BERT) degrade sharply when truncated.
Step 3: BGE-M3 multi-functionality
from FlagEmbedding import BGEM3FlagModel model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True) output = model.encode( corpus, return_dense=True, return_sparse=True, return_colbert_vecs=True, ) # output["dense_vecs"]: (n_docs, 1024) # output["lexical_weights"]: list of dict {token_id: weight} # output["colbert_vecs"]: list of (n_tokens, 1024) arraysThree indexes, one inference call. Score fusion:
dense_score = ... # cosine over dense_vecs sparse_score = model.compute_lexical_matching_score(q_lex, d_lex) colbert_score = model.colbert_score(q_col, d_col) final = 0.4 * dense_score + 0.2 * sparse_score + 0.4 * colbert_scoreTune the weights on your domain.
Step 4: MTEB eval on a custom task
from mteb import MTEB tasks = ["ArguAna", "SciFact", "NFCorpus"] evaluation = MTEB(tasks=tasks) results = evaluation.run(encoder, output_folder="./mteb-results")Run your candidate models on a representative subset. Do not trust leaderboard rank alone — your domain matters.
Step 5: hand-rolled cosine from scratch
See
code/main.py. Averaged Hashing Trick embeddings (stdlib-only). Not competitive with transformer embeddings, but shows the shape: tokenize → vector → normalize → dot product.Pitfalls
- Same model for query and doc. Some models (Voyage, Jina-ColBERT) use asymmetric encoding — query and document pass through different paths. Always check the model card.
- Missing prefix.
bge-*models need"Represent this sentence for searching relevant passages: "prepended to queries. 3-5 point recall gap if you forget. - Over-trimming Matryoshka. 1,536 → 256 is usually safe. 1,536 → 64 is not. Validate on your eval set.
- Context truncation. Most models silently truncate inputs over their max length. Long docs need chunking (see lesson 23).
- Ignoring latency tail. MTEB scores hide p99 latency. A 600M model might beat a 335M model by 2 points but cost 3× more per query.
Use It
The 2026 stack:
Situation Pick English-only, fast, API text-embedding-3-largeorvoyage-3-largeOpen-weight, English BAAI/bge-large-en-v1.5Open-weight, multilingual BAAI/bge-m3orQwen3-Embedding-8BLong context (32k+) Voyage-3-large, Cohere embed-v4, Qwen3-Embedding-8B CPU-only deployment Nomic Embed v2 (137M params, MoE) Storage-constrained Matryoshka-truncated + int8 quantization Keyword-heavy queries Add SPLADE sparse, RRF-fuse with dense 2026 pattern: start with BGE-M3 or text-3-large, evaluate on your domain with MTEB, swap if a domain-specific model wins by more than 3 points.
Ship It
Save as
outputs/skill-embedding-picker.md:--- name: embedding-picker description: Pick embedding model, dimension, and retrieval mode for a given corpus and deployment. version: 1.0.0 phase: 5 lesson: 22 tags: [nlp, embeddings, retrieval] --- Given a corpus (size, languages, domain, avg length), deployment target (cloud / edge / on-prem), latency budget, and storage budget, output: 1. Model. Named checkpoint or API. One-sentence reason. 2. Dimension. Full / Matryoshka-truncated / int8-quantized. Reason tied to storage budget. 3. Mode. Dense / sparse / multi-vector / hybrid. Reason. 4. Query prefix / template if required by the model card. 5. Evaluation plan. MTEB tasks relevant to domain + held-out domain eval with nDCG@10. Refuse recommendations that truncate Matryoshka to <64 dims without domain validation. Refuse ColBERTv2 for corpora under 10k passages (overhead not justified). Flag long-document corpora (>8k tokens) routed to models with 512-token windows.Exercises
- Easy. Encode 100 sentences with
bge-small-en-v1.5at full dim (384), then at Matryoshka 128. Measure MRR drop on 10 queries. - Medium. Compare BGE-M3 dense, sparse, and colbert on 500 passages from your domain. Which wins on recall@10? Does RRF fusion beat the best single mode?
- Hard. Run MTEB on three candidate models across your top-2 domain tasks. Report MTEB score, p99 latency on a 100-query batch, and $/1M queries. Pick the Pareto-optimal one.
Key Terms
Term What people say What it actually means Dense embedding The vector One fixed-size vector per text. Cosine similarity for ranking. Sparse embedding Learned BM25 One weight per vocab token; mostly zeros; trained end-to-end. Multi-vector ColBERT-style One vector per token; MaxSim scoring; bigger index, better recall. Matryoshka Russian doll trick First N dims are a valid smaller embedding on their own. MTEB The benchmark Massive Text Embedding Benchmark — 56 tasks at launch, 100+ in v2. BEIR The retrieval benchmark 18 zero-shot retrieval tasks; often cited for cross-domain robustness. Asymmetric encoding Query ≠ doc path Model uses different projections for queries and documents. Further Reading
- Reimers, Gurevych (2019). Sentence-BERT — the bi-encoder paper.
- Muennighoff et al. (2022). MTEB: Massive Text Embedding Benchmark — the leaderboard paper.
- Chen et al. (2024). BGE-M3: Multi-lingual, Multi-functionality, Multi-granularity — the unified three-mode model.
- Kusupati et al. (2022). Matryoshka Representation Learning — the dimension-ladder training objective.
- Santhanam et al. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction — late interaction in production.
- MTEB leaderboard on Hugging Face — live rankings.