Phase 11 - Lesson 04

Embeddings & Vector Representations

This lesson includes a graded coding exercise that runs in your browser, unlocked with lifetime access.

Text is discrete. Math is continuous. Every time you ask an LLM to find "similar" documents, compare meanings, or search beyond keywords, you're relying on a bridge between these two worlds. That bridge is an embedding. If you don't understand embeddings, you don't understand modern AI. You just use it.

Type: Build Languages: Python Prerequisites: Phase 11, Lesson 01 (Prompt Engineering) Time: ~75 minutes Related: Phase 5 · 22 (Embedding Models Deep Dive) covers dense vs sparse vs multi-vector, Matryoshka truncation, and per-axis model selection. This lesson focuses on the production pipeline (vector DBs, HNSW, similarity math). Read Phase 5 · 22 before picking a model.

Learning Objectives

Generate text embeddings using API providers and open-source models, and compute cosine similarity between them
Explain why embeddings solve the vocabulary mismatch problem that keyword search cannot handle
Build a semantic search index that retrieves documents by meaning rather than exact keyword match
Evaluate embedding quality using retrieval benchmarks (precision@k, recall) and choose the right embedding model for your task

The Problem

You have 10,000 support tickets. A customer writes "my payment didn't go through." You need to find similar past tickets. Keyword search finds tickets containing "payment" and "didn't go through." It misses "transaction failed," "charge was declined," and "billing error." These tickets describe the exact same problem with completely different words.

This is the vocabulary mismatch problem. Human language has dozens of ways to say the same thing. Keyword search treats each word as an independent symbol with no meaning. It cannot know that "declined" and "didn't go through" refer to the same concept.

You need a representation of text where meaning, not spelling, determines similarity. You need a way to place "my payment didn't go through" and "transaction was declined" close together in some mathematical space, while pushing "my payment arrived on time" far away despite sharing the word "payment."

That representation is an embedding.

The Concept

What Is an Embedding?

An embedding is a dense vector of floating-point numbers that represents the meaning of text. The word "dense" matters -- every dimension carries information, unlike sparse representations (bag-of-words, TF-IDF) where most dimensions are zero.

"The cat sat on the mat" becomes something like [0.023, -0.041, 0.087, ..., 0.012] -- a list of 768 to 3072 numbers depending on the model. These numbers encode meaning. You never inspect them directly. You compare them.

The Word2Vec Breakthrough

In 2013, Tomas Mikolov and colleagues at Google published Word2Vec. The core insight: train a neural network to predict a word from its neighbors (or neighbors from a word), and the hidden layer weights become meaningful vector representations.

The famous result:

king - man + woman = queen

Vector arithmetic on word embeddings captures semantic relationships. The direction from "man" to "woman" is roughly the same as the direction from "king" to "queen." This was the moment the field realized that geometry could encode meaning.

Word2Vec produced 300-dimensional vectors. Each word got one vector regardless of context. "Bank" in "river bank" and "bank account" had the same embedding. This limitation drove the next decade of research.

From Words to Sentences

Word embeddings represent single tokens. Production systems need to embed entire sentences, paragraphs, or documents. Four approaches emerged:

Averaging: take the mean of all word vectors in the sentence. Cheap, lossy, surprisingly decent for short text. Loses word order entirely -- "dog bites man" and "man bites dog" get identical embeddings.

CLS token: transformer models (BERT, 2018) output a special [CLS] token embedding that represents the entire input. Better than averaging but the [CLS] token was trained for next-sentence prediction, not similarity.

Contrastive learning: train the model explicitly to push similar pairs together and dissimilar pairs apart. Sentence-BERT (Reimers & Gurevych, 2019) used this approach and became the foundation for modern embedding models. Given "How do I reset my password?" and "I need to change my password," the model learns these should have nearly identical vectors.

Instruction-tuned embeddings: the latest approach. Models like E5 and GTE accept a task prefix ("search_query:", "search_document:") that tells the model what kind of embedding to produce. This lets one model serve multiple tasks.

graph LR
    subgraph "2013: Word2Vec"
        W1["king"] --> V1["[0.2, -0.1, ...]"]
        W2["queen"] --> V2["[0.3, -0.2, ...]"]
    end

    subgraph "2019: Sentence-BERT"
        S1["How do I reset my password?"] --> E1["[0.04, 0.12, ...]"]
        S2["I need to change my password"] --> E2["[0.05, 0.11, ...]"]
    end

    subgraph "2024: Instruction-Tuned"
        I1["search_query: password reset"] --> T1["[0.08, 0.09, ...]"]
        I2["search_document: To reset your password, click..."] --> T2["[0.07, 0.10, ...]"]
    end

Modern Embedding Models

The market has settled into a handful of production-grade options (MTEB scores as of early 2026, MTEB v2):

Model	Provider	Dimensions	MTEB	Context	Cost / 1M tokens
Gemini Embedding 2	Google	3072 (Matryoshka)	67.7 (retrieval)	8192	$0.15
embed-v4	Cohere	1024 (Matryoshka)	65.2	128K	$0.12
voyage-4	Voyage AI	1024/2048 (Matryoshka)	66.8	32K	$0.12
text-embedding-3-large	OpenAI	3072 (Matryoshka)	64.6	8192	$0.13
text-embedding-3-small	OpenAI	1536 (Matryoshka)	62.3	8192	$0.02
BGE-M3	BAAI	1024 (dense+sparse+ColBERT)	63.0 multilingual	8192	Open-weight
Qwen3-Embedding	Alibaba	4096 (Matryoshka)	66.9	32K	Open-weight
Nomic-embed-v2	Nomic	768 (Matryoshka)	63.1	8192	Open-weight

MTEB (Massive Text Embedding Benchmark) v2 covers 100+ tasks across retrieval, classification, clustering, reranking, and summarization. Higher is better. By 2026, open-weight models (Qwen3-Embedding, BGE-M3) match or beat closed hosted models on most axes. Gemini Embedding 2 leads pure retrieval; Voyage/Cohere lead specific domains (finance, law, code). Always benchmark on your own queries before committing.

Similarity Metrics

Given two embedding vectors, three ways to measure how similar they are:

Cosine similarity: the cosine of the angle between two vectors. Ranges from -1 (opposite) to 1 (identical direction). Ignores magnitude -- a 10-word sentence and a 500-word document can score 1.0 if they point the same direction. This is the default for 90% of use cases.

cosine_sim(a, b) = dot(a, b) / (||a|| * ||b||)

Dot product: the raw inner product of two vectors. Identical to cosine similarity when vectors are normalized (unit length). Faster to compute. OpenAI's embeddings are normalized, so dot product and cosine give the same ranking.

dot(a, b) = sum(a_i * b_i)

Euclidean (L2) distance: straight-line distance in the vector space. Smaller = more similar. Sensitive to magnitude differences. Use when the absolute position in space matters, not just the direction.

L2(a, b) = sqrt(sum((a_i - b_i)^2))

When to use which:

Metric	Use when	Avoid when
Cosine similarity	Comparing texts of different lengths; most retrieval tasks	Magnitude carries information
Dot product	Embeddings are already normalized; maximum speed	Vectors have varying magnitudes
Euclidean distance	Clustering; spatial nearest-neighbor problems	Comparing documents of wildly different lengths

Vector Databases and HNSW

A brute-force similarity search compares the query against every stored vector. At 1 million vectors with 1536 dimensions, that is 1.5 billion multiply-add operations per query. Too slow.

Vector databases solve this with Approximate Nearest Neighbor (ANN) algorithms. The dominant algorithm is HNSW (Hierarchical Navigable Small World):

Build a multi-layer graph of vectors
Top layers are sparse -- long-range connections between distant clusters
Bottom layers are dense -- fine-grained connections between nearby vectors
Search starts at the top layer, greedily descending to refine
Returns approximate top-k results in O(log n) time instead of O(n)

HNSW trades a small accuracy loss (typically 95-99% recall) for massive speed gains. At 10 million vectors, brute force takes seconds. HNSW takes milliseconds.

graph TD
    subgraph "HNSW Layers"
        L2["Layer 2 (sparse)"] -->|"long jumps"| L1["Layer 1 (medium)"]
        L1 -->|"shorter jumps"| L0["Layer 0 (dense, all vectors)"]
    end

    Q["Query vector"] -->|"enter at top"| L2
    L0 -->|"nearest neighbors"| R["Top-k results"]

Production options:

Database	Type	Best for	Max scale
Pinecone	Managed SaaS	Zero-ops production	Billions
Weaviate	Open source	Self-hosted, hybrid search	100M+
Qdrant	Open source	High performance, filtering	100M+
ChromaDB	Embedded	Prototyping, local dev	1M
pgvector	Postgres extension	Already using Postgres	10M
FAISS	Library	In-process, research	1B+

Chunking Strategies

Documents are too long to embed as single vectors. A 50-page PDF covers dozens of topics -- its embedding becomes an average of everything, similar to nothing specific. You split documents into chunks and embed each one.

Fixed-size chunking: split every N tokens with M-token overlap. Simple and predictable. Works well when documents have no clear structure. A 512-token chunk with 50-token overlap: chunk 1 is tokens 0-511, chunk 2 is tokens 462-973.

Sentence-based chunking: split at sentence boundaries, grouping sentences until reaching the token limit. Each chunk is at least one complete sentence. Better than fixed-size because you never cut a thought in half.

Recursive chunking: try splitting at the largest boundary first (section headers). If still too large, try paragraph boundaries. Then sentence boundaries. Then character limits. This is LangChain's RecursiveCharacterTextSplitter and it works well for mixed-format corpora.

Semantic chunking: embed each sentence, then group consecutive sentences whose embeddings are similar. When the embedding similarity drops below a threshold, start a new chunk. Expensive (requires embedding every sentence individually) but produces the most coherent chunks.

Strategy	Complexity	Quality	Best for
Fixed-size	Low	Decent	Unstructured text, logs
Sentence-based	Low	Good	Articles, emails
Recursive	Medium	Good	Markdown, HTML, mixed docs
Semantic	High	Best	Critical retrieval quality

The sweet spot for most systems: 256-512 token chunks with 50-token overlap.

Bi-Encoders vs Cross-Encoders

A bi-encoder embeds the query and documents independently, then compares vectors. Fast -- you embed the query once and compare against pre-computed document embeddings. This is what you use for retrieval.

A cross-encoder takes the query and a document as a single input and outputs a relevance score. Slow -- it processes each query-document pair through the full model. But far more accurate because it can attend across query and document tokens simultaneously.

The production pattern: bi-encoder retrieves top-100 candidates, cross-encoder reranks them to top-10. This is the retrieve-then-rerank pipeline.

graph LR
    Q["Query"] --> BE["Bi-Encoder: embed query"]
    BE --> VS["Vector search: top 100"]
    VS --> CE["Cross-Encoder: rerank"]
    CE --> R["Top 10 results"]

Reranking models: Cohere Rerank 3.5 (

Term	What people say	What it actually means
Embedding	"Text to numbers"	A dense vector where geometric proximity encodes semantic similarity
Word2Vec	"The OG embedding"	2013 model that learned word vectors by predicting context words; proved vector arithmetic encodes meaning
Cosine similarity	"How similar are two vectors"	Cosine of the angle between vectors; 1 = identical direction, 0 = orthogonal, -1 = opposite
HNSW	"Fast vector search"	Hierarchical Navigable Small World graph -- multi-layer structure enabling O(log n) approximate nearest neighbor search
Bi-encoder	"Embed separately, compare fast"	Encodes query and document independently into vectors; enables pre-computation and fast retrieval
Cross-encoder	"Slow but accurate reranker"	Processes query-document pair jointly through the full model; higher accuracy, no pre-computation
Matryoshka embeddings	"Truncatable vectors"	Embeddings trained so the first N dimensions capture the most important information, enabling variable-size storage
Binary quantization	"1-bit embeddings"	Converting float vectors to binary (sign bit only) for 32x storage reduction with Hamming distance search
Chunking	"Split docs for embedding"	Breaking documents into 256-512 token segments so each can be independently embedded and retrieved
Vector database	"Search engine for embeddings"	Data store optimized for storing vectors and performing approximate nearest neighbor search at scale
Contrastive learning	"Train by comparison"	Training approach that pushes similar pair embeddings together and dissimilar pair embeddings apart
MTEB	"The embedding benchmark"	Massive Text Embedding Benchmark -- 56 datasets across 8 tasks; standard for comparing embedding models

Embeddings & Vector Representations

Learning Objectives

The Problem

The Concept

What Is an Embedding?

The Word2Vec Breakthrough

From Words to Sentences

Modern Embedding Models

Similarity Metrics

Vector Databases and HNSW

Chunking Strategies

Bi-Encoders vs Cross-Encoders

Matryoshka Embeddings

Binary Quantization

Build It

Step 1: Text Chunking

Step 2: Building Embeddings from Scratch

Step 3: Similarity Functions

Step 4: Vector Index with Brute-Force Search

Step 5: The Semantic Search Engine

Step 6: Comparing Similarity Metrics

Use It

Ship It

Exercises

Key Terms

Further Reading