For software engineers who know what an index is and what an embedding is.
Start Here: A Question You Think You Already Know the Answer To
You have a chunks table with 50,000 rows. Each row has a text column and a vector column — a list of 1536 floating point numbers representing the semantic meaning of that text.
A user query arrives. You embed it into another 1536-dimensional vector. Now you want the most semantically similar chunks.
Your instinct: Just put an index on the vector column.
Before you do that — stop. Ask yourself: what property of a B-tree index makes it fast?
A B-tree works because values have a natural, universal sort order. “Mumbai” always comes after “Kolkata” regardless of what you’re searching for. That stable ordering lets the database eliminate half the search space at every level — giving you O(log n) instead of O(n).
Now here is the uncomfortable question:
Can you sort 1536-dimensional vectors the same way?
Why B-tree Fails: Similarity Has No Universal Order
Cosine similarity — the measure of closeness between two vectors — is not a property of a single vector. It is a relationship between two vectors.
The similarity ranking of your 50,000 chunks changes completely with every new query. A chunk about “Typhoid symptoms” ranks highest for one query and near the bottom for another.
There is no universal ordering. No “this chunk always comes before that chunk.” B-tree’s entire superpower — pre-sorted, query-independent structure — simply does not exist here.
B-tree is dead for vector search. So what’s left?
The Honest Fallback: Brute Force
Without a pre-sorted structure, the only option is to compare the query vector against every single chunk vector — compute cosine similarity 50,000 times and return the top results.
Let’s calculate the real cost:
50,000 chunks
× 1,536 dimensions (floating point operations per comparison)
= 76,800,000 operations per query
At 1,000 concurrent users, that’s 76 billion operations per second. Every time someone asks a question.
This is the full table scan of the vector world — correct, exact, and completely unacceptable at scale.
Something has to give. But what?
The Key Insight: You Don’t Need the Perfect Answer
Here is where semantic search differs fundamentally from relational queries.
In a relational database, being wrong is catastrophic. Returning the wrong user_id breaks everything. Exactness is non-negotiable.
In semantic search, ask yourself: if exact search returns the chunk ranked #1 with similarity 0.921, and an approximate method returns the chunk ranked #2 with similarity 0.918 — does the student’s answer actually get worse?
Almost never. Because:
- The LLM synthesises answers from multiple chunks anyway
- Chunks about the same topic cluster closely in vector space — their similarity scores are near-identical
- “Close enough” is genuinely good enough
This trade-off has a name: Approximate Nearest Neighbour (ANN) search.
Sacrifice occasional exactness. Gain dramatic speed. For semantic search, this is a completely valid trade.
The “how often do we find the true nearest neighbour” is measured by recall. A recall of 0.95 means 95% of queries return the true best chunk. For RAG pipelines, this is more than sufficient.
Now the real engineering question: how do you build an index that avoids comparing against all 50,000 chunks?
The Graph Idea: Navigate Like a Network
Imagine you move to a new city and want to find people with similar interests. You have two strategies:
- Strategy A: Interview all 10 million residents one by one
- Strategy B: Find one person with similar interests, ask them who they know, and follow the network
Strategy B is faster for one simple reason: similarity clusters. People with similar interests know other people with similar interests. You navigate through a connected graph toward your target rather than scanning everyone.
The same logic applies to vectors. Semantically similar chunks live near each other in vector space. Build a proximity graph — each chunk node connects to its most similar neighbouring chunks — and you can navigate toward your query vector by following edges.
This is the Navigable Small World (NSW) idea. Even with only local connections, a well-connected graph lets you reach any node from any other node in a surprisingly small number of hops.
But this flat graph has a serious weakness.
The Flat Graph Problem: You Dropped In at the Wrong Neighbourhood
With a flat proximity graph, every node connects only to its nearest neighbours. If your query is about “Typhoid symptoms” but you start your traversal at a node about “antibiotic resistance” — you are making dozens of small local hops to cross a large semantic distance.
The starting point is essentially random, and local connections alone cannot navigate large distances efficiently.
How do you solve a navigation problem that requires both long-range and short-range movement?
Think about how you navigate geographically. You don’t open a street-level map of the entire country when driving from Mumbai to a specific street in Delhi. You:
- First zoom out — find the right city
- Zoom in — find the right neighbourhood
- Zoom in further — find the exact street
Each level uses a different scale of connections. Highways at the top. Local roads at the bottom.
Apply the same hierarchy to your proximity graph.
Hierarchical Layers: The H in HNSW
Build multiple layers of graphs stacked on top of each other:
Layer 3 (top): ~50 nodes — long-range connections, coarse navigation
Layer 2: ~500 nodes — medium-range connections
Layer 1: ~5,000 nodes — shorter-range connections
Layer 0 (base): 50,000 nodes — all chunks, fine-grained local search
Search always starts at the top layer, navigates to the best entry point, then drops into the next layer — using the landing point from above as the starting point below. Each layer narrows the search region. By the time you reach Layer 0, you are already in the right semantic neighbourhood and only need local hops to find the nearest chunk.
This is Hierarchical Navigable Small World — HNSW.
One question remains: which nodes get promoted to upper layers?
Not the “most important” ones. Not the most central ones. Content-based selection is dangerous — it could leave entire topics unrepresented in upper layers, crippling recall for those topics.
The answer is elegantly simple: random selection via a probability decay function.
Every node → enters Layer 0
~10% of nodes → randomly promoted to Layer 1
~1% of nodes → randomly promoted to Layer 2
~0.1% of nodes → randomly promoted to Layer 3
Randomness guarantees unbiased coverage. No topic gets systematically excluded. No human judgement needed about what is “important.”
How Search Actually Works
When a query arrives:
- Start at Layer 3 — compare against the handful of nodes there, find the closest one
- Drop to Layer 2 — use that node as the entry point, greedily hop to closer neighbours
- Drop to Layer 1 — same process, narrowing into the right semantic region
- Drop to Layer 0 — now do a thorough local search among nearby chunks
- Return top-k results
At no point do you compare against all 50,000 chunks. You navigate toward the answer through a hierarchy of connected graphs.
The Three Trade-offs You’re Actually Making
HNSW is not free. Here is the honest cost:
| Brute Force | HNSW | |
|---|---|---|
| Query speed | Slow (O(n×d)) | Fast (O(log n)) |
| Exactness | Yes | No (approximate) |
| Build cost | None | High — graph construction at ingest time |
| Memory cost | Low | Higher — stores graph connections alongside vectors |
| Works for semantic search | Yes, slowly | Yes, efficiently |
The Three Knobs
HNSW exposes three parameters that let you tune the recall-speed trade-off:
M — Number of connections per node (graph density)
- Higher M → denser graph → better recall → more memory and slower build
ef_construction — How many candidate neighbours are evaluated when building the graph
- Higher ef_construction → better quality graph → slower index build
ef_search — How many candidates are evaluated at query time
- Higher ef_search → better recall → slower queries
- Lower ef_search → faster queries → occasionally misses the true nearest neighbour
Practical Guidance: Which Knob to Turn First
If your recall drops below your acceptable threshold after evaluation:
Step 1 — Increase ef_search first.
Zero rebuild cost. Takes effect immediately. Trades a few milliseconds of query latency for better recall. Try doubling it and re-evaluate.
Step 2 — Only if that’s not enough, increase M and ef_construction.
This requires a full index rebuild. For 50,000 chunks embedded with text-embedding-3-large, that means re-running the entire embedding pipeline — time-consuming and potentially costly.
Always exhaust query-time tuning before paying the build-time cost.
The Full Picture in One Glance
Why not B-tree?
→ No universal sort order for vectors
(similarity is relative to each query)
Why not brute force?
→ ~77M operations per query, unacceptable at scale
Why ANN?
→ Semantic search doesn't need exactness
"Good enough" chunk → same answer quality
Trade recall for speed
Why graph-based (NSW)?
→ Navigate similarity space like a network
Follow connections toward query region
Why hierarchical (H)?
→ Flat graph = too many hops across large distances
Multiple layers = coarse navigation first, fine search last
Upper layers selected randomly → unbiased topic coverage
The three knobs:
M → graph density
ef_construction → build-time thoroughness
ef_search → query-time thoroughness (tune this first)
The One-Line Summary
HNSW trades a small, controlled amount of exactness — which semantic search doesn’t need anyway — for search that is orders of magnitude faster than scanning every vector, by navigating a hierarchical graph from coarse to fine.
Your B-tree knew where to go because values had order. HNSW knows where to go because similar things connect to similar things — and some of those connections skip far ahead.
If you’re implementing this in pgvector, the default HNSW parameters are m=16 and ef_construction=64. Start with ef_search=40 and tune upward if your recall evaluation shows gaps.