Shares words with the query - but not meaning.
Zero shared words. Same meaning.
TakeawayTraditional search matches tokens. Vector search matches meaning.
Each model turns its input into an array of numbers - the embedding's position in high-dimensional space. Anywhere from a few hundred to a few thousand Float32 values.
RAG
Ground LLMs in your own documents
Agent memory
Recall the right past conversation
Legal analysis
Surface relevant case law
Fraud detection
Spot the needle in a stack of needles
Song matching
Identify a tune from a whistle
Visual search
Find products that look like this photo
Autonomous driving
Detect erratic lane changes
Molecular discovery
Find molecules with similar shape
Cancer screening
Match diagnostic images to known cases
How do you measure similarity in multi-dimensional space?
Compare the query to every entity in the database. Exact, simple, O(N).
16 faces, fine. A billion vectors? Not so much. Latency grows linearly and ~all of the comparisons are waste.
Every technique trades speed, accuracy & cost.
Of all the good stuff, how much did we find?
Of what we returned, how much was good?
Production notesRecall@k can be calculated against brute force / exact match results
ThoughtWhat would happen if we filtered by release year?
IVF clusters the vectors into nlist cells. At query time, only search within the nearest nprobe cells.
| Param | What it does | Bigger means | When it's set | Good default |
|---|---|---|---|---|
nlist |
number of Voronoi cells, set at build | finer cells, slower build, more centroids in RAM | Build | √N |
nprobe |
cells searched per query | higher recall, slower query | Query | 8 - 16 |
Recall too lowRaise
nprobefirst. If it plateaus,nlistis too high for your data: rebuild with fewer cells.
Too slowLower
nprobe. SwitchIVF_FLAT→IVF_SQ/IVF_PQto shrink each cell scan.
Memory / build costLower
nlist, or useIVF_PQto compress the vectors inside each cell.
Hierarchical Navigable Small World. Multi-layer graph: top layers have long-range highways, lower layers have local connections. Start at the top, walk greedily closer, drop down a layer, repeat.
| Param | What it does | Bigger means | When it's set | Good default |
|---|---|---|---|---|
M |
edges per node | better recall, more RAM, slower build | Build | 16 |
efConstruction |
candidate-list width during build | better graph quality, slower build | Build | 200 |
ef |
candidate-list width per query (≥ k) | higher recall, slower query | Query | 64 |
Recall too lowRaise
effirst (no rebuild needed). Still short? IncreaseMandefConstruction, then rebuild.
Too slowLower
ef. A higherMlets a lowerefhit the same recall at the cost of higher RAM footprint.
MemoryLower
M, or 2 - 32x savings with quantisation.
Graph index, engineered for SSD. Minimises random reads, index billions of vectors on ~GBs of RAM.
| Param | What it does | Bigger means | When it's set | Good default |
|---|---|---|---|---|
max_degree |
graph out-degree (R) | better recall, larger index, slower build | Build | 56 |
search_list_size |
build-time beam width (L) | better graph quality, slower build | Build | 100 |
search_list |
candidate list per query (≥ k) | higher recall, more SSD reads, slower query | Query | 100 |
Recall too lowRaise
search_listfirst. If it plateaus, rebuild with a highermax_degree.
Too slow / spiky p99Lower
search_list. SSD random-read IOPS is the bottleneck - make sure you're on NVMe!
Approximate nearest-neighbour algorithms all trade perfection for reduced latency and cost.
HNSW, IVF, DiskANN, nlist, nprobe, M, ef, search_list. Can't the machine work it out?
You tune · open-source Milvus / other VectorDB
nlist, M / efConstruction, graph degreenprobe, ef - and re-tune as data shiftsAUTOINDEX decides · managed
level dial (1 - 10), default targets ~90% recallTrade-offFull control and full responsibility, or one dial and trust the engine.
512 dimensions → a 16×32 grid → hue based on normalised dimension value → a fingerprint for each face.
The fingerprint isn't decoration - it is the geometry. Close vectors share a pattern; distant ones don't.
Indexes make search fast. Quantisation makes vectors small. Round float32 → int8: 4× smaller embeddings, a small recall hit, almost no work.
Cheapest winNo training, no codebook - just rescale each value into a byte. 4× smaller, and most indexes support it out of the box.
A float32 vector becomes one byte per dimension - step through the moves.
The recent breakthrough: rotate the space, then keep just the sign of each dimension - one bit. The bit-vector preserves angles with a provable error bound, and a cheap correction term sharpens the estimate. Paired with the RaBitQ index in Milvus: up to 32× compression.
Milvus 2.6 · 1M × 768-D1-bit alone: 32× smaller, recall 0.76. Refine / rescore and recall recovers to 0.95 - at ~4× the throughput of full-precision flat.
Rotate the space, then keep one bit per dimension - step through the moves.
Scalar quantisation shrinks every number a little. PQ shrinks the whole vector a lot.
512 floats collapse to 8 IDs: a barcode.
WarningPQ leans on a static codebook - learned once, it degrades quietly under model drift.
A 512-D vector becomes eight centroid IDs - step through the four moves.
Every lost bit risks recall, but the curve is surprisingly forgiving.
Each algorithm can use quantisation to trade accuracy for significantly reduced latency and cost.
PCA finds the directions of greatest variance and keeps the top k. Fewer dimensions, full precision.
BenefitKeep one number instead of two and 94% of the variance - linear, fast, deterministic.
DrawbackMaximises for variance, not meaning: structure on a low-variance axis is discarded, and it must be refit when the data shifts.
MRL tunes the model so the dimensions are ordered by importance. OpenAI's text-embedding-3-large is 3072-D native, but you can ask for any prefix down to 256-D via the dimensions parameter. The trade-off defers to query time.
BenefitOne model, pick the length per query - short prefix to shortlist fast, full vector to re-rank. Degrades gracefully.
DrawbackOnly works if the model was trained this way - truncate an ordinary embedding and recall falls off a cliff (the berry line).
Funnel retrieval with Matryoshka embeddings · milvus.io/blog
The dimensions are ordered by importance, so a prefix is a complete vector - step through the moves.
Dimensionality reduction nudges any index toward fast and cheap at once.
Build time compression and dimensionality reduction both trade accuracy to buy speed and scale. Refinement wins accuracy back at query time.
Milvus built-inSet
refine: trueat build, tunerefine_kat query. Supported on RaBitQ, PQ and SQ indexes.
SuperpowerZilliz uses this technique for indexing external tables, for on-demand lakebase compute.
PCA and Matryoshka trade accuracy for speed and cost. Refinement spends a little of both to buy accuracy back - the same triangle, travelled in reverse.
Trivial in SQL. On a graph index, the obvious fix quietly backfires 😞
The catchThe harder you filter, the more of the graph you destroy. So there's no single fix - the right technique depends on how much survives the filter.
How much of your data survives the filter decides the strategy.
High selectivity (few pass) → Medium → Low selectivity (most pass)
High · brute force
The filter leaves only a handful of candidates. Skip the graph entirely and compute exact distances over the survivors - cheap because the set is tiny, and 100% recall.
Medium · filter-aware graph
Bake the filter labels into graph construction - the alpha pruning parameter keeps matching nodes reachable. You traverse only valid nodes without fragmenting the index.
Low · post-filter
Almost everything passes, so search the full graph and drop the few non-matches afterward. Over-fetch a little to backfill your k.
What modern engines doZilliz watches selectivity per query and picks the strategy automatically - so you stay connected and accurate across the whole range.
SQL / Lucene · fails loudly
EXPLAIN hands you the plan - which index, which scan, what it costVector search · fails silently
The gapSQL fails loudly. Vector search fails silently - so we build the instrumentation back ourselves.
You can't eyeball recall. You need a number - and you need it on every deploy.
Build a golden set: freeze a sample of real queries, compute their true neighbours once with exact brute-force - the O(N) scan from the start of this talk. That's your ground truth. Then score the production index against it - recall@k, continuously.
Maintaining a golden set is work. Zilliz Cloud can compute recall@k for you, per query.
// POST /v2/vectordb/entities/search
{ "data": [[0.12, -0.04, ...]], // query, embedded
"limit": 10, // k = 10
"searchParams": { "level": 6, "enableRecallCalculation": true } }
// → response
{ "code": 0,
"data": [
{ "distance": 0.912, "title": "The Terminator" }, // ✓ relevant
{ "distance": 0.874, "title": "Terminator 3" }, // ✓ relevant
{ "distance": 0.861, "title": "I, Robot" }, // ✗ off-theme
// … 7 more …
],
"recalls": [0.667] } // 4 of 6 true neighbours in top-10
HowIt runs your search twice - once at your
level, once in a high-precision mode that stands in as ground truth. The brute-force comparison from the last slide, done for you, per query.
| Symptom | Likely cause | Where to look |
|---|---|---|
| Recall drops, latency flat | Index params drifted, or the data outgrew them | Raise nprobe / ef search effort |
| Recall drops right after a deploy | The embedding model changed | Full reindex - old and new vectors aren't comparable |
| Fine in tests, wrong in production | Filtering | Pre- vs post-filter; a selective filter wrecked the index |
| Scores all clustered, none confident | Cross-modal miscalibration | Normalise per modality; add a re-ranker |
| Recall erodes slowly over weeks | Concept drift - the world moved on | Refresh embeddings; watch the golden set |
| Memory or cost spiked | Quantisation / index misconfigured | Compression level vs your recall budget |
A database throws an error, an agent won't.
Feed a RAG pipeline or an agent degraded results and nothing crashes. It just gets a bit worse, every time.
The failure never surfaces as a failure.
It surfaces as "the assistant got dumber"
Catch it hereInstrument retrieval itself - recall@k, score spread, filter hit-rate - and watch it before the agent ever consumes the results.
Pro-tipUse refinement and semantic highlighting to defend against poor results and high token usage.