Two queries. Same intent. Zero overlap.
| Query | Inverted index match |
|---|---|
| car | docs containing "car" |
| automobile | docs containing "automobile" |
A keyword index has no idea these mean the same thing. Synonym lists patch a few cases. They don't scale to meaning.
Not "which documents contain these characters?"
But "which documents are about the same thing as my query?"
The trick: turn meaning into numbers — then meaning becomes geometry.
"automobile" ──► [embedding model] ──► [0.12, -0.85, 0.43, ... 768 numbers]
"car" ──► [embedding model] ──► [0.14, -0.81, 0.40, ... 768 numbers]
Same model, similar meaning → similar vector. That's it. That's the whole idea.
The model is pre-trained. You feed it text, get back a fixed-length list of floats — usually 256, 768, or 1536 of them.
Every one of these reduces to the same primitive: find the nearest vectors.
Pick a word. See its five nearest neighbours.
Words from the same category cluster together. The model learned this from text alone — nobody told it "cat" and "dog" are both animals.
A vector is a list of numbers. A list of numbers is a point in space.
You can't picture 768 dimensions. The math doesn't care. Distance is still distance.
Two points close together → similar meaning.
Two points far apart → different meaning.
The embedding model spends all its training compute making sure this property holds: things that mean the same end up near each other in vector space.
| Metric | What it measures | Use when |
|---|---|---|
| Cosine | angle between vectors | text embeddings (the default) |
| Euclidean (L2) | straight-line distance | image embeddings, geometry |
| Inner product | dot product, magnitude matters | recommender scores, ranking |
Pick one and stick with it — your index has to be built for the metric you query with.
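To make the table concrete, here is a minimal sketch of the three metrics in plain Python. The vectors `a` and `b` are toy values chosen for illustration: `b` points the same way as `a` but is twice as long, so the three metrics disagree about how "close" they are.

```python
import math

def dot(a, b):
    """Inner product: grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line (L2) distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]  # toy data: same direction as a, twice the length

cos_ab = cosine_similarity(a, b)   # direction only
l2_ab = euclidean_distance(a, b)   # position in space
ip_ab = dot(a, b)                  # direction and magnitude together
```

Cosine calls these two identical (similarity 1.0), Euclidean puts them 3.0 apart, and the inner product is 18.0 because it rewards the longer vector.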
For every vector in the database, compute its distance to the query. Sort the list. Return the top k.
That's it. That's the algorithm.
It's also the slowest possible way to do this.
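The brute-force scan fits in a few lines of NumPy. This is a sketch, not a tuned implementation: the random vectors and the helper name `brute_force_search` are made up for the example, and it uses L2 distance.

```python
import numpy as np

def brute_force_search(vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k rows of `vectors` closest to `query` (L2)."""
    # One distance per stored vector: this is the "touch every byte" step.
    distances = np.linalg.norm(vectors - query, axis=1)
    # Sort everything, keep the first k.
    return np.argsort(distances)[:k]

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 768)).astype(np.float32)  # stand-in embeddings

# A query that is a slightly noisy copy of row 42, so row 42 should win.
query = vectors[42] + 0.01 * rng.standard_normal(768).astype(np.float32)

top5 = brute_force_search(vectors, query, k=5)
```

The cost is one pass over the whole matrix per query, which is why it stops being viable as the collection grows.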
100M vectors × 768 dims × 4 bytes/float = 307 GB.
Every query touches every byte. At memory-bandwidth limits, that's seconds per query — on hardware that can serve a thousand HTTP requests in the same time.
Linear search dies somewhere around 100K–1M vectors. After that, you need an index.
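The arithmetic behind those numbers, spelled out. The bandwidth figure is an assumption picked for illustration, not a number from this talk; plug in your own hardware.

```python
num_vectors = 100_000_000
dims = 768
bytes_per_float = 4  # float32

total_bytes = num_vectors * dims * bytes_per_float
total_gb = total_bytes / 1e9  # decimal gigabytes

# Assumption: sustained memory bandwidth of 100 GB/s.
assumed_bandwidth_gb_per_s = 100

seconds_per_query = total_gb / assumed_bandwidth_gb_per_s
print(f"{total_gb:.1f} GB to scan, ~{seconds_per_query:.1f} s per query")
```

Under that assumption a single query needs about three seconds just to read the data, before any distance math.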
Approximate Nearest Neighbour (ANN) indexes such as HNSW, IVF, and ScaNN trade a little recall (around 1%) for a 100×–1000× speedup.
The idea: pre-organize the vector space so a query only has to look at a small, well-chosen subset.
You give up "guaranteed top-5". You get sub-100ms queries over a billion vectors. Worth it.
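One way to "pre-organize the space" is an IVF-style index: cluster the vectors once, then at query time scan only the few clusters nearest the query. The sketch below is a toy version in NumPy. The function names, the clustered toy data, and the parameter values are all assumptions for illustration; production IVF implementations add much more.

```python
import numpy as np

def nearest_centroid(vectors, centroids):
    """Index of the closest centroid for every vector (squared L2)."""
    d2 = ((vectors**2).sum(1)[:, None]
          - 2 * vectors @ centroids.T
          + (centroids**2).sum(1)[None, :])
    return d2.argmin(axis=1)

def build_ivf(vectors, n_lists, n_iters=10, seed=0):
    """A few rounds of k-means; returns centroids and one id list per cluster."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)].copy()
    for _ in range(n_iters):
        assign = nearest_centroid(vectors, centroids)
        for c in range(n_lists):
            members = vectors[assign == c]
            if len(members):  # leave an empty cluster's centroid where it is
                centroids[c] = members.mean(axis=0)
    assign = nearest_centroid(vectors, centroids)
    lists = [np.flatnonzero(assign == c) for c in range(n_lists)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, n_probe):
    """Scan only the n_probe clusters whose centroids are closest to the query."""
    centroid_d = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(centroid_d)[:n_probe]
    candidates = np.concatenate([lists[c] for c in probe])
    d = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(d)[:k]]

# Toy data: 20 well-separated blobs of 250 points each in 32 dimensions.
rng = np.random.default_rng(1)
centers = rng.standard_normal((20, 32)) * 10.0
vectors = np.repeat(centers, 250, axis=0) + rng.standard_normal((5000, 32))
query = vectors[7] + 0.01 * rng.standard_normal(32)

centroids, lists = build_ivf(vectors, n_lists=20)
hits = ivf_search(query, vectors, centroids, lists, k=5, n_probe=4)
```

`n_probe` is the recall dial: probing more clusters finds more of the true neighbours and costs more distance computations. Probing all of them is brute force again.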
Cosine cares about direction from the origin. Euclidean cares about position. Same data, different neighbours.
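A two-point example shows the disagreement. The coordinates are toy values picked so the effect is obvious: `A` lies in the same direction as the query but far away, `B` sits nearby but points elsewhere.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    return math.dist(a, b)

points = {"A": (5.0, 5.0), "B": (1.5, 0.2)}  # toy 2-D "embeddings"
query = (1.0, 1.0)

# Cosine: higher is closer. Euclidean: lower is closer.
nearest_by_cosine = max(points, key=lambda name: cosine_similarity(query, points[name]))
nearest_by_euclidean = min(points, key=lambda name: euclidean_distance(query, points[name]))
```

Cosine picks `A`, Euclidean picks `B`. Neither is wrong; they answer different questions.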
A vector database does exactly four things:
Everything else — replication, sharding, GPU acceleration, hybrid search — is in service of doing these four well.
You can. pgvector adds a vector column type and a few index methods.
It's fine for <1M vectors with relaxed latency. Past that:
Postgres is great. It's just not built for the workload.
Three independently scaled tiers — ingest, index, query — sharing one segment store.
```python
from pymilvus import MilvusClient

# embed() stands in for your embedding function: text in, list of 768 floats out.
client = MilvusClient(uri="./milvus.db")
client.create_collection(collection_name="docs", dimension=768)
client.insert(collection_name="docs", data=[
    {"id": 1, "vector": embed("the quick brown fox"), "text": "..."},
])
```
That's it. Five lines, including the import.
```python
results = client.search(
    collection_name="docs",
    data=[embed("a fast tan-coloured canine")],
    limit=3,
    output_fields=["text"],
)
```
`results[0]` holds the top 3 hits for the first query vector, ranked by similarity, with the original text attached. The whole API surface is roughly a dozen methods.
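Reading the hits back out might look like the sketch below. It assumes each hit is a dict with `id`, `distance`, and an `entity` dict holding the requested output fields; check that against the pymilvus version you run. `format_hits` is a hypothetical helper, and `example_results` is a hand-written stand-in with placeholder values, not real search output.

```python
def format_hits(results):
    """Flatten the hits for the first query into (id, distance, text) tuples."""
    hits = results[0]  # one list of hits per query vector
    return [(h["id"], h["distance"], h["entity"]["text"]) for h in hits]

# Stand-in data in the assumed result shape; the distance is a placeholder.
example_results = [[
    {"id": 1, "distance": 0.5, "entity": {"text": "the quick brown fox"}},
]]

rows = format_hits(example_results)
```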
numpy.argsort over a matrix is faster.
Reach for a vector DB when you have scale × latency × always-on. Otherwise, simpler is faster.
Pick a query. Watch the cosine similarity scores against ten pre-embedded documents.
The top three are highlighted. This is the entire RAG retrieval step, in one chart.
Embed-and-search is the cheap, fast part. The LLM call is the expensive part. The vector DB controls what the LLM sees.
RAG keeps the prompt small and relevant. Big context windows make RAG better, not obsolete.
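"Controls what the LLM sees" can be sketched as a prompt builder that keeps only the best-scoring chunks. Everything here is illustrative: `build_prompt` is a hypothetical helper, the prompt wording is one choice among many, and the hits and scores are made-up values.

```python
def build_prompt(question, hits, max_chunks=3):
    """Put only the best-scoring chunks in front of the LLM."""
    best = sorted(hits, key=lambda h: h["score"], reverse=True)[:max_chunks]
    context = "\n\n".join(f"[{i + 1}] {h['text']}" for i, h in enumerate(best))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Made-up retrieval results: text plus a similarity score.
hits = [
    {"text": "HNSW is a graph-based ANN index.", "score": 0.91},
    {"text": "Postgres supports JSON columns.", "score": 0.12},
    {"text": "IVF partitions vectors into clusters.", "score": 0.84},
    {"text": "ANN trades recall for speed.", "score": 0.77},
]

prompt = build_prompt("How do ANN indexes work?", hits)
```

The low-scoring Postgres chunk never reaches the model, which is the point: fewer tokens, less noise.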
Vector DBs can be 40–50% of your AI app bill at scale.
Most of that is RAM (indexes want to be in memory) and replicas (you want HA). It's possible to cut this by 5–10× with disk-based indexes, quantization, and tiered storage — but only if you architect for it from day one.
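Quantization is the easiest of those levers to show. The sketch below is a naive int8 scalar quantizer with a single scale for the whole array. It is an assumption-laden toy, not how any particular database does it. On raw vectors it gives 4×; the larger savings quoted above come from combining it with the other techniques.

```python
import numpy as np

def quantize_int8(vectors):
    """Map float32 values to int8 using one scale for the whole array."""
    scale = np.abs(vectors).max() / 127.0
    codes = np.round(vectors / scale).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1_000, 768)).astype(np.float32)  # stand-in embeddings

codes, scale = quantize_int8(vectors)
compression = vectors.nbytes / codes.nbytes  # 4 bytes per value down to 1
max_error = np.abs(dequantize(codes, scale) - vectors).max()
```

Each value is off by at most half a quantization step, which is usually small enough that neighbour rankings barely move.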
The 301 talk covers cost engineering end-to-end.
| Tool | Sweet spot |
|---|---|
| Milvus | Open-source, distributed, billion-scale, GPU-accelerated |
| Zilliz Cloud | Managed Milvus |
| Pinecone | Managed-only, simple API, smaller-scale |
| Weaviate | Built-in modules (transformers, RAG), good DX |
| Qdrant | Single-binary, Rust, great filtering |
| pgvector | "I already have Postgres" |
All six do ANN. Differences are operations, scale, and developer experience.
If this clicked, the 201 talk goes deep on production systems:
Then 301 covers cost engineering: quantization, tiered storage, GPU economics.