You shipped a vector DB to production. Indexes are humming, hybrid search is dialed, replicas are sized. The dashboard is green.
Then the bill arrived.
Four decision tools you can use against your real bill on Monday:
Concept-first, with pymilvus for the concrete shape.
100M × 768 dims × 4 bytes/float = 307 GB of raw vectors.
That's the floor. Your HNSW graph adds 30–80% on top. Replicas multiply everything.
RAM is roughly $5/GB/month at AWS list prices. Do the math: a billion 768-dim vectors at full precision, three replicas, is ~$50K/month just for RAM. Before you've served a single query.
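The arithmetic above, as a minimal sketch you can rerun with your own numbers. The function name and the `graph_overhead` parameter are illustrative, and the $5/GB/month figure is the rough list price quoted above, not a quote for your account.

```python
def ram_gb(n_vectors: int, dims: int, bytes_per_dim: float = 4.0,
           graph_overhead: float = 0.0, replicas: int = 1) -> float:
    """RAM in decimal GB: raw vectors, plus index overhead, times replicas."""
    raw_bytes = n_vectors * dims * bytes_per_dim
    return raw_bytes * (1.0 + graph_overhead) * replicas / 1e9


RAM_USD_PER_GB_MONTH = 5.0  # rough list price from the text; substitute your own

floor_gb = ram_gb(100_000_000, 768)                  # raw FP32 vectors only
billion_gb = ram_gb(1_000_000_000, 768, replicas=3)  # 1B vectors, 3 replicas

print(f"100M floor: {floor_gb:.1f} GB")
print(f"1B x 3 replicas: {billion_gb:,.0f} GB "
      f"-> ${billion_gb * RAM_USD_PER_GB_MONTH:,.0f}/month")
```

Set `graph_overhead` to something between 0.3 and 0.8 to see what the HNSW graph adds on top of the floor.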
Quantisation is how you push that ceiling down. Often by 8×. Sometimes by 32×.
Product quantisation. Chop a vector into m sub-vectors. For each sub-vector position, learn a small codebook (typically 256 entries — one byte). Store each vector as m codebook indices.
A 768-dim FP32 vector is 3072 bytes raw. With m=16, nbits=8, it's 16 bytes — a 192× compression ratio.
The recall cost: each sub-vector is approximated to its nearest codebook entry. Distances become approximate. The whole point of the next two slides is the knobs that control how approximate.
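A toy version of the mechanics, assuming NumPy and a few plain k-means steps per sub-vector position. This is a sketch for intuition, not how Milvus or any production library trains its codebooks; the function names and the synthetic data are made up for the example.

```python
import numpy as np


def train_pq(x, m, nbits=8, iters=5, seed=0):
    """Learn one codebook per sub-vector position with a few k-means steps."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    k, sub = 2 ** nbits, d // m
    codebooks = np.empty((m, k, sub), dtype=np.float32)
    for j in range(m):
        part = x[:, j * sub:(j + 1) * sub]
        # Initialise centroids from k distinct training vectors.
        centroids = part[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(iters):
            dist = ((part[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for c in range(k):
                members = part[assign == c]
                if len(members):
                    centroids[c] = members.mean(0)
        codebooks[j] = centroids
    return codebooks


def encode_pq(x, codebooks):
    """Store each vector as m codebook indices (one byte each when nbits <= 8)."""
    m, _, sub = codebooks.shape
    codes = np.empty((x.shape[0], m), dtype=np.uint8)
    for j in range(m):
        part = x[:, j * sub:(j + 1) * sub]
        dist = ((part[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(-1)
        codes[:, j] = dist.argmin(1).astype(np.uint8)
    return codes


def decode_pq(codes, codebooks):
    """Rebuild approximate vectors by concatenating the chosen centroids."""
    m = codebooks.shape[0]
    return np.hstack([codebooks[j][codes[:, j]] for j in range(m)])


# Synthetic stand-in for embeddings: 2000 vectors, 64 dims, m=8 sub-vectors.
x = np.random.default_rng(1).normal(size=(2000, 64)).astype(np.float32)
codebooks = train_pq(x, m=8)
codes = encode_pq(x, codebooks)
x_hat = decode_pq(codes, codebooks)

print("bytes per vector:", x[0].nbytes, "->", codes[0].nbytes)
print("mean squared reconstruction error:", float(np.mean((x - x_hat) ** 2)))
```

The reconstruction error is the recall cost in miniature: every distance computed against `x_hat` inherits it.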
| Param | What it does | Bigger means |
|---|---|---|
| m | number of sub-vectors | finer approximation, slower build, proportionally larger code (m × nbits bits per vector) |
| nbits | bits per sub-code (codebook size = 2^nbits) | bigger codebook, better recall, more RAM for codebooks |
Compression ratio: (d × 32) / (m × nbits). A 768-dim vector at m=16, nbits=8 compresses 192×.
Reasonable starting point: m=16, nbits=8. Bump m if recall is short, drop it if RAM is tight.
There are diminishing returns past roughly m=64 on most datasets — at that point each sub-vector covers about 12 dimensions, well within the codebook's expressive range. Exact sweet spot is dataset-dependent; if in doubt, benchmark before pushing m higher.
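The ratio formula is small enough to sweep. A quick sketch, with an illustrative function name, showing how code size, dimensions per sub-vector, and compression move as m grows on a 768-dim FP32 vector:

```python
def pq_compression_ratio(dims: int, m: int, nbits: int, source_bits: int = 32) -> float:
    """(d x 32) / (m x nbits): raw bits over PQ code bits."""
    return (dims * source_bits) / (m * nbits)


DIMS, NBITS = 768, 8
for m in (16, 32, 64, 96):
    code_bytes = m * NBITS // 8
    print(f"m={m:>3}  dims/sub-vector={DIMS // m:>2}  "
          f"code={code_bytes:>3} B  ratio={pq_compression_ratio(DIMS, m, NBITS):.0f}x")
```

Pushing m from 16 to 64 costs a 4× larger code, which is why the benchmark should come before the bump.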
Scalar quantisation. Quantize each individual dimension to fewer bits — fp16 (2× compression, basically lossless), int8 (4× compression, ~99% recall on most embeddings).
No codebook. No build cost worth mentioning. No knobs.
Reach for SQ first. If 4× compression is enough — and for a lot of workloads it is — you're done. PQ and binary are for when SQ leaves money on the table.
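What int8 scalar quantisation can look like, assuming NumPy and a per-dimension min/max range. That range scheme is one common choice picked for the sketch, not necessarily what any given Milvus index does internally, and the helper names are illustrative.

```python
import numpy as np


def sq8_fit(x):
    """Per-dimension offset and step so each dimension maps onto 0..255."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    return lo, scale


def sq8_encode(x, lo, scale):
    """One byte per dimension."""
    return np.clip(np.rint((x - lo) / scale), 0, 255).astype(np.uint8)


def sq8_decode(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return codes.astype(np.float32) * scale + lo


# Synthetic stand-in for 768-dim FP32 embeddings.
x = np.random.default_rng(0).normal(size=(1000, 768)).astype(np.float32)
lo, scale = sq8_fit(x)
codes = sq8_encode(x, lo, scale)
x_hat = sq8_decode(codes, lo, scale)

print("compression:", x.nbytes // codes.nbytes, "x")
print("worst-case error per dimension:", float(np.abs(x - x_hat).max()))
```

The error per dimension is bounded by half a quantisation step, which is why recall barely moves.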
The reason most production decks skip this slide: it's boring. Boring is the point.
One bit per dimension. Distance becomes Hamming distance, computed with a CPU popcount instruction — single-cycle on modern hardware.
768-dim FP32 → 96 bytes. 32× compression.
The catch: recall craters on its own. Vanilla binary on dense embeddings tops out around 0.7–0.85 recall@10 — unusable for most applications.
Almost nobody runs binary as the final stage. The pattern is: binary for fast candidate generation (top-1000), then exact rescore against the full-precision vectors. This two-stage approach keeps the 32× compression win while recovering near-FP32 recall on the final ranked list.
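The binary stage in its smallest form, in pure Python. Sign-based binarisation is an assumption here (it is the simplest scheme, not the only one), and the helper names are illustrative.

```python
def to_bits(vector) -> int:
    """Pack sign bits into one Python int: bit i is 1 when vector[i] > 0."""
    code = 0
    for i, value in enumerate(vector):
        if value > 0:
            code |= 1 << i
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits: XOR, then count the ones (a popcount)."""
    return bin(a ^ b).count("1")


a = to_bits([0.5, -1.0, 2.0, -3.0])   # bits 0 and 2 set
b = to_bits([-0.5, 1.0, 2.0, -3.0])   # bits 1 and 2 set
print("codes:", bin(a), bin(b), "hamming:", hamming(a, b))

bytes_per_vector = (768 + 7) // 8     # one bit per dimension, rounded up to bytes
print("768-dim binary code:", bytes_per_vector, "bytes")
```

Real engines do the same XOR-and-count on packed machine words, which is where the speed comes from.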
PQ carves the vector space into equal sub-dimensions and quantizes each independently. The assumption: every sub-dimension carries roughly equal information.
Real embeddings break that assumption. Variance concentrates along a few principal directions. The rest is near-zero — those PQ cells sit empty, their codebook entries never assigned to any vector.
Pre-multiply every vector by a learned rotation matrix R. One matrix multiply per vector, done offline. The rotation spreads variance evenly across all sub-dimensions — no dimension starved, no codebook entry wasted.
At query time: rotate the query with the same R (one matrix multiply), then run standard PQ distance. That single multiply is negligible next to the search itself.
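To see why a rotation helps, here is a sketch that uses a random orthogonal matrix as a stand-in for R. Real OPQ learns R from the data; the random matrix, the synthetic variance profile, and the variable names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 8
sub = d // m

# Synthetic embeddings with variance concentrated in the first 4 dimensions.
stds = np.full(d, 0.1)
stds[:4] = 10.0
x = rng.normal(size=(5000, d)) * stds


def subspace_variance(data):
    """Total variance that lands in each of the m PQ sub-vectors."""
    return data.var(axis=0).reshape(m, sub).sum(axis=1)


# Random orthogonal matrix from a QR decomposition (stand-in for a learned R).
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
x_rot = x @ rotation

before = subspace_variance(x)
after = subspace_variance(x_rot)
imbalance_before = before.max() / before.min()
imbalance_after = after.max() / after.min()

print(f"max/min subspace variance before: {imbalance_before:,.0f}")
print(f"max/min subspace variance after:  {imbalance_after:,.1f}")
```

Because the matrix is orthogonal, vector norms and pairwise distances are unchanged; only the way variance is spread across the sub-vectors changes.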
The pattern that keeps recall intact at a fraction of the RAM:
1. Run quantized ANN search (binary or PQ) to retrieve the top-N candidates (e.g., top-1000).
2. Exactly rescore those candidates against the original full-precision vectors, which can live on SSD or even cold storage because only N are fetched.
The key insight: you only need full precision for the small candidate set, not for the whole index. Most production systems running binary quantisation or aggressive PQ already use this pattern implicitly. Approximate cost: rescoring is cheap when N is small. 1000 rescores at FP32 cost ~0.3 ms of compute on a modern core; fetching those vectors from SSD adds I/O time on top.
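The two stages, sketched with NumPy on synthetic unit vectors. Stage 1 uses packed sign bits and Hamming distance; stage 2 rescores by exact inner product. The function name, the dataset, and the choice of sign bits as the coarse code are illustrative assumptions, and a real system would use an ANN index rather than a brute-force Hamming scan.

```python
import numpy as np


def two_stage_search(query, vectors, codes, n_candidates=1000, k=10):
    """Coarse Hamming search over binary codes, then exact rescore of the survivors."""
    q_code = np.packbits(query > 0)
    # Stage 1: Hamming distance between the query code and every stored code.
    hamming = np.unpackbits(np.bitwise_xor(codes, q_code), axis=1).sum(axis=1)
    candidates = np.argpartition(hamming, n_candidates - 1)[:n_candidates]
    # Stage 2: exact inner product, computed only for the candidate set.
    scores = vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]


rng = np.random.default_rng(0)
vectors = rng.normal(size=(5000, 128)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
codes = np.packbits(vectors > 0, axis=1)          # 16 bytes per 128-dim vector

# A query that is a slightly perturbed copy of stored vector 42.
query = vectors[42] + 0.01 * rng.normal(size=128).astype(np.float32)
top = two_stage_search(query, vectors, codes)
print("top-10 ids:", top.tolist())
```

Only `vectors[candidates]` is touched at full precision, so the full-precision array is the part that can move off RAM.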
Trade-offs across quantisation strategies - nothing comes for free!
Numbers are illustrative — representative of Ada-002/BGE-m3-style embeddings.
Pick the simplest option that meets your recall floor.
| Technique | Milvus index/type | Compression | Recall cost | Build cost | Best for |
|---|---|---|---|---|---|
| FP32 | FLAT, IVF_FLAT, HNSW | 1× | none | none | ground truth, reranking, refiner stage |
| FP16 / BF16 | FLOAT16_VECTOR, BFLOAT16_VECTOR | 2× | <1% | none | free win when embeddings are already half-precision |
| INT8 | INT8_VECTOR | 4× | ~1–2% | none (client-side) | models that natively output int8 (e.g. Cohere int8) |
| SQ8 | IVF_SQ8, HNSW_SQ | 4× | ~1% | negligible | first move, almost always worth it |
| PQ | IVF_PQ, HNSW_PQ | 4–32× | 3–20% | minutes–hours | RAM-constrained, billion-scale |
| PRQ | HNSW_PRQ | 8–32× | 2–15% | minutes–hours | better recall than PQ at the same ratio |
| RaBitQ (1-bit + rerank) | IVF_RABITQ | 32× | <2% (two-stage) | fast | candidate generation, extreme scale |
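```python
from pymilvus import MilvusClient

# Assumes a running Milvus server at this address; the embedded Milvus Lite
# file ("milvus.db") supports only a small subset of index types.
client = MilvusClient(uri="http://localhost:19530")

# The three blocks below are alternatives: a vector field holds one index at a time.

# FP32: full precision, maximum RAM
fp32 = client.prepare_index_params()
fp32.add_index(field_name="vector", index_type="HNSW", metric_type="COSINE",
               params={"M": 16, "efConstruction": 200})
client.create_index("docs", fp32)

# SQ8: 4× compression, ~99% recall
sq8 = client.prepare_index_params()
sq8.add_index(field_name="vector", index_type="HNSW_SQ", metric_type="COSINE",
              params={"M": 16, "efConstruction": 200, "sq_type": "SQ8"})
client.create_index("docs", sq8)

# PQ: 192× compression at m=16, nbits=8 on 768 dims (IVF_PQ is plain PQ, no OPQ rotation)
pq = client.prepare_index_params()
pq.add_index(field_name="vector", index_type="IVF_PQ", metric_type="COSINE",
             params={"nlist": 4096, "m": 16, "nbits": 8})
client.create_index("docs", pq)
```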
Re-quantize on model swap: Each embedding model has its own distribution. Switching from text-embedding-ada-002 to text-embedding-3-large without retraining codebooks silently degrades recall — sometimes by 20%+. Always retrain after a model upgrade.
Codebook drift on streaming ingest: PQ codebooks are trained offline on a snapshot. As the distribution shifts (new data, seasonal patterns), recall silently erodes. Schedule periodic codebook retraining or monitor recall continuously.
Recall measured on training data: If you benchmark recall on the same vectors used to train the codebook, you get an optimistic number. Measure on a held-out set — ideally the same query distribution as production.
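Recall@k itself is a few lines. A minimal sketch, assuming you already have exact (brute-force) neighbour ids for a held-out query set and the ids your quantized index returned for the same queries; the function name and the tiny example data are illustrative.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbours that the approximate search returned."""
    hits = 0
    for approx, exact in zip(approx_ids, exact_ids):
        hits += len(set(approx[:k]) & set(exact[:k]))
    return hits / (k * len(exact_ids))


# Two held-out queries, k=3: ground truth vs. what the quantized index returned.
exact = [[1, 2, 3], [4, 5, 6]]
approx = [[1, 2, 9], [4, 8, 7]]
print("recall@3:", recall_at_k(approx, exact, k=3))
```

Run it on queries the codebook never saw, and rerun it after every model swap or retraining job.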
Three tiers, three orders-of-magnitude cost difference:
PQ summary vectors live in RAM (~12 bytes per vector instead of 3072). The Vamana graph lives on NVMe, fetched on graph traversal. Latency: 1–5ms p99 vs <1ms for HNSW-in-RAM. Cost: ~10× cheaper per vector than hot tier.
At 100M × 768-dim vectors:
| | RAM usage | NVMe | Monthly cost |
|---|---|---|---|
| HNSW in RAM | ~9 GB | — | ~$45 |
| DiskANN | ~1.5 GB (PQ codes) | ~18 GB | ~$9.30 |
The key lever is the hot fraction — the fraction of your corpus that accounts for most queries. Zipfian distributions are common: 10% of vectors take 90% of queries. You only need RAM for that hot 10%. The rest can be on NVMe or colder.
On a query, the coordinator checks cache before going to disk or object store.
Two knobs swing $/recall by 5×: mmap.enabled and the cache warmup strategy.
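```python
from pymilvus import MilvusClient

# Assumes a running Milvus server rather than the embedded Milvus Lite file.
client = MilvusClient(uri="http://localhost:19530")

# Enable mmap for a collection (warm tier, OS-managed).
# The collection has to be released before the property changes, then reloaded.
client.release_collection("docs")
client.alter_collection_properties(
    collection_name="docs",
    properties={"mmap.enabled": True},
)
client.load_collection("docs")

# Configure segment cache (NVMe cache size, hot segments)
# In milvus.yaml (check key names against your Milvus version):
# queryNode:
#   cache:
#     warmup: async            # pre-load hot segments on startup
#     chunkMemoryFactor: 4.0   # grow cache to 4× chunk size
```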
Wins:
Losses:
itopk_size parameter (larger = better recall, more compute).
A GPU node costs more per hour than a CPU node. It also handles more QPS per dollar, but only above a threshold. The question isn't "GPU vs CPU"; it's "at what QPS does GPU become cheaper per query?"
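One way to find that threshold, as a sketch. Every number here is hypothetical (node prices and per-node QPS are placeholders, not AWS quotes or benchmark results); swap in your own measurements.

```python
def fleet_cost(qps: int, per_node_qps: int, usd_per_node_hour: float,
               hours: int = 730) -> float:
    """Monthly cost of the smallest fleet that can serve `qps`."""
    nodes = -(-qps // per_node_qps)  # ceiling division: you cannot rent half a node
    return nodes * usd_per_node_hour * hours


def break_even_qps(cpu: dict, gpu: dict, max_qps: int = 20_000, step: int = 100):
    """First QPS (in `step` increments) where the GPU fleet is strictly cheaper."""
    for qps in range(step, max_qps + 1, step):
        if fleet_cost(qps, **gpu) < fleet_cost(qps, **cpu):
            return qps
    return None


# Hypothetical inputs: replace with your own pricing and per-node benchmarks.
CPU = {"per_node_qps": 500, "usd_per_node_hour": 1.0}
GPU = {"per_node_qps": 5000, "usd_per_node_hour": 5.0}

threshold = break_even_qps(CPU, GPU)
print("GPU becomes cheaper at about", threshold, "QPS")
print("at 1,000 QPS: CPU $", fleet_cost(1000, **CPU), "vs GPU $", fleet_cost(1000, **GPU))
```

Below the threshold the GPU node sits mostly idle and you pay for it anyway; above it, one GPU node replaces several CPU nodes.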
Below the break-even QPS:
Above the break-even QPS:
The next slide shows the actual crossover for HNSW vs CAGRA on current AWS pricing.
Pick the simplest option that meets your workload shape.
| Workload | QPS | Recommendation |
|---|---|---|
| Heavy index build (>10B vectors) | any | GPU (CAGRA) — 10–50× build speedup |
| High-QPS search | >3,000 QPS | GPU (CAGRA) — $/query break-even |
| Low-QPS or latency-sensitive | <500 QPS | CPU fleet — GPU idle tax too high |
| Steady-state mixed | 500–3,000 QPS | Benchmark both; CAGRA starts to win toward the top of this range |
Build the CAGRA index on GPU and route high-QPS search to a dedicated GPU resource group.
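```python
from pymilvus import MilvusClient
from pymilvus.client.types import ResourceGroupConfig

# GPU indexes need a GPU-enabled Milvus server, not the embedded Milvus Lite file.
client = MilvusClient(uri="http://localhost:19530")

# Build a CAGRA index on GPU
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="GPU_CAGRA",
    metric_type="COSINE",
    params={"intermediate_graph_degree": 64, "graph_degree": 32},
)
client.create_index("docs", index_params)

# Pin GPU workloads to a dedicated resource group
# (requires Milvus resource group config; verify the config shape
# against your pymilvus version)
client.update_resource_groups({
    "gpu_search_group": ResourceGroupConfig(
        requests={"node_num": 1},
        limits={"node_num": 1},
        node_filter={"node_labels": {"gpu": "true"}},
    )
})
```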
One resource group per GPU node type lets you route high-QPS search to GPU while keeping low-QPS, latency-sensitive traffic on the CPU fleet.
RAM dominates — replica count is a direct multiplier on this number.
Reason about any vector DB bill with three terms:
monthly_cost ≈ (vectors × bytes/vector × replicas × $/GB/mo) ← RAM
+ (QPS / per_node_QPS × $/node/hr × 730) ← compute
+ (total_bytes × $/GB/mo) ← storage
This fits on a whiteboard. The two knobs that move the needle: bytes/vector (quantisation) and replicas.
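The same three terms as a function, so you can plug in your own inputs. A sketch: the function and parameter names are illustrative, the node and storage prices in the example are hypothetical, and the node count is rounded up where the whiteboard formula leaves it fractional.

```python
import math


def monthly_cost(vectors: int, bytes_per_vector: float, replicas: int,
                 qps: float, per_node_qps: float,
                 ram_usd_per_gb: float = 5.0,
                 node_usd_per_hr: float = 0.0,
                 storage_usd_per_gb: float = 0.0) -> dict:
    """RAM + compute + storage, per month, in USD."""
    index_gb = vectors * bytes_per_vector / 1e9
    ram = index_gb * replicas * ram_usd_per_gb
    compute = math.ceil(qps / per_node_qps) * node_usd_per_hr * 730
    storage = index_gb * storage_usd_per_gb
    return {"ram": ram, "compute": compute, "storage": storage,
            "total": ram + compute + storage}


# 100M x 768-dim, 3 replicas. Node and storage prices are hypothetical.
common = dict(replicas=3, qps=800, per_node_qps=500,
              node_usd_per_hr=1.0, storage_usd_per_gb=0.02)
fp32 = monthly_cost(100_000_000, 768 * 4, **common)   # 3072 bytes/vector
sq8 = monthly_cost(100_000_000, 768 * 1, **common)    # 768 bytes/vector
pq16 = monthly_cost(100_000_000, 16, **common)        # m=16, nbits=8

for name, cost in (("FP32", fp32), ("SQ8", sq8), ("PQ m=16", pq16)):
    print(f"{name:<8} RAM ${cost['ram']:>8,.0f}   total ${cost['total']:>8,.0f}")
```

Holding QPS fixed, the RAM term moves from thousands of dollars to tens as bytes/vector drops, while the compute term stays put.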
100M × 768-dim dataset. Three configs, three very different bills.
| Config | Index size | RAM × 3 replicas | $/month |
|---|---|---|---|
| FP32 (baseline) | 307 GB | 921 GB | ~$4,605 |
| SQ8 (4× compression) | ~77 GB | ~231 GB | ~$1,155 |
| OPQ-PQ m=16 (192×) | ~1.6 GB | ~4.8 GB | ~$24 |
Quantisation is the largest single lever in the cost formula. Pull it first.
The 10/90 rule: 10% of vectors absorb 90% of queries. The rest can live on NVMe.
Same dataset: 100M × 768-dim, SQ8, 3 replicas.
| Config | Monthly cost |
|---|---|
| 100% in RAM | ~$1,155 |
| 10% hot (RAM) + 90% warm (NVMe) | ~$137 |
~8× cost reduction at acceptable p99. The warm tier adds 1–5 ms to p99 latency — invisible to most applications.
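The tiering arithmetic as a sketch. The NVMe price of $0.10/GB/month is an assumption (the deck does not state it); with it, the result lands within rounding of the table. The function name is illustrative.

```python
def tiered_cost(total_gb: float, hot_fraction: float,
                ram_usd_per_gb: float = 5.0,
                nvme_usd_per_gb: float = 0.10) -> float:
    """Monthly cost when only the hot fraction is held in RAM."""
    hot = total_gb * hot_fraction * ram_usd_per_gb
    warm = total_gb * (1.0 - hot_fraction) * nvme_usd_per_gb
    return hot + warm


TOTAL_GB = 231.0  # SQ8 index x 3 replicas, from the quantisation table

all_ram = tiered_cost(TOTAL_GB, hot_fraction=1.0)
tiered = tiered_cost(TOTAL_GB, hot_fraction=0.10)
print(f"100% in RAM: ${all_ram:,.0f}/month")
print(f"10% hot:     ${tiered:,.0f}/month  ({all_ram / tiered:.1f}x cheaper)")
```

Sweep `hot_fraction` from 0.05 to 0.3 to see how sensitive the bill is to getting that estimate wrong.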
How to find your hot fraction: query logs, access-count metadata, or recency windows. Milvus MMap lets you pin hot segments in RAM and spill the rest transparently.
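If you have query logs, the hot fraction falls out of a counter. A minimal sketch, assuming the log can be reduced to a flat list of the vector ids that queries actually returned; the function name and the toy log are illustrative.

```python
from collections import Counter


def hot_fraction(hit_ids, corpus_size: int, coverage: float = 0.9) -> float:
    """Smallest share of the corpus whose vectors account for `coverage` of all hits."""
    counts = sorted(Counter(hit_ids).values(), reverse=True)
    target = coverage * len(hit_ids)
    covered = 0
    for n_hot, count in enumerate(counts, start=1):
        covered += count
        if covered >= target:
            return n_hot / corpus_size
    return len(counts) / corpus_size


# Toy log: a 10-vector corpus where vector 1 takes 95 of 100 hits.
log = [1] * 95 + [2] * 3 + [3] * 2
print("hot fraction:", hot_fraction(log, corpus_size=10))
```

Recompute it over a rolling window; a hot set that drifts week to week argues for a larger RAM tier than the snapshot suggests.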
Every replica is a full copy of the index in RAM. Three replicas means three RAM bills.
| Replicas | RAM cost multiplier | What you actually need |
|---|---|---|
| 1× | $X | Dev, batch jobs, single-AZ |
| 2× | $2X | HA + double the QPS headroom |
| 3× | $3X | Multi-AZ HA, or ~3× peak QPS |
The formula: replicas ≥ ceil(target_QPS / per_replica_QPS) + 1 (one for failure domain).
Example: one node handles 500 QPS, target is 800 QPS. You need ceil(800/500) + 1 = 3 replicas — but only if you need a failure-domain spare. If not, 2 is enough, and you just saved 33% of your RAM bill.
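The sizing rule as a function, using the numbers from the example. The function name and the `failure_spare` flag are illustrative.

```python
import math


def replicas_needed(target_qps: float, per_replica_qps: float,
                    failure_spare: bool = True) -> int:
    """ceil(target / per-replica), plus one spare if you need a failure domain."""
    base = math.ceil(target_qps / per_replica_qps)
    return base + (1 if failure_spare else 0)


with_spare = replicas_needed(800, 500)
without_spare = replicas_needed(800, 500, failure_spare=False)
print("with failure-domain spare:", with_spare)
print("without:", without_spare)
print(f"RAM saved by dropping the spare: {1 - without_spare / with_spare:.0%}")
```

The only input worth arguing about is `per_replica_qps`, which is why the benchmark comes first.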
Don't size for peak fear. Benchmark your per-replica QPS first.
The honest comparison changes once you account for both denominators in the compute term.
Self-hosted Milvus (OSS)
Zilliz Cloud (managed)
The math: if Cardinal gives you 5× QPS per node, you need one-fifth the nodes. That can more than offset the managed premium. Model it with your actual QPS target and per-node benchmark before assuming self-hosted is cheaper.
The trilogy payoff. Same shape as 201's recall-latency curve — but the y-axis is cost.
| Config | Recall | Monthly cost | Position |
|---|---|---|---|
| FP32 | 1.00 | ~$4,605 | Top-right: perfect recall, full price |
| SQ8 | ≈0.99 | ~$1,155 | Middle: 75% cheaper, nearly same recall |
| OPQ-PQ m=16 | ≈0.85 | ~$24 | Bottom-left: 99% cheaper, recall trade-off |
Every point on this frontier is optimal for someone. Pick based on your recall floor:
This is what 301 adds to the trilogy: 101 was about meaning, 201 was about production, 301 is about the bill.
Egress fees creep up fast
Cross-AZ or cross-region replication traffic rarely appears on the initial cost estimate. Three replicas across different AZs, each returning 100 GB of query results per day — that's a meaningful egress line item. Model it before you choose a topology.
Idle dev clusters are silent budget leaks
A staging vector DB running 24/7 at full capacity "for testing" can be 20–30% of the production bill. Scale down or spin down dev clusters between test runs. Milvus Lite or a single-replica cluster at SQ8 covers most staging needs.
Optimizing $/vector instead of $/query
Storing vectors cheaply matters less than the cost per search. A 32× compressed index that forces 3× more QPS capacity to meet your recall SLA isn't cheaper — it's more expensive. Always evaluate cost at the query level, not the storage level.
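A sketch of that comparison at the query level. All inputs are hypothetical: the RAM bills assume 100M × 768-dim vectors, 3 replicas, $5/GB/month, and the node counts stand in for a compressed index that needs 3× the compute to hold the same recall SLA.

```python
SECONDS_PER_MONTH = 730 * 3600


def usd_per_million_queries(ram_usd: float, nodes: int,
                            usd_per_node_hour: float, qps: float) -> float:
    """Monthly RAM plus compute, divided by the queries served in that month."""
    monthly = ram_usd + nodes * usd_per_node_hour * 730
    return monthly / (qps * SECONDS_PER_MONTH) * 1e6


QPS = 800  # hypothetical steady load

# SQ8: larger RAM bill, 2 nodes are enough to hit the recall SLA.
sq8 = usd_per_million_queries(ram_usd=1152.0, nodes=2, usd_per_node_hour=1.0, qps=QPS)
# 32x compressed: tiny RAM bill, but 3x the nodes to rescore enough candidates.
binary = usd_per_million_queries(ram_usd=144.0, nodes=6, usd_per_node_hour=1.0, qps=QPS)

print(f"SQ8:            ${sq8:.2f} per million queries")
print(f"32x compressed: ${binary:.2f} per million queries")
```

With these placeholder inputs the cheaper-to-store index is the more expensive one to query, which is the failure mode to check for before committing.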
1. Quantize aggressively
SQ8 is almost always worth it: about 1% recall loss, 4× RAM reduction, done. OPQ-PQ if you need to go further. The RAM line is the ceiling; lower it first.
2. Tier by access pattern
Identify your hot fraction — usually 10–20% of vectors taking 80–90% of queries. Archive the cold tail to NVMe or object store. A 10% hot fraction translates to up to ~8× cost cut at acceptable p99.
3. Right-size replicas to actual QPS
Not peak fear. Calculate ceil(target_QPS / per_replica_QPS) and add one for your failure-domain requirement — no more. One extra replica costs as much as all your storage. Benchmark the per-replica number before adding nodes.
This is the trilogy ender. From here:
$/query alert and you have the two metrics that actually catch cost incidents
Quantize. Tier. Right-size. Re-measure quarterly.