We swapped IVF+PQ for IVFFlat and got 4× recall back

We measured. On a 100k-vector SIFT benchmark, IVF+PQ topped out at recall@10 = 0.65 even when we probed every cell. IVFFlat at nprobe=64 hit 0.99. We shipped the swap.

The setup

Same caller code, same dataset, same nlist. Both indexes built with FAISS. The only thing changing is what lives inside each inverted-file cell: PQ residual codes vs raw float32 vectors.

The "default" RosalindDB shard used to be IVF+PQ, which is ~16× smaller in RAM than the float32 source vectors. That number is what every "production scale vector DB" blog post is selling you, and it is where the FAISS index-selection guidelines point when memory is "quite important" or "very important." So we shipped it.

Then we ran recall benchmarks.

The numbers

Index	Params	recall@10
IVF+PQ	`nprobe=1`	0.22
IVF+PQ	`nprobe=nlist` (all cells)	~0.65
IVFFlat	`nprobe=64`	~0.99

Read that middle row again. nprobe=nlist means we are doing a brute-force scan over every cell in the index. There is no more search to do. 0.65 is the ceiling, not the floor. Cranking nprobe higher cannot move it.

That is the load-bearing observation in this post.
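For reference, recall@10 in the table is the fraction of the true 10 nearest neighbours that show up in the index's top 10, averaged over queries. A minimal sketch of that metric, assuming the ground truth comes from an exact brute-force search; the function name and the sample IDs are illustrative, not RosalindDB code:

```python
def recall_at_k(true_ids, found_ids, k=10):
    """Mean fraction of the true top-k neighbours present in the returned top-k.

    true_ids / found_ids: one list of neighbour IDs per query, best first.
    """
    if len(true_ids) != len(found_ids):
        raise ValueError("need one result list per query")
    hits = 0
    for truth, found in zip(true_ids, found_ids):
        hits += len(set(truth[:k]) & set(found[:k]))
    return hits / (k * len(true_ids))


# Two illustrative queries with k=2: the first is fully recovered, the second half.
exact = [[7, 3], [5, 9]]
approx = [[3, 7], [5, 1]]
print(recall_at_k(exact, approx, k=2))  # 0.75
```

Order inside the top-k does not matter for this metric, only membership.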

Why the IVF+PQ ceiling is structural

IVF gives you a coarse partition: assign each vector to one of nlist Voronoi cells, search the nprobe cells closest to your query. The IVF half is lossless. You can recover any neighbour by probing enough cells. Worst case, probe all of them.
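The "IVF is lossless" claim can be checked with a toy inverted file in plain Python. This is a sketch under stated assumptions, not how FAISS implements it: the centroids are a random sample standing in for trained k-means centroids, and all names are illustrative.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cells(centroids, q, nprobe):
    """IDs of the nprobe centroids closest to q."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(centroids[c], q))
    return order[:nprobe]


def build_ivf(vectors, centroids):
    """Inverted lists: cell id -> IDs of the vectors assigned to that cell."""
    cells = {c: [] for c in range(len(centroids))}
    for vid, v in enumerate(vectors):
        cells[nearest_cells(centroids, v, 1)[0]].append(vid)
    return cells


def ivf_search(vectors, centroids, cells, q, k, nprobe):
    """Exact distances, but only over vectors in the probed cells."""
    candidates = [vid for c in nearest_cells(centroids, q, nprobe) for vid in cells[c]]
    return sorted(candidates, key=lambda vid: sq_dist(vectors[vid], q))[:k]


rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids = rng.sample(vectors, 16)  # stand-in for trained k-means centroids
cells = build_ivf(vectors, centroids)
q = [rng.random() for _ in range(8)]

exact = sorted(range(len(vectors)), key=lambda vid: sq_dist(vectors[vid], q))[:10]
full_probe = ivf_search(vectors, centroids, cells, q, k=10, nprobe=len(centroids))
print(full_probe == exact)  # True: probing every cell is an exact search
```

With raw vectors in the cells, nprobe only decides which candidates get scored; it never changes how they are scored.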

PQ is where the information is destroyed. Inside a cell, the residual (vector minus centroid) is sliced into M subvectors and each subvector is replaced by the index of the nearest codeword in a 256-entry codebook. One byte per subvector. For a 128-dim float32 vector with M=16, you are encoding 128 × 32 = 4096 bits of input as 16 × 8 = 128 bits of output. You threw away 97% of the bits. The decoded vector is the codebook reconstruction, not the original.
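The bit arithmetic above, spelled out. Note that this counts the PQ codes only, which come out 32× smaller than the raw floats; the ~16× figure quoted earlier is for the whole index in RAM, which presumably also carries per-vector IDs and other overhead (that breakdown is an assumption, the post does not give one).

```python
d = 128          # dimensions per vector
float_bits = 32  # float32
M = 16           # PQ subvectors
code_bits = 8    # one byte per subvector: index into a 256-entry codebook

raw_bits = d * float_bits  # bits in the original vector
pq_bits = M * code_bits    # bits in the PQ code
kept = pq_bits / raw_bits  # fraction of the bits that survive

print(raw_bits, pq_bits)                        # 4096 128
print(f"{1 - kept:.1%} of the bits dropped")    # 96.9% of the bits dropped
print(f"{raw_bits // pq_bits}x smaller codes")  # 32x smaller codes
```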

Recall@10 is bounded above by how often the 10 true nearest neighbours, after PQ decoding, are still the 10 nearest neighbours by reconstructed distance. That is a property of the codebook and the data distribution, not of nprobe. Search thoroughness only helps you reach the cell. Once you are there, you are ranking ghosts.
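A toy example of "ranking ghosts". This sketch uses a deliberately tiny quantizer (2 dimensions, 2 codewords per subvector) and skips the centroid/residual step that real IVF+PQ performs, so it only illustrates the mechanism: once vectors are replaced by their reconstructions, the nearest neighbour can change even though every vector was scored.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy product quantizer: 2 dims split into M=2 one-dimensional subvectors,
# each with a 2-entry codebook (real PQ uses 256 entries per subvector).
CODEBOOKS = [[0.0, 1.0], [0.0, 1.0]]


def pq_encode(v):
    """Replace each subvector by the index of its nearest codeword."""
    return [min(range(len(cb)), key=lambda i: (cb[i] - x) ** 2)
            for cb, x in zip(CODEBOOKS, v)]


def pq_decode(code):
    """Reconstruct the vector from its codeword indices."""
    return [cb[i] for cb, i in zip(CODEBOOKS, code)]


raw = {"a": [0.45, 0.45], "b": [0.90, 0.90]}
q = [0.60, 0.60]

# What the index actually stores and ranks: reconstructions, not originals.
decoded = {name: pq_decode(pq_encode(v)) for name, v in raw.items()}
a_hat, b_hat = decoded["a"], decoded["b"]  # [0.0, 0.0] and [1.0, 1.0]

true_nearest = min(raw, key=lambda name: sq_dist(raw[name], q))
pq_nearest = min(decoded, key=lambda name: sq_dist(decoded[name], q))
print(true_nearest, pq_nearest)  # a b
```

Both vectors were scored here, which is the nprobe=nlist situation, and the answer is still wrong.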

This is the part the "always use PQ" advice tends to skip. Probing more cells helps you find the right neighbourhood. It does not help you tell two close vectors apart inside that neighbourhood after both have been mashed into 128-bit codes. If the workload needs recall@10 > 0.9, PQ is the wrong knob and no amount of nprobe will save you. Matthijs Douze and the ann-benchmarks Pareto frontier both show the same shape: PQ variants live on a different curve than the lossless ones, and you cannot tune from one to the other.

You can mitigate it. OPQ rotates the data first so the subvectors are closer to independent. IVFPQ+R re-ranks the top-N candidates: with a second layer of PQ codes in FAISS's IndexIVFPQR, or against the raw vectors with IndexRefineFlat, at which point you are storing the raw vectors and paying the IVFFlat memory cost anyway. Both are real techniques. Neither closes the gap to lossless on 128-dim SIFT-class data at our scale.
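The re-ranking idea, sketched in plain Python. The lossy code here is a crude grid snap standing in for PQ, and the function names are made up for illustration; the point is only the two-stage shape: shortlist by lossy distance, then re-score the shortlist against the raw vectors.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def coarse(v, step=1.0):
    """Stand-in for a lossy code: snap each coordinate to a grid."""
    return [round(x / step) * step for x in v]


def search_with_rerank(raw, q, k, shortlist):
    """Rank by lossy distance, then re-rank the shortlist on raw vectors."""
    approx = {vid: coarse(v) for vid, v in raw.items()}
    candidates = sorted(approx, key=lambda vid: sq_dist(approx[vid], q))[:shortlist]
    return sorted(candidates, key=lambda vid: sq_dist(raw[vid], q))[:k]


raw = {
    "a": [0.45, 0.45],
    "b": [0.90, 0.90],
    "c": [2.00, 2.00],
}
q = [0.60, 0.60]

print(search_with_rerank(raw, q, k=1, shortlist=1))  # ['b']  lossy ranking alone
print(search_with_rerank(raw, q, k=1, shortlist=2))  # ['a']  fixed by re-ranking
```

Re-ranking can only repair results that made it into the shortlist, and it needs the raw vector for any ID that might appear there.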

What we gave up

A 1M-vector 128-dim shard is now ~512 MB instead of ~32 MB. The byte-budgeted shard cache (RB_SHARD_CACHE_BYTES, default 512 MB) caps how many shards are resident per Query-DP process. Fewer shards per process means more processes, or more cold-load latency when a shard gets evicted and a query for it lands.

We took that hit on purpose. RosalindDB is object-storage-first: cold shards live in S3/MinIO, not on the query host, so the worst case for an evicted shard is a fetch from object storage, not a recompute. The cache is a working-set cache, not the source of truth. Trading 16× memory for roughly 4× recall (0.22 at nprobe=1 to ~0.99; about 1.5× against the 0.65 ceiling) on the hottest path was an easy call. Trading it on every shard is harder, but the architecture absorbs it.

We also accept that some billion-vector use cases no longer fit on a single Query-DP. They were always going to need horizontal sharding. We just stopped pretending otherwise.
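The cache arithmetic behind this section, as a back-of-the-envelope sketch. It uses the round figures from the post, assumes decimal megabytes, and counts only the vector payload, so centroids, IDs and allocator overhead are ignored:

```python
MB = 1_000_000  # assumption: the post's figures are decimal megabytes

n_vectors = 1_000_000
dim = 128
flat_shard_bytes = n_vectors * dim * 4  # float32 payload of an IVFFlat shard
pq_shard_bytes = 32 * MB                # the ~32 MB IVF+PQ figure quoted above
cache_bytes = 512 * MB                  # RB_SHARD_CACHE_BYTES default

print(flat_shard_bytes // MB)           # 512
print(cache_bytes // pq_shard_bytes)    # 16 IVF+PQ shards fit in the cache
print(cache_bytes // flat_shard_bytes)  # 1 IVFFlat shard fits in the cache
```

At the default budget a single 1M-vector IVFFlat shard already fills the cache, and with real overhead it may not fit at all, so the budget is worth revisiting alongside the index type.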

When IVF+PQ is still the right index

PQ earns its keep when the binding constraint is RAM and a recall@10 of around 0.65 is acceptable. Billion-vector indexes that have to fit on a single node. Recommendation candidate generation where the downstream re-ranker fixes ordering. Workloads where "a relevant hit" matters and "the exact closest hit" does not. The FAISS guidelines are not wrong; they assume you have decided memory is the dominant cost. For RAG, agent memory, and internal semantic search, it usually is not. The vectors are not that big and the recall those workloads need is not that low.

The mistake is generalising the billion-scale recsys advice to the 10M-vector RAG case. They are different workloads.

Closing

The default in RosalindDB is now IVFFlat at nprobe=64, overridable per request and via RB_QUERY_NPROBE. If recall@10 of 0.65 is fine for your workload, IVF+PQ is a great index and you should use it. Most people I have talked to did not realise that was the trade they were making.

Before you pick an index, run the recall benchmark on your actual data and your actual queries. What is the lowest recall@10 your downstream system tolerates before the answers get visibly worse?
