Faster KNN search in Manticore: 2-pass HNSW, batched distances, and AVX-512
Faster KNN search in Manticore: 2-pass HNSW, batched distances, and AVX-512
By Sergey Nikolaev | June 2026
TL;DR: By implementing three strategic modifications to the HNSW search engine, Manticore has boosted KNN throughput by as much as 29% for high values, with performance gains exceeding 20% under concurrent workloads.
Overview of Manticore's KNN Evolution
Manticore utilizes hnswlib, an open-source implementation of the Hierarchical Navigable Small World (HNSW) algorithm, as the foundation for its K-Nearest Neighbor (KNN) search.
In the past, development primarily focused on:
- Implementing custom distance functions (e.g., for binary quantization).
- Adding prefiltering capabilities via
standard methodsACORN-1. - Introducing early termination logic.
While these additions were valuable, the core search loop remained untouched. The way hnswlib traversed neighbors, calculated distances, and managed candidate sets was stagnant. The latest updates change this by restructuring the search loop to optimize how the engine interacts with the CPU's memory hierarchy and reduces overhead.
The Three Primary Targets for Optimization
- Memory Access: Reducing inefficient patterns.
- Data Loading: Eliminating redundant loads.
- Function Calls: Removing indirect call overhead.
1. Compile-time Distance Function Specialization
Previously, the system relied on a runtime function pointer stored within the HNSW index. This pointer was called for every single candidate processed.
The Problem:
- Indirect Calls: These prevent the compiler from performing inlining.
- Branch Prediction: Frequent indirect calls create overhead for the CPU's branch predictor.
The Solution: Manticore now utilizes C++ templates to resolve distance functions at compile time. The process works as follows:
- A
switchstatement identifies the correct template specialization based on the quantization settings and the chosen metric. - The entire inner loop (traversal computation update) is executed as a single, monolithic function.
This allows the compiler to optimize instruction scheduling, register allocation, and loop unrolling across the distance calculation boundary.
2. 2-Pass Neighbor Processing
In the original HNSW implementation, neighbors were processed in a linear, single-pass fashion:
Check Visited Fetch Vector Compute Distance Update Candidates.
Because the vector data was fetched immediately before use, memory prefetch hints had insufficient time to hide latency. The new approach splits this into two distinct phases:
- Pass 1: Iterates through all neighbors, filters out those already visited, and populates a small batch array. Crucially, it issues a prefetch hint for the vector data of each unvisited neighbor.
- Pass 2: Processes the compact sequential array of IDs. Since the data was prefetched in Pass 1, it is more likely to be residing in the cache, reducing wait times.
Note: For queries without a WHERE clause (unfiltered), a "fast path" is used to skip the per-candidate filter check entirely.
3. Batched Distance Computation
The 2-pass architecture enables efficient batching. Instead of scoring candidates one by one, the engine now scores them in pairs.
Efficiency Gain: When calculating distances for two candidates simultaneously, the query vector is loaded into a register once and reused for both computations. This eliminates redundant loads of the query-side data.
Supported Metrics: The "Batch-2" functions are implemented for:
- Inner Product
- (Euclidean) distance
- Binary-quantized versions of the above
4. AVX-512 Integration
To further accelerate the math, Manticore has introduced AVX-512 support, which doubles the float processing capacity per iteration (16 floats vs. 8 in AVX2).
Technical Implementation
| Feature | Implementation Detail |
|---|---|
| Core Loop | Uses _mm512_fmadd_ps (Fused Multiply-Add) for and Inner Product. |
| Binary Quantization | Utilizes the VPOPCNTDQ extension for faster bit-counting. |
| Deployment | Ships as three variants: Baseline, AVX2, and AVX-512. |
At startup, Manticore automatically detects the CPU's instruction set and loads the most optimized library variant.
Benchmark Results
Test Environment
The following configuration was used to isolate the impact of these changes:
- Dataset:
dbpedia-openai-1M-1536-angular(1M vectors, 1536 dimensions, cosine distance). - Hardware: AMD Ryzen 7 9700X (Zen 5, 8C/16T).
- Settings:
- 1-bit binary quantization.
- Oversampling and rescoring:
Disabled. - Early termination:
Disabled.
Why Zen 5? Zen 5 was selected because it features native 512-bit datapaths. This avoids the "split-512" execution and the aggressive frequency downclocking often seen in older Intel AVX-512 implementations, ensuring the results reflect algorithmic gains rather than CPU throttling.
Key Findings
The team compared the new AVX2 build against the previous AVX2 build to isolate the algorithmic improvements (2-pass, batching, and templates) from the hardware acceleration (AVX-512).
- Increased Throughput: Up to 29% gain at high .
- Concurrency Scaling: Over 20% improvement under multi-threaded load.
- Reduced Latency: Better cache utilization via prefetching.
