Faster KNN search in Manticore: 2-pass HNSW, batched distances, and AVX-512

By Sergey Nikolaev | June 2026

TL;DR: By implementing three strategic modifications to the HNSW search engine, Manticore has boosted KNN throughput by as much as 29% for high $k$ values, with performance gains exceeding 20% under concurrent workloads.

Overview of Manticore's KNN Evolution

Manticore utilizes hnswlib, an open-source implementation of the Hierarchical Navigable Small World (HNSW) algorithm, as the foundation for its K-Nearest Neighbor (KNN) search.

In the past, development primarily focused on:

Implementing custom distance functions (e.g., for binary quantization).
Adding prefiltering capabilities via ~~standard methods~~ ACORN-1.
Introducing early termination logic.

While these additions were valuable, the core search loop remained untouched. The way hnswlib traversed neighbors, calculated distances, and managed candidate sets was stagnant. The latest updates change this by restructuring the search loop to optimize how the engine interacts with the CPU's memory hierarchy and reduces overhead.

The Three Primary Targets for Optimization

Memory Access: Reducing inefficient patterns.
Data Loading: Eliminating redundant loads.
Function Calls: Removing indirect call overhead.

1. Compile-time Distance Function Specialization

Previously, the system relied on a runtime function pointer stored within the HNSW index. This pointer was called for every single candidate processed.

The Problem:

Indirect Calls: These prevent the compiler from performing inlining.
Branch Prediction: Frequent indirect calls create overhead for the CPU's branch predictor.

The Solution: Manticore now utilizes C++ templates to resolve distance functions at compile time. The process works as follows:

A switch statement identifies the correct template specialization based on the quantization settings and the chosen metric.
The entire inner loop (traversal $\rightarrow$ computation $\rightarrow$ update) is executed as a single, monolithic function.

This allows the compiler to optimize instruction scheduling, register allocation, and loop unrolling across the distance calculation boundary.

2. 2-Pass Neighbor Processing

In the original HNSW implementation, neighbors were processed in a linear, single-pass fashion: Check Visited $\rightarrow$ Fetch Vector $\rightarrow$ Compute Distance $\rightarrow$ Update Candidates.

Because the vector data was fetched immediately before use, memory prefetch hints had insufficient time to hide latency. The new approach splits this into two distinct phases:

Pass 1: Iterates through all neighbors, filters out those already visited, and populates a small batch array. Crucially, it issues a prefetch hint for the vector data of each unvisited neighbor.
Pass 2: Processes the compact sequential array of IDs. Since the data was prefetched in Pass 1, it is more likely to be residing in the cache, reducing wait times.

Note: For queries without a WHERE clause (unfiltered), a "fast path" is used to skip the per-candidate filter check entirely.

3. Batched Distance Computation

The 2-pass architecture enables efficient batching. Instead of scoring candidates one by one, the engine now scores them in pairs.

Efficiency Gain: When calculating distances for two candidates simultaneously, the query vector is loaded into a register once and reused for both computations. This eliminates redundant loads of the query-side data.

Supported Metrics: The "Batch-2" functions are implemented for:

Inner Product
$L_2$ (Euclidean) distance
Binary-quantized versions of the above

4. AVX-512 Integration

To further accelerate the math, Manticore has introduced AVX-512 support, which doubles the float processing capacity per iteration (16 floats vs. 8 in AVX2).

Technical Implementation

Feature	Implementation Detail
Core Loop	Uses `_mm512_fmadd_ps` (Fused Multiply-Add) for $L_2$ and Inner Product.
Binary Quantization	Utilizes the `VPOPCNTDQ` extension for faster bit-counting.
Deployment	Ships as three variants: `Baseline`, `AVX2`, and `AVX-512`.

At startup, Manticore automatically detects the CPU's instruction set and loads the most optimized library variant.

Benchmark Results

Test Environment

The following configuration was used to isolate the impact of these changes:

Dataset: dbpedia-openai-1M-1536-angular (1M vectors, 1536 dimensions, cosine distance).
Hardware: AMD Ryzen 7 9700X (Zen 5, 8C/16T).
Settings:
- 1-bit binary quantization.
- Oversampling and rescoring: Disabled.
- Early termination: Disabled.

Why Zen 5? Zen 5 was selected because it features native 512-bit datapaths. This avoids the "split-512" execution and the aggressive frequency downclocking often seen in older Intel AVX-512 implementations, ensuring the results reflect algorithmic gains rather than CPU throttling.

Key Findings

The team compared the new AVX2 build against the previous AVX2 build to isolate the algorithmic improvements (2-pass, batching, and templates) from the hardware acceleration (AVX-512).

Increased Throughput: Up to 29% gain at high $k$ .
Concurrency Scaling: Over 20% improvement under multi-threaded load.
Reduced Latency: Better cache utilization via prefetching.

User Profile Sergey Nikolaev