ZenANN is an approximate nearest neighbor (ANN) similarity search library for Python, offered in multiple optimization variants. It provides several indexing methods, including IVF (Inverted File Index) and HNSW (Hierarchical Navigable Small World) for approximate search, plus a KD-Tree for exact search.
Build Variants:
- naive: Baseline version with no optimizations (single-threaded, scalar operations)
- openmp: Multi-threaded parallelization using OpenMP
- simd: SIMD vectorization using AVX2 intrinsics
- full: Complete optimization with OpenMP + SIMD (default)
- cuda: GPU acceleration with CUDA kernels
- profiling: Full version with detailed performance profiling enabled
All variants provide the same API and functional correctness, differing only in performance characteristics.
ZenANN serves as both a production-ready library and a teaching tool for understanding parallel optimization techniques in vector similarity search.
Similarity search is a fundamental problem in many domains, including information retrieval, natural language processing, and recommendation systems. The challenge is to efficiently find the nearest neighbors of a query vector in high-dimensional space.
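The exact search that ANN methods approximate is a brute-force scan over every stored vector. As a point of reference (plain NumPy, independent of the ZenANN API), the sketch below computes the true top-k neighbors by L2 distance at O(N·d) cost per query:

```python
import numpy as np

def brute_force_knn(data, query, k):
    """Exact k-NN: compare the query against every vector, O(N*d) per query."""
    dists = np.linalg.norm(data - query, axis=1)   # L2 distance to all N vectors
    nearest = np.argsort(dists)[:k]                # indices of the k smallest distances
    return nearest, dists[nearest]

data = np.random.rand(100000, 128).astype('float32')
query = np.random.rand(128).astype('float32')
indices, distances = brute_force_knn(data, query, k=10)
```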
Approximate nearest neighbor (ANN) search trades off a small loss in accuracy for significant speed improvements. This implementation provides:
- Correctness: All algorithms produce accurate results across all build variants
- Performance: Multiple optimization levels from baseline to fully optimized
- Flexibility: Choose the appropriate variant for your use case
- Educational value: Compare performance impact of different optimization techniques
Implemented Optimizations:
- Multi-threading with OpenMP (centroid search, list probing, batch queries)
- SIMD vectorization with AVX2 (L2 distance calculations)
- GPU acceleration with CUDA (IVF-Flat search kernels)
- Conditional compilation for easy performance comparison
CUDA Implementation Highlights:
- Full GPU-accelerated IVF-Flat search pipeline
- Hybrid kernel strategy: fast shared-memory kernel for small k, heap-based kernel for large k
- Automatic kernel selection based on shared memory requirements
- Supports k up to 100 with all nprobe values (1-64)
- Achieves 82K+ QPS on SIFT1M dataset (k=10, nprobe=1)
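The kernel choice can be pictured as a shared-memory budget check. The Python sketch below is purely illustrative; the 48 KB budget, block size, and buffer layout are assumptions for exposition, not ZenANN's actual CUDA code:

```python
def choose_topk_kernel(k, threads_per_block=256, smem_budget_bytes=48 * 1024):
    """Hypothetical selection rule: use the fast shared-memory kernel when
    per-block top-k buffers (a float32 distance plus an int32 index per slot)
    fit in shared memory; otherwise fall back to the heap-based kernel."""
    bytes_needed = threads_per_block * k * (4 + 4)
    return "shared_memory" if bytes_needed <= smem_budget_bytes else "heap_based"

print(choose_topk_kernel(k=10))   # small k  -> "shared_memory"
print(choose_topk_kernel(k=100))  # large k  -> "heap_based"
```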
ZenANN is ideal for:
- Students learning about ANN algorithms and parallel programming optimization
- Researchers comparing different optimization techniques and their performance impact
- Educators teaching high-performance computing with real-world examples
- Developers needing a flexible ANN library with controllable optimization levels
- Data Scientists requiring vector similarity search in Python applications
ZenANN is implemented in C++ with a Python API using pybind11.
An abstract base class provides a unified interface for different index types:
- `IndexBase` - Abstract base class
  - Common API: `build()`, `search()`, `train()`
  - Manages vector storage through `VectorStore`
- `KDTreeIndex` - Exact nearest neighbor search
  - Tree-based partitioning for exact search
  - Useful for small datasets or validation
- `IVFFlatIndex` - Inverted file index
  - K-means clustering for coarse quantization
  - Optional OpenMP parallelization for centroid search and list probing
  - Optional SIMD optimization for distance calculations
- `HNSWIndex` - Hierarchical navigable small world graph
  - Built on Faiss's HNSW implementation
  - Graph-based approximate search

Conditional compilation controls optimization features via the `ENABLE_SIMD` and `ENABLE_OPENMP` flags:
- naive variant: Scalar operations, single-threaded
- openmp variant: Multi-threaded with OpenMP pragmas
- simd variant: AVX2 vectorized L2 distance calculations
- full variant: Combines OpenMP + SIMD for maximum performance
- All variants use standard C++ STL containers for data structures
Typical workflow:
- Initialize - Create an index instance (e.g., `IVFFlatIndex`, `HNSWIndex`)
- Build - Add vector data using `build(data)`
  - For IVF: automatically trains K-means clustering
  - For HNSW: constructs the graph structure
- Search - Query for nearest neighbors using `search(query, k)`
  - Returns `SearchResult` with indices and distances
- Batch Search - Process multiple queries with `search_batch(queries, k)`
- Persistence - Save/load an index with `write_index()` / `read_index()`
```python
import zenann
import numpy as np

# Create random data (1000 vectors, 128 dimensions)
data = np.random.rand(1000, 128).astype('float32')
query = np.random.rand(128).astype('float32')

# Option 1: IVF Index
ivf_index = zenann.IVFFlatIndex(dim=128, nlist=10, nprobe=3)
ivf_index.build(data)
results = ivf_index.search(query, k=5)

# Option 2: HNSW Index
hnsw_index = zenann.HNSWIndex(dim=128, M=16, efConstruction=200)
hnsw_index.build(data)
hnsw_index.set_ef_search(50)
results = hnsw_index.search(query, k=5)

# Option 3: KD-Tree (exact search)
kd_index = zenann.KDTreeIndex(dim=128)
kd_index.build(data)
results = kd_index.search(query, k=5)

# Access results
print("Indices:", results.indices)
print("Distances:", results.distances)
```

```python
# Search multiple queries at once
queries = np.random.rand(100, 128).astype('float32')
batch_results = ivf_index.search_batch(queries, k=5)
for i, result in enumerate(batch_results):
    print(f"Query {i}: {result.indices[:3]}")  # Top 3 results
```

```python
# Save index to disk
ivf_index.write_index("my_index.bin")

# Load index from disk
loaded_index = zenann.IVFFlatIndex.read_index("my_index.bin")
```

The CUDA variant provides the same Python API with GPU acceleration:
```python
import zenann
import numpy as np

# Build with CUDA variant first: make cuda
data = np.random.rand(1000000, 128).astype('float32')
queries = np.random.rand(10000, 128).astype('float32')

# Same API, GPU-accelerated backend
ivf_index = zenann.IVFFlatIndex(dim=128, nlist=1024, nprobe=16)
ivf_index.build(data)

# GPU-accelerated search
results = ivf_index.search_batch(queries, k=10)
# CUDA achieves 82K+ QPS on SIFT1M (k=10, nprobe=1)
```

Note: The CUDA variant automatically uses GPU for IVF-Flat search operations. No API changes are required.
ZenANN provides comprehensive benchmarking tools to evaluate performance across different optimization variants.
```bash
# Set library path
export LD_LIBRARY_PATH=extern/faiss/build/install/lib:$LD_LIBRARY_PATH

# Run comprehensive benchmark on SIFT1M
python3 benchmark/comprehensive_bench.py \
    --base data/sift/sift_base.fvecs \
    --query data/sift/sift_query.fvecs \
    --groundtruth data/sift/sift_groundtruth.ivecs \
    --nlist 1024 \
    --nprobe-list "1,2,4,8,16,32,64" \
    --k-list "1,10,100" \
    --index-file sift_index.bin \
    --output-dir benchmark_results

# Generate Recall-QPS trade-off plots
python3 benchmark/plot_tradeoff.py benchmark_results/*.json
```

The benchmark suite measures:
- QPS (Queries Per Second): Throughput for batch queries
- Latency: Mean, p50, p95, p99 response times
- Recall@k: Accuracy for k=1, 10, 100
- Index Build Time: Time to construct the index
- Memory Usage: Bytes per vector
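Recall@k is the fraction of the true k nearest neighbors that the index actually returns, averaged over queries. A minimal sketch of the computation, assuming integer id arrays of shape (num_queries, k) such as those read from the SIFT ground-truth .ivecs file:

```python
import numpy as np

def recall_at_k(found, groundtruth, k):
    """found, groundtruth: int arrays of shape (num_queries, >= k)."""
    hits = sum(len(np.intersect1d(f[:k], g[:k])) for f, g in zip(found, groundtruth))
    return hits / (found.shape[0] * k)
```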
To compare optimization variants, rebuild and rerun the benchmark:

```bash
# Test OpenMP variant
make openmp
python3 benchmark/comprehensive_bench.py ... --output-dir results_openmp

# Test CUDA variant
make cuda
python3 benchmark/comprehensive_bench.py ... --output-dir results_cuda

# Compare results
python3 benchmark/plot_tradeoff.py results_*/*.json
```

See benchmark/BENCHMARK_GUIDE.md for detailed instructions.
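plot_tradeoff.py produces the comparison plots; purely as an illustration of the aggregation it performs, a script along these lines could scan the per-variant result files. The "recall" and "qps" field names below are hypothetical stand-ins, since the actual schema is defined by comprehensive_bench.py:

```python
import glob
import json

# Collect (recall, qps) points per variant directory (hypothetical JSON fields).
points = {}
for path in glob.glob("results_*/*.json"):
    with open(path) as f:
        run = json.load(f)
    variant = path.split("/")[0]  # e.g. "results_openmp"
    points.setdefault(variant, []).append((run["recall"], run["qps"]))

for variant, pts in sorted(points.items()):
    best = max(pts, key=lambda p: p[1])
    print(f"{variant}: best QPS {best[1]:.0f} at recall {best[0]:.3f}")
```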
Base Requirements:
- C++17 compiler (g++, clang++)
- Python >= 3.10
- CMake >= 3.17 (for Faiss)
- Ninja build system
- OpenBLAS
Additional Requirements for CUDA variant:
- CUDA Toolkit >= 10.0
- NVIDIA GPU with compute capability >= 6.0 (Pascal or newer)
```bash
# 1. Clone repository with submodules
git clone --recursive https://github.com/5000user5000/ZenANN.git
cd ZenANN

# 2. Build Faiss dependency
cd extern/faiss
cmake -B build -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_INSTALL_PREFIX="${PWD}/build/install" \
    -DFAISS_ENABLE_PYTHON=OFF \
    -DFAISS_ENABLE_GPU=OFF
cmake --build build
cmake --install build
cd ../..

# 3. Build ZenANN (choose a variant)
make            # Build full version (default, OpenMP + SIMD)
make full       # Same as above
make naive      # Build naive version (no optimizations)
make openmp     # Build OpenMP-only version
make simd       # Build SIMD-only version
make cuda       # Build CUDA version (GPU acceleration)
make profiling  # Build profiling version (Full + timing)

# 4. Run tests
LD_LIBRARY_PATH=extern/faiss/build/install/lib pytest tests/
```

Choose the appropriate variant for your needs:
| Target | Optimizations | Use Case |
|---|---|---|
| `make naive` | None | Baseline reference, debugging |
| `make openmp` | Multi-threading only | Study OpenMP impact |
| `make simd` | SIMD (AVX2) only | Study vectorization impact |
| `make full` | OpenMP + SIMD | Production use (default) |
| `make cuda` | GPU kernels | GPU acceleration, highest QPS |
| `make profiling` | Full + timing | Performance analysis |
CUDA Build Notes:
- Ensure `nvcc` is in your PATH and the CUDA Toolkit is properly installed
- Adjust `CUDA_ARCH` in the Makefile to match your GPU (sm_60 = Pascal, sm_75 = Turing, sm_86 = Ampere)
- The CUDA variant uses pure GPU acceleration (no OpenMP/SIMD)
- The hybrid kernel strategy automatically handles k values up to 100
All unit tests validate functional correctness only (not performance):
```bash
pytest tests/ -v

# Test specific index
pytest tests/test_ivfflat.py -v
pytest tests/test_hnsw.py -v
pytest tests/test_kdtree.py -v
```
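As a sketch of what such a correctness check can look like (a hypothetical test body using only the API shown in the quick-start, and assuming results come back sorted by ascending distance):

```python
import numpy as np
import zenann

def test_kdtree_matches_brute_force():
    rng = np.random.default_rng(0)
    data = rng.random((500, 32), dtype=np.float32)
    query = rng.random(32, dtype=np.float32)

    index = zenann.KDTreeIndex(dim=32)
    index.build(data)
    result = index.search(query, k=5)

    # Brute-force reference: indices of the 5 smallest L2 distances
    expected = np.argsort(np.linalg.norm(data - query, axis=1))[:5]
    assert list(result.indices) == list(expected)
```

All variants provide correct results with different performance profiles: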
| Variant | Performance | Key Features |
|---|---|---|
| naive | Baseline (1x) | Single-threaded, scalar operations |
| openmp | ~10x faster | Multi-threaded parallelization |
| simd | ~3x faster | AVX2 vectorized distance calculations |
| full | ~15-20x faster | Combined OpenMP + SIMD optimizations; highest QPS at k=100 |
| cuda | ~20-25x faster | GPU parallelization; highest QPS at k=1 and k=10 |
Performance factors:
- Actual speedup depends on dataset size, dimensionality, and hardware
- OpenMP scales with CPU core count (tested on 8-core systems)
- SIMD provides consistent 3x speedup for L2 distance calculations
- Combining optimizations often yields multiplicative benefits
- CUDA achieves 82K+ QPS on SIFT1M (k=10, nprobe=1) with NVIDIA GPUs
Optimization breakdown:
- Distance calculations: SIMD provides ~3x speedup (processes 8 floats per instruction with AVX2)
- Centroid search: OpenMP parallelizes across centroids; CUDA uses GPU threads
- List probing: OpenMP parallelizes across probe lists; CUDA uses 2D grid mapping
- Batch queries: OpenMP parallelizes across multiple queries; CUDA processes batch on GPU
- Top-K selection: CUDA uses hybrid strategy (shared memory vs heap-based) for optimal performance
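As a rough way to reproduce these throughput and latency numbers (timing logic only; `ivf_index` is assumed to be built as in the quick-start example):

```python
import time
import numpy as np

def measure_batch_qps(index, queries, k=10):
    """Wall-clock throughput for a single batch call."""
    start = time.perf_counter()
    index.search_batch(queries, k=k)
    return len(queries) / (time.perf_counter() - start)

def measure_latency_ms(index, queries, k=10):
    """Per-query latency distribution in milliseconds."""
    times = []
    for q in queries:
        start = time.perf_counter()
        index.search(q, k=k)
        times.append((time.perf_counter() - start) * 1e3)
    p50, p95, p99 = np.percentile(times, [50, 95, 99])
    return np.mean(times), p50, p95, p99

queries = np.random.rand(1000, 128).astype('float32')
print(f"QPS: {measure_batch_qps(ivf_index, queries):.0f}")
print("mean/p50/p95/p99 ms: %.3f / %.3f / %.3f / %.3f" % measure_latency_ms(ivf_index, queries))
```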
Project structure:

```
ZenANN/
├── include/zenann/              # C++ headers
│   ├── IndexBase.h
│   ├── IVFFlatIndex.h
│   ├── HNSWIndex.h
│   ├── KDTreeIndex.h
│   ├── VectorStore.h
│   ├── SimdUtils.h              # L2 distance with optional SIMD (conditional compilation)
│   └── CudaUtils.h              # CUDA kernel declarations
├── src/                         # C++ implementation (with conditional OpenMP pragmas)
│   ├── IndexBase.cpp
│   ├── IVFFlatIndex.cpp
│   ├── KDTreeIndex.cpp
│   ├── HNSWIndex.cpp
│   ├── CudaUtils_2.cu           # CUDA kernel optimization testing for k=1,10
│   └── CudaUtils.cu             # CUDA kernel implementations
├── python/                      # Python bindings (pybind11)
├── tests/                       # Unit tests (pytest)
├── benchmark/                   # Performance benchmarks
│   ├── comprehensive_bench.py   # Complete benchmark suite
│   ├── ivf-bench.py             # IVF-specific benchmarks
│   ├── hnsw-bench.py            # HNSW-specific benchmarks
│   ├── plot_tradeoff.py         # Visualization tools
│   └── BENCHMARK_GUIDE.md       # Benchmarking documentation
├── doc/                         # Technical documentation
│   ├── cuda.md                  # CUDA implementation guide
│   └── cuda-fix.md              # CUDA k=100 fix documentation
├── extern/faiss/                # Faiss submodule
└── Makefile                     # Build configuration with multiple targets
```
- tests/ - Usage examples in test files
- Makefile - Run `make help` for build variant information
- Build: GNU Make, CMake
- Testing: pytest
- CI/CD: GitHub Actions (tests full variant)
- Version Control: Git
This project is licensed under the MIT License - see the LICENSE file for details.