This repository contains the artifact materials for our OSDI '26 paper: *Disentangling Graph Dependencies for Efficient Billion-Scale GPU Vector Search* (FlowANN). The artifact is built on top of cuVS/CAGRA and includes the full source code and benchmarks.
- Supported Platform
- File Layout
- Datasets
- Environment Setup
- Build
- Graph Preparation
- Benchmark Quick Start
- Acknowledgments
- Contact
## Supported Platform

The artifact is intended to run in a Linux container with NVIDIA GPU support.
- Provided Docker image: `haorupomelo/cuvs:latest`
- OS in container: Ubuntu 22.04
- CUDA: 12.4
- GCC: 11.4
- CMake: 3.30.5
- GDRCopy on host: 2.5.2
We provide the reference Docker image for artifact evaluation, but the image itself does not bundle GDRCopy. To use the low-latency GPU-to-CPU copy path in FlowANN, please install GDRCopy on the host machine first, make sure the gdrdrv kernel module is loaded, and pass the device into the container.
On the host side, you need:
- a Linux machine with Docker and the NVIDIA Container Toolkit
- at least one visible CUDA GPU
- a working host-side GDRCopy installation, with `/dev/gdrdrv` available (see the quick check below)
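Before launching the container, it can help to verify the host-side GDRCopy state. This is a minimal sketch using standard tools; the `gdrdrv` module and device names follow a standard GDRCopy installation:

```bash
# Host-side check: is the gdrdrv kernel module loaded?
lsmod | grep gdrdrv

# Host-side check: is the device node present?
ls -l /dev/gdrdrv

# If the module is installed but not loaded, loading it usually suffices.
sudo modprobe gdrdrv
```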
## File Layout

This repository contains the complete source code and benchmark needed for evaluation. The main benchmark entry point is `examples/cpp/src/cagra_benchmark.cu`, and the graph grouping construction code is in `grouping/`.
The most relevant files for artifact evaluation are:
```
.
├── build.sh     # Top-level build entry point
├── cpp/         # FlowANN/cuVS core implementation
├── examples/    # Main benchmark and shared CLI
├── grouping/    # Graph grouping components
└── README.md
```
## Datasets

The dataset presets are hard-coded in `examples/cpp/src/cagra_benchmark.cu`. FlowANN primarily targets three billion-scale datasets: `sift_1b`, `deep_1b`, and `spacev_1b`.
The benchmark also supports smaller presets for debugging, parameter tuning, and functional verification, including:
`sift_1M`, `sift_10M`, `sift_50M`, `sift_100M`, `sift_200M`, `sift_500M`, `deep_1M`, `deep_10M`, `deep_100M`, `spacev_10M`, `spacev_100M`
For artifact evaluation, we recommend validating the setup on a smaller preset first (for example `sift_1M`) before launching a full billion-scale run.
The datasets can be obtained from the following sources:

- SIFT1B / BIGANN: TEXMEX corpus
- Deep1B and Deep subset ground truth: matsui528/deep1b_gt
- SpaceV1B / SpaceV subsets: microsoft/SPTAG and Big ANN Benchmarks
For all three primary datasets, FlowANN expects the same three logical inputs:
- raw database vectors: the vectors to be indexed and searched
- query vectors: the query set used for benchmark search
- ground-truth neighbors: exact nearest-neighbor IDs used to compute recall/accuracy
- SIFT family:
  - `sift_1b` and its subset presets (`sift_1M` ... `sift_500M`) use raw vectors in `.bvecs`, queries in `.bvecs`, and ground truth in `.ivecs`.
  - The `.bvecs`, `.fvecs`, and `.ivecs` formats follow the standard TEXMEX layout: each vector starts with a 4-byte dimension header, followed by a `uint8`, `float`, or `int32` payload respectively (a quick header check is sketched after this list).
- Deep family:
  - Database and query files are stored in `.fvecs`.
  - Smaller presets (`deep_1M`, `deep_10M`, `deep_100M`) use `.ivecs` ground truth.
  - The full `deep_1b` preset currently expects an `.ibin` ground-truth file. In this code path, `.ibin` means a file with header `[int32 nvecs][int32 dim]`, followed by a flat `int32` payload.
- SpaceV family:
  - The database is stored as a directory of split binary shards rather than a single monolithic file.
  - The query file is a custom `.bin` file with header `[uint32 query_count][uint32 dimension]`, followed by a contiguous `int8` payload.
  - The ground-truth file is a custom binary file with header `[int32 query_count][int32 topk]`, followed by contiguous `int32` neighbor IDs.
  - The database shard directory must contain files named `vectors_1.bin`, `vectors_2.bin`, ... in numeric order. The loader scans the directory, sorts the shard names numerically, and concatenates the vector payloads.
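Before launching long runs, it can be worth sanity-checking the downloaded files against these layouts. The snippet below is a minimal sketch using standard coreutils; the paths follow the directory conventions shown in the next subsection and may differ in your setup:

```bash
# TEXMEX .bvecs/.fvecs/.ivecs: the first 4 bytes are a little-endian int32
# dimension header. For SIFT base vectors this should print 128.
head -c 4 /dataset/sift_1b/bigann_base.bvecs | od -An -t d4

# Deep1B queries (.fvecs): same header layout; Deep1B vectors are 96-dimensional.
head -c 4 /dataset/deep1b/deep1B_queries.fvecs | od -An -t d4

# SpaceV query file: header is [uint32 query_count][uint32 dimension].
head -c 8 /dataset/spacev1b/query.bin | od -An -t u4
```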
Below are the directory and naming conventions expected by the current code.
**SIFT1B / SIFT subsets**

```
/dataset/sift_1b/
├── bigann_base.bvecs
├── bigann_query.bvecs
├── bigann_learn.bvecs   # used by subset presets such as sift_1M / sift_10M
├── sift_learn.fvecs     # current train_db path for the sift_1b preset
└── gnd/
    ├── idx_1M.ivecs
    ├── idx_10M.ivecs
    ├── idx_50M.ivecs
    ├── idx_100M.ivecs
    ├── idx_200M.ivecs
    ├── idx_500M.ivecs
    └── idx_1000M.ivecs
```
**Deep1B / Deep subsets**

```
/dataset/deep1b/
├── deep1B_queries.fvecs
├── deep1M_groundtruth.ivecs
├── deep10M_groundtruth.ivecs
├── deep100M_groundtruth.ivecs
└── deep1B_groundtruth.ibin
```
For the smaller `deep_1M`, `deep_10M`, and `deep_100M` presets, the benchmark reuses the same base `.fvecs` raw-vector file together with subset-specific ground-truth files.
**SpaceV1B / SpaceV subsets**

```
/dataset/spacev1b/
├── query.bin
├── truth.bin
├── msspacev-10M.ibin
├── msspacev-100M.ibin
└── vectors.bin/
    ├── vectors_1.bin
    ├── vectors_2.bin
    ├── vectors_3.bin
    └── ...
```
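Since the loader sorts the shard names numerically rather than lexically, a quick way to preview the order it will use (a sketch, assuming GNU coreutils):

```bash
# ls -v sorts version-style, so vectors_2.bin comes before vectors_10.bin,
# matching the numeric ordering the loader applies.
ls -v /dataset/spacev1b/vectors.bin/
```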
## Environment Setup

Start an interactive container and mount the repository, the dataset directory, and the host-side GDRCopy device/path:
```bash
docker run --rm -it \
  --gpus all \
  --network=host \
  --ipc=host \
  --device /dev/gdrdrv:/dev/gdrdrv \
  -v <path-to-this-repo>:/workspace \
  -v <path-to-datasets>:/dataset \
  -v /usr/local/gdrcopy:/usr/local/gdrcopy:ro \
  haorupomelo/cuvs:latest bash
```

The extra GDRCopy-related flags above assume that the host installation is available at `/usr/local/gdrcopy`. If your host uses a different installation prefix, replace that path accordingly. The most important passthrough is `--device /dev/gdrdrv:/dev/gdrdrv`.
If your environment requires explicit NVIDIA device mapping beyond --gpus all, add the corresponding --device /dev/nvidia* flags to the same command and keep the GDRCopy passthrough flags above.
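For instance, a variant with explicit device mappings might look like the sketch below. The exact device node names are an assumption and vary by driver setup (some hosts also expose `/dev/nvidia-uvm-tools`); adjust to what exists under `/dev` on your machine:

```bash
docker run --rm -it \
  --gpus all \
  --network=host \
  --ipc=host \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/gdrdrv:/dev/gdrdrv \
  -v <path-to-this-repo>:/workspace \
  -v <path-to-datasets>:/dataset \
  -v /usr/local/gdrcopy:/usr/local/gdrcopy:ro \
  haorupomelo/cuvs:latest bash
```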
After entering the container:

```bash
cd /workspace
```

## Build

Run the following build commands inside the container:
```bash
cd /workspace
./build.sh libcuvs --cache-tool=ccache -n
./build.sh examples --cache-tool=ccache
```

After a successful build, the benchmark binary is:
```
./examples/cpp/build/cagra_benchmark
```
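To confirm the binary was built and linked correctly, a quick smoke test is to print the CLI help (the `-h` flag is documented in the command reference below):

```bash
# Should print the benchmark's CLI help and exit without touching any dataset.
./examples/cpp/build/cagra_benchmark -h
```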
## Graph Preparation

Before running the final FlowANN benchmark on this branch, first switch to the **graph-building** branch and prepare three artifacts (the expected output directories can be created as sketched after this list):

- a raw CAGRA graph produced by the graph-building branch and stored under `/dataset/cagra/`
- a tiered graph produced by the graph tiering pipeline and stored under `/dataset/cagra_subgraph/`
- an entry-point candidate file produced by the graph-building branch's `kmeans.cu` program and stored under `/dataset/init_kmeans/`
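If the output directories do not exist yet, you can create them up front. This is a minimal sketch assuming the default `/dataset` mount used throughout this README:

```bash
# Create the output directories used by the three preparation steps.
mkdir -p /dataset/cagra /dataset/cagra_subgraph /dataset/init_kmeans
```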
The recommended workflow is:
Run the following operations on the graph-building branch:
```bash
cd /workspace
git checkout graph-building
git submodule update --init --recursive
./build.sh libcuvs --cache-tool=ccache -n
./build.sh examples --cache-tool=ccache
```

On that branch, `cagra_benchmark` is used to build the original CAGRA graph, and `kmeans.cu` is used to generate the entry-point candidate file.
On the graph-building branch, run `cagra_benchmark` to serialize the original CAGRA graph to `/dataset/cagra/`. For example, a `sift_1b` raw-graph build can be launched as:

```bash
./examples/cpp/build/cagra_benchmark -d sift_1b -k 10 -s 512 -a single_cta -t 8 -w 1 -g 32 -i 512 -q -c 15000 -P 16
```

With this parameter set, the serialized raw graph is expected at:

```
/dataset/cagra/sift_1b_pq_bits_8_pq_dim_16_graph_degree_32_n_center_15000.bin
```
Use the same dataset name, graph degree, centroid count, and PQ dimension that you plan to use later in the final benchmark.
After the raw graph is ready, run the graph tiering pipeline in `grouping/`. This stage takes the raw CAGRA graph and converts it into the tiered format expected by FlowANN. The easiest entry point is `grouping/run_pipeline.sh`.
For example, a `sift_1b` graph tiering / split-graph generation can be launched as:

```bash
cd grouping
./run_pipeline.sh --help   # see the meaning of each parameter
./run_pipeline.sh --dataset sift_1b \
  --index-file /dataset/cagra/sift_1b_pq_bits_8_pq_dim_16_graph_degree_32_n_center_15000.bin
```

By default, the final tiered graph is written under `/dataset/cagra_subgraph/`.
Still on the graph-building branch, run the `kmeans` program (built from `examples/cpp/src/kmeans.cu`) to produce the entry-point candidate file consumed by the final FlowANN benchmark. For example, a `sift_1b` entry-point candidate file can be generated with:

```bash
./examples/cpp/build/kmeans sift_1b 500 0
```

This command uses the `sift_1b` training data, selects 500 medoid-based candidates, runs on GPU 0, and writes:

```
/dataset/init_kmeans/sift_1b_ncentroids_500.bin
```
The same workflow applies to `deep_1b` and `spacev_1b`. Please switch to the graph-building branch and refer to the scripts in `scripts/` for more details.
After these three artifacts are ready, switch back to the main branch and run the final FlowANN benchmark.
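Before switching back, a quick checklist for the `sift_1b` example used above; the file names follow the conventions printed by the preparation steps, so adjust them if you changed any parameters:

```bash
# Raw CAGRA graph serialized by the graph-building branch
ls -lh /dataset/cagra/sift_1b_pq_bits_8_pq_dim_16_graph_degree_32_n_center_15000.bin

# Tiered graph written by grouping/run_pipeline.sh
ls -lh /dataset/cagra_subgraph/

# Entry-point candidate file written by the kmeans program
ls -lh /dataset/init_kmeans/sift_1b_ncentroids_500.bin
```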
## Benchmark Quick Start

- Launch the Docker container and enter `/workspace`.
- Build the project with the commands in the previous section.
- If the required raw graph, split graph, and entry-point candidate files are not prepared yet, follow Graph Preparation.
- Run the following example `sift_1b` benchmark command:
```bash
./examples/cpp/build/cagra_benchmark \
  -d sift_1b \
  -b 2048 \
  -k 10 \
  -c 15000 \
  -s 150 \
  -a single_cta \
  -t 8 \
  -w 1 \
  -i 270 \
  -r \
  -g 32 \
  -n 24 \
  -q \
  -f 500 \
  -M /dataset/init_kmeans/sift_1b_ncentroids_500.bin \
  -P 16 \
  -Q 78 \
  -G 0
```

In words, this command runs the `sift_1b` preset with a batch size of 2048 queries, final `topk=10`, VPQ enabled, `single_cta` search, graph degree 32, 24-bit split/subgraph encoding, 500 seed nodes loaded from a precomputed k-means file, PQ dimension 16, 78 copy queues, and GPU device 0.
You can also launch the benchmark on all three billion-scale datasets with a single script:

```bash
./scripts/overall.sh
```

The benchmark prints per-batch timing and quality statistics such as search time, rerank time, recall/accuracy, throughput, and the number of completed batches.
The CLI is defined in `examples/cpp/src/common.cuh`. For reproducibility, we recommend setting the important performance knobs explicitly instead of relying on implicit defaults; a sample sweep over one of these knobs is sketched after the reference below.
- `-d, --dataset <name>`: Selects the dataset preset and the corresponding hard-coded file paths. Example: `-d sift_1b`.
- `-b, --batch-size <int>`: Number of queries processed in one batch. Larger values usually improve throughput if memory allows. Example: `-b 2048`.
- `-k, --topk <int>`: Final number of nearest neighbors returned per query. Example: `-k 10`.
- `-a, --algorithm <name>`: Search kernel selection. Supported values are `auto`, `single_cta`, and `multi_cta`. Example: `-a single_cta`.
- `-G, --gpu-id <int>`: CUDA device ID used by the benchmark. Example: `-G 0`.
- `-h, --help`: Prints the CLI help message and exits. Example: `./examples/cpp/build/cagra_benchmark -h`.

- `-s, --search-k <int>`: First-stage candidate count before reranking. In this benchmark it is most relevant when `-q` is enabled. Example: `-s 150`.
- `-t, --team-size <int>`: Number of cooperative threads used for one distance computation. Must be 0 or a power of two up to 32. Example: `-t 8`.
- `-w, --search-width <int>`: Search breadth, i.e. how many entry points are expanded per iteration. Higher values may improve recall at higher cost. Example: `-w 1`.
- `-i, --itopk-size <int>`: Internal candidate buffer size retained during graph search. Larger values usually trade more work for better recall. Example: `-i 270`.
- `-r, --is-cpu-rerank`: Enables CPU reranking for VPQ candidates against the original vectors. Example: `-r`.
- `-f, --num-random-sample <int>`: Number of candidate entry points used to initialize search. When combined with `-M`, this many candidate entry points are loaded from the precomputed file. Example: `-f 500`.
- `-M, --kmeans-centroids-path <path>`: Path to the precomputed centroid/seed file used to initialize search. Example: `-M /dataset/init_kmeans/sift_1b_ncentroids_500.bin`.
- `-Q, --copier-queue-num <int>`: Number of device-side copy queues used by the search kernels. Example: `-Q 78`.
- `-T, --rerank-thread-num <int>`: OpenMP thread count for CPU reranking. The canonical command omits it and therefore uses the built-in default of `128`. Example variant: add `-T 159`.

- `-c, --ncentroids <int>`: Number of coarse centroids used for VPQ/VQ training and graph-build preparation. Example: `-c 15000`.
- `-q, --quantize`: Enables VPQ compression for the CAGRA index. Example: `-q`.
- `-g, --graph-degree <int>`: Out-degree of the built FlowANN graph. Example: `-g 32`.
- `-n, --n-bits <int>`: Bit-width used to encode neighbor IDs in the GPU-cached graph representation. Example: `-n 24`.
- `-P, --pq-dim <int>`: PQ subspace dimension used by VPQ compression. Example: `-P 16`.
- `-L, --limited-memory-gb <int>`: Selects the serialized tiered graph variant tagged with a memory limit in GB. In the current benchmark code, this mainly changes the saved/loaded index filename. Example variant: add `-L 64`.
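Recall/latency trade-offs are usually explored by varying one knob at a time. The loop below is a minimal sketch, not a script shipped with the artifact: it sweeps `-i` / `--itopk-size` around the value used in the canonical command while holding every other flag fixed.

```bash
# Hypothetical sweep over the internal candidate buffer size (-i),
# keeping all other knobs at the canonical sift_1b values above.
for itopk in 150 210 270 330; do
  ./examples/cpp/build/cagra_benchmark \
    -d sift_1b -b 2048 -k 10 -c 15000 -s 150 -a single_cta \
    -t 8 -w 1 -i "${itopk}" -r -g 32 -n 24 -q -f 500 \
    -M /dataset/init_kmeans/sift_1b_ncentroids_500.bin \
    -P 16 -Q 78 -G 0
done
```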
## Acknowledgments

If you want to use FlowANN in your research, please cite our paper:
```bibtex
@inproceedings{flowann,
  author    = {Haoru Zhao and Jingkai He and Jingyao Zeng and Mingkai Dong and Dong Du},
  title     = {Disentangling Graph Dependencies for Efficient Billion-Scale {GPU} Vector Search},
  booktitle = {20th USENIX Symposium on Operating Systems Design and Implementation (OSDI 26)},
  year      = {2026},
  address   = {Seattle, WA},
  url       = {https://www.usenix.org/conference/osdi26/presentation/zhao},
  publisher = {USENIX Association},
  month     = jul
}
```

We thank the cuVS team for their contributions to the cuVS codebase that this artifact builds upon. Their engineering efforts and open-source infrastructure provided an important foundation for this work.
## Contact

If you have any questions about this code, please contact Haoru Zhao (zhaohaoru@sjtu.edu.cn) or Jingkai He (hjk020101@sjtu.edu.cn).