This repository contains the artifact materials for our OSDI '26 paper: *Disentangling Graph Dependencies for Efficient Billion-Scale GPU Vector Search* (FlowANN). The artifact is built on top of cuVS/CAGRA and includes the full source code and benchmarks.
- Supported Platform
- File Layout
- Datasets
- Environment Setup
- Build
- Graph Preparation
- Benchmark Quick Start
- Acknowledgments
- Contact
## Supported Platform

The artifact is intended to run in a Linux container with NVIDIA GPU support.
- Provided Docker image: `haorupomelo/cuvs:latest`
- OS in container: Ubuntu 22.04
- CUDA: 12.4
- GCC: 11.4
- CMake: 3.30.5
- GDRCopy on host: 2.5.2
We provide the reference Docker image for artifact evaluation, but the image itself does not bundle GDRCopy. To use the low-latency GPU-to-CPU copy path in FlowANN, please install GDRCopy on the host machine first, make sure the gdrdrv kernel module is loaded, and pass the device into the container.
On the host side, you need:
- a Linux machine with Docker and the NVIDIA Container Toolkit
- at least one visible CUDA GPU
- a working host-side GDRCopy installation, with `/dev/gdrdrv` available (see the quick check below)
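Before launching the container, it can help to verify the host-side GDRCopy state. This is a minimal sketch using standard tools; the `gdrdrv` module and device names follow a standard GDRCopy installation:

```bash
# Host-side check: is the gdrdrv kernel module loaded?
lsmod | grep gdrdrv

# Host-side check: is the device node present?
ls -l /dev/gdrdrv

# If the module is installed but not loaded, loading it usually suffices.
sudo modprobe gdrdrv
```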
## File Layout

This repository contains the complete source code and benchmark needed for evaluation. The main benchmark entry point is `examples/cpp/src/cagra_benchmark.cu`, and the graph grouping construction code is in `grouping/`.
The most relevant files for artifact evaluation are:
```
.
├── build.sh     # Top-level build entry point
├── cpp/         # FlowANN/cuVS core implementation
├── examples/    # Main benchmark and shared CLI
├── grouping/    # Graph grouping components
└── README.md
```
## Datasets

The dataset presets are hard-coded in `examples/cpp/src/cagra_benchmark.cu`. FlowANN primarily targets three billion-scale datasets: `sift_1b`, `deep_1b`, and `spacev_1b`.
The benchmark also supports smaller presets for debugging, parameter tuning, and functional verification, including:
`sift_1M`, `sift_10M`, `sift_50M`, `sift_100M`, `sift_200M`, `sift_500M`, `deep_1M`, `deep_10M`, `deep_100M`, `spacev_10M`, `spacev_100M`
For artifact evaluation, we recommend validating the setup on a smaller preset first (for example `sift_1M`) before launching a full billion-scale run.
The datasets can be obtained from the following sources:

- SIFT1B / BIGANN: TEXMEX corpus
- Deep1B and Deep subset ground truth: matsui528/deep1b_gt
- SpaceV1B / SpaceV subsets: microsoft/SPTAG and Big ANN Benchmarks
For all three primary datasets, FlowANN expects the same three logical inputs:
- raw database vectors: the vectors to be indexed and searched
- query vectors: the query set used for benchmark search
- ground-truth neighbors: exact nearest-neighbor IDs used to compute recall/accuracy
- SIFT family:
  - `sift_1b` and its subset presets (`sift_1M` ... `sift_500M`) use raw vectors in `.bvecs`, queries in `.bvecs`, and ground truth in `.ivecs`.
  - The `.bvecs`, `.fvecs`, and `.ivecs` formats follow the standard TEXMEX layout: each vector starts with a 4-byte dimension header, followed by a `uint8`, `float`, or `int32` payload respectively (a quick header check is sketched after this list).
- Deep family:
  - Database and query files are stored in `.fvecs`.
  - Smaller presets (`deep_1M`, `deep_10M`, `deep_100M`) use `.ivecs` ground truth.
  - The full `deep_1b` preset currently expects an `.ibin` ground-truth file. In this code path, `.ibin` means a file with header `[int32 nvecs][int32 dim]`, followed by a flat `int32` payload.
- SpaceV family:
  - The database is stored as a directory of split binary shards rather than a single monolithic file.
  - The query file is a custom `.bin` file with header `[uint32 query_count][uint32 dimension]`, followed by a contiguous `int8` payload.
  - The ground-truth file is a custom binary file with header `[int32 query_count][int32 topk]`, followed by contiguous `int32` neighbor IDs.
  - The database shard directory must contain files named `vectors_1.bin`, `vectors_2.bin`, ... in numeric order. The loader scans the directory, sorts the shard names numerically, and concatenates the vector payloads.
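Before launching long runs, it can be worth sanity-checking the downloaded files against these layouts. The snippet below is a minimal sketch using standard coreutils; the paths follow the directory conventions shown in the next subsection and may differ in your setup:

```bash
# TEXMEX .bvecs/.fvecs/.ivecs: the first 4 bytes are a little-endian int32
# dimension header. For SIFT base vectors this should print 128.
head -c 4 /dataset/sift_1b/bigann_base.bvecs | od -An -t d4

# Deep1B queries (.fvecs): same header layout; Deep1B vectors are 96-dimensional.
head -c 4 /dataset/deep1b/deep1B_queries.fvecs | od -An -t d4

# SpaceV query file: header is [uint32 query_count][uint32 dimension].
head -c 8 /dataset/spacev1b/query.bin | od -An -t u4
```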
Below are the directory and naming conventions expected by the current code.
**SIFT1B / SIFT subsets**

```
/dataset/sift_1b/
├── bigann_base.bvecs
├── bigann_query.bvecs
├── bigann_learn.bvecs   # used by subset presets such as sift_1M / sift_10M
├── sift_learn.fvecs     # current train_db path for the sift_1b preset
└── gnd/
    ├── idx_1M.ivecs
    ├── idx_10M.ivecs
    ├── idx_50M.ivecs
    ├── idx_100M.ivecs
    ├── idx_200M.ivecs
    ├── idx_500M.ivecs
    └── idx_1000M.ivecs
```
**Deep1B / Deep subsets**

```
/dataset/deep1b/
├── deep1B_queries.fvecs
├── deep1M_groundtruth.ivecs
├── deep10M_groundtruth.ivecs
├── deep100M_groundtruth.ivecs
└── deep1B_groundtruth.ibin
```
For the smaller `deep_1M`, `deep_10M`, and `deep_100M` presets, the benchmark reuses the same base `.fvecs` raw-vector file together with subset-specific ground-truth files.
**SpaceV1B / SpaceV subsets**

```
/dataset/spacev1b/
├── query.bin
├── truth.bin
├── msspacev-10M.ibin
├── msspacev-100M.ibin
└── vectors.bin/
    ├── vectors_1.bin
    ├── vectors_2.bin
    ├── vectors_3.bin
    └── ...
```
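Since the loader sorts the shard names numerically rather than lexically, a quick way to preview the order it will use (a sketch, assuming GNU coreutils):

```bash
# ls -v sorts version-style, so vectors_2.bin comes before vectors_10.bin,
# matching the numeric ordering the loader applies.
ls -v /dataset/spacev1b/vectors.bin/
```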
## Environment Setup

Start an interactive container and mount the repository, the dataset directory, and the host-side GDRCopy device/path:
```bash
docker run --rm -it \
  --gpus all \
  --network=host \
  --ipc=host \
  --device /dev/gdrdrv:/dev/gdrdrv \
  -v <path-to-this-repo>:/workspace \
  -v <path-to-datasets>:/dataset \
  -v /usr/local/gdrcopy:/usr/local/gdrcopy:ro \
  haorupomelo/cuvs:latest bash
```

The extra GDRCopy-related flags above assume that the host installation is available at `/usr/local/gdrcopy`. If your host uses a different installation prefix, replace that path accordingly. The most important passthrough is `--device /dev/gdrdrv:/dev/gdrdrv`.
If your environment requires explicit NVIDIA device mapping beyond --gpus all, add the corresponding --device /dev/nvidia* flags to the same command and keep the GDRCopy passthrough flags above.
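For instance, a variant with explicit device mappings might look like the sketch below. The exact device node names are an assumption and vary by driver setup (some hosts also expose `/dev/nvidia-uvm-tools`); adjust to what exists under `/dev` on your machine:

```bash
docker run --rm -it \
  --gpus all \
  --network=host \
  --ipc=host \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/gdrdrv:/dev/gdrdrv \
  -v <path-to-this-repo>:/workspace \
  -v <path-to-datasets>:/dataset \
  -v /usr/local/gdrcopy:/usr/local/gdrcopy:ro \
  haorupomelo/cuvs:latest bash
```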
After entering the container:

```bash
cd /workspace
```

## Build

Run the following build commands inside the container:
```bash
cd /workspace
./build.sh libcuvs --cache-tool=ccache -n
./build.sh examples --cache-tool=ccache
```

After a successful build, the benchmark binary is:
```
./examples/cpp/build/cagra_benchmark
```
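To confirm the binary was built and linked correctly, a quick smoke test is to print the CLI help (the `-h` flag is documented in the command reference below):

```bash
# Should print the benchmark's CLI help and exit without touching any dataset.
./examples/cpp/build/cagra_benchmark -h
```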
## Graph Preparation

Before running the final FlowANN benchmark on this branch, first switch to the **graph-building** branch and prepare three artifacts (the expected output directories can be created as sketched after this list):

- a raw CAGRA graph produced by the graph-building branch and stored under `/dataset/cagra/`
- a tiered graph produced by the graph tiering pipeline and stored under `/dataset/cagra_subgraph/`
- an entry-point candidate file produced by the graph-building branch's `kmeans.cu` program and stored under `/dataset/init_kmeans/`
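If the output directories do not exist yet, you can create them up front. This is a minimal sketch assuming the default `/dataset` mount used throughout this README:

```bash
# Create the output directories used by the three preparation steps.
mkdir -p /dataset/cagra /dataset/cagra_subgraph /dataset/init_kmeans
```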
The recommended workflow is:
Run the following operations on the graph-building branch:
```bash
cd /workspace
git checkout graph-building
git submodule update --init --recursive
./build.sh libcuvs --cache-tool=ccache -n
./build.sh examples --cache-tool=ccache
```

On that branch, `cagra_benchmark` is used to build the original CAGRA graph, and `kmeans.cu` is used to generate the entry-point candidate file.
On the graph-building branch, run `cagra_benchmark` to serialize the original CAGRA graph to `/dataset/cagra/`. For example, a `sift_1b` raw-graph build can be launched as:

```bash
./examples/cpp/build/cagra_benchmark -d sift_1b -k 10 -s 512 -a single_cta -t 8 -w 1 -g 32 -i 512 -q -c 15000 -P 16
```

With this parameter set, the serialized raw graph is expected at:

```
/dataset/cagra/sift_1b_pq_bits_8_pq_dim_16_graph_degree_32_n_center_15000.bin
```
Use the same dataset name, graph degree, centroid count, and PQ dimension that you plan to use later in the final benchmark.
After the raw graph is ready, run the graph tiering pipeline in `grouping/`. This stage takes the raw CAGRA graph and converts it into the tiered format expected by FlowANN. The easiest entry point is `grouping/run_pipeline.sh`.
For example, a `sift_1b` graph tiering / split-graph generation can be launched as:

```bash
cd grouping
./run_pipeline.sh --help   # see the meaning of each parameter
./run_pipeline.sh --dataset sift_1b \
  --index-file /dataset/cagra/sift_1b_pq_bits_8_pq_dim_16_graph_degree_32_n_center_15000.bin
```

By default, the final tiered graph is written under `/dataset/cagra_subgraph/`.
Still on the graph-building branch, run the `kmeans` program (built from `examples/cpp/src/kmeans.cu`) to produce the entry-point candidate file consumed by the final FlowANN benchmark. For example, a `sift_1b` entry-point candidate file can be generated with:

```bash
./examples/cpp/build/kmeans sift_1b 500 0
```

This command uses the `sift_1b` training data, selects 500 medoid-based candidates, runs on GPU 0, and writes:

```
/dataset/init_kmeans/sift_1b_ncentroids_500.bin
```
The same workflow applies to `deep_1b` and `spacev_1b`. Please switch to the graph-building branch and refer to the scripts in `scripts/` for more details.
After these three artifacts are ready, switch back to the main branch and run the final FlowANN benchmark.
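Before switching back, a quick checklist for the `sift_1b` example used above; the file names follow the conventions printed by the preparation steps, so adjust them if you changed any parameters:

```bash
# Raw CAGRA graph serialized by the graph-building branch
ls -lh /dataset/cagra/sift_1b_pq_bits_8_pq_dim_16_graph_degree_32_n_center_15000.bin

# Tiered graph written by grouping/run_pipeline.sh
ls -lh /dataset/cagra_subgraph/

# Entry-point candidate file written by the kmeans program
ls -lh /dataset/init_kmeans/sift_1b_ncentroids_500.bin
```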
## Benchmark Quick Start

- Launch the Docker container and enter `/workspace`.
- Build the project with the commands in the previous section.
- If the required raw graph, split graph, and entry-point candidate files are not prepared yet, follow Graph Preparation.
- Run the following example `sift_1b` benchmark command:
```bash
./examples/cpp/build/cagra_benchmark \
  -d sift_1b \
  -b 2048 \
  -k 10 \
  -c 15000 \
  -s 150 \
  -a single_cta \
  -t 8 \
  -w 1 \
  -i 270 \
  -r \
  -g 32 \
  -n 24 \
  -q \
  -f 500 \
  -M /dataset/init_kmeans/sift_1b_ncentroids_500.bin \
  -P 16 \
  -Q 78 \
  -G 0
```

In words, this command runs the `sift_1b` preset with a batch size of 2048 queries, final `topk=10`, VPQ enabled, `single_cta` search, graph degree 32, 24-bit split/subgraph encoding, 500 seed nodes loaded from a precomputed k-means file, PQ dimension 16, 78 copy queues, and GPU device 0.
You can also launch the benchmark on all three billion-scale datasets with a single script:

```bash
./scripts/overall.sh
```

The benchmark prints per-batch timing and quality statistics such as search time, rerank time, recall/accuracy, throughput, and the number of completed batches.
The CLI is defined in `examples/cpp/src/common.cuh`. For reproducibility, we recommend setting the important performance knobs explicitly instead of relying on implicit defaults; a sample sweep over one of these knobs is sketched after the reference below.
- `-d, --dataset <name>`: Selects the dataset preset and the corresponding hard-coded file paths. Example: `-d sift_1b`.
- `-b, --batch-size <int>`: Number of queries processed in one batch. Larger values usually improve throughput if memory allows. Example: `-b 2048`.
- `-k, --topk <int>`: Final number of nearest neighbors returned per query. Example: `-k 10`.
- `-a, --algorithm <name>`: Search kernel selection. Supported values are `auto`, `single_cta`, and `multi_cta`. Example: `-a single_cta`.
- `-G, --gpu-id <int>`: CUDA device ID used by the benchmark. Example: `-G 0`.
- `-h, --help`: Prints the CLI help message and exits. Example: `./examples/cpp/build/cagra_benchmark -h`.

- `-s, --search-k <int>`: First-stage candidate count before reranking. In this benchmark it is most relevant when `-q` is enabled. Example: `-s 150`.
- `-t, --team-size <int>`: Number of cooperative threads used for one distance computation. Must be 0 or a power of two up to 32. Example: `-t 8`.
- `-w, --search-width <int>`: Search breadth, i.e. how many entry points are expanded per iteration. Higher values may improve recall at higher cost. Example: `-w 1`.
- `-i, --itopk-size <int>`: Internal candidate buffer size retained during graph search. Larger values usually trade more work for better recall. Example: `-i 270`.
- `-r, --is-cpu-rerank`: Enables CPU reranking for VPQ candidates against the original vectors. Example: `-r`.
- `-f, --num-random-sample <int>`: Number of candidate entry points used to initialize search. When combined with `-M`, this many candidate entry points are loaded from the precomputed file. Example: `-f 500`.
- `-M, --kmeans-centroids-path <path>`: Path to the precomputed centroid/seed file used to initialize search. Example: `-M /dataset/init_kmeans/sift_1b_ncentroids_500.bin`.
- `-Q, --copier-queue-num <int>`: Number of device-side copy queues used by the search kernels. Example: `-Q 78`.
- `-T, --rerank-thread-num <int>`: OpenMP thread count for CPU reranking. The canonical command omits it and therefore uses the built-in default of `128`. Example variant: add `-T 159`.

- `-c, --ncentroids <int>`: Number of coarse centroids used for VPQ/VQ training and graph-build preparation. Example: `-c 15000`.
- `-q, --quantize`: Enables VPQ compression for the CAGRA index. Example: `-q`.
- `-g, --graph-degree <int>`: Out-degree of the built FlowANN graph. Example: `-g 32`.
- `-n, --n-bits <int>`: Bit-width used to encode neighbor IDs in the GPU-cached graph representation. Example: `-n 24`.
- `-P, --pq-dim <int>`: PQ subspace dimension used by VPQ compression. Example: `-P 16`.
- `-L, --limited-memory-gb <int>`: Selects the serialized tiered graph variant tagged with a memory limit in GB. In the current benchmark code, this mainly changes the saved/loaded index filename. Example variant: add `-L 64`.
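Recall/latency trade-offs are usually explored by varying one knob at a time. The loop below is a minimal sketch, not a script shipped with the artifact: it sweeps `-i` / `--itopk-size` around the value used in the canonical command while holding every other flag fixed.

```bash
# Hypothetical sweep over the internal candidate buffer size (-i),
# keeping all other knobs at the canonical sift_1b values above.
for itopk in 150 210 270 330; do
  ./examples/cpp/build/cagra_benchmark \
    -d sift_1b -b 2048 -k 10 -c 15000 -s 150 -a single_cta \
    -t 8 -w 1 -i "${itopk}" -r -g 32 -n 24 -q -f 500 \
    -M /dataset/init_kmeans/sift_1b_ncentroids_500.bin \
    -P 16 -Q 78 -G 0
done
```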
## Acknowledgments

If you want to use FlowANN in your research, please cite our paper:
```bibtex
@inproceedings{flowann,
  author    = {Haoru Zhao and Jingkai He and Jingyao Zeng and Mingkai Dong and Dong Du},
  title     = {Disentangling Graph Dependencies for Efficient Billion-Scale {GPU} Vector Search},
  booktitle = {20th USENIX Symposium on Operating Systems Design and Implementation (OSDI 26)},
  year      = {2026},
  address   = {Seattle, WA},
  url       = {https://www.usenix.org/conference/osdi26/presentation/zhao},
  publisher = {USENIX Association},
  month     = jul
}
```

We thank the cuVS team for their contributions to the cuVS codebase that this artifact builds upon. Their engineering efforts and open-source infrastructure provided an important foundation for this work.
## Contact

If you have any questions about this code, please contact Haoru Zhao (zhaohaoru@sjtu.edu.cn) or Jingkai He (hjk020101@sjtu.edu.cn).