ncd-tree is a Haskell library for similarity-based nearest neighbor search, combining the Normalized Compression Distance (NCD) metric with Vantage-Point (VP) trees. It lets you:
- Define a notion of similarity between documents based on their compressibility with gzip
- Build an index structure (VP-tree) that accelerates k-nearest neighbor searches
- Query the index to find documents most similar to a given query document
The Normalized Compression Distance is a universal distance metric that works for any type of data (text, sequences, etc.) without requiring domain-specific features or models. It is based on the intuition that two objects are similar when one can be described concisely given the other: ideally, by the length of the shortest program that computes one object from the other, which a real compressor such as gzip approximates.
For two documents x and y:

```
NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
```

where C(s) is the compressed size of s using gzip, and xy denotes the concatenation of x and y.
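For concreteness, here is a minimal sketch of this formula in Haskell, using `Codec.Compression.GZip` from the zlib package (the library's own implementation may differ in details such as compression parameters):

```haskell
import           Data.Int (Int64)
import qualified Codec.Compression.GZip as GZip
import qualified Data.ByteString.Lazy as BL

-- Size in bytes of the gzip-compressed input.
compressedSize :: BL.ByteString -> Int64
compressedSize = BL.length . GZip.compress

-- NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
ncd :: BL.ByteString -> BL.ByteString -> Double
ncd x y = fromIntegral (cxy - min cx cy) / fromIntegral (max cx cy)
  where
    cx  = compressedSize x
    cy  = compressedSize y
    cxy = compressedSize (x <> y)  -- C of the concatenation
```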
The idea has resurfaced several times in the research literature: the underlying similarity metric was formalized by Li et al. (2004), applied to clustering by Cilibrasi and Vitányi (2005), and revisited more recently for text classification (Jiang et al. 2023).
VP-trees are a space-partitioning data structure that significantly accelerates nearest neighbor searches by using the triangle inequality to prune unpromising branches during traversal.
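As a rough illustration (a hypothetical sketch; the library's internal representation is not part of its public API), a VP-tree picks a vantage point, computes the median distance mu to it, and splits the remaining elements into "inside" and "outside" halves:

```haskell
-- Minimal VP-tree shape, generic over the element type.
data VPTree a
  = Leaf [a]          -- small buckets are stored flat
  | Node a            -- the vantage point
         Double       -- mu: median distance from the vantage point
         (VPTree a)   -- inside  subtree: elements with d(vp, x) <= mu
         (VPTree a)   -- outside subtree: elements with d(vp, x) >  mu
```

During a k-NN search with current search radius tau, where d is the distance from the query to the vantage point, the triangle inequality allows skipping the inside subtree whenever d - tau > mu, and the outside subtree whenever d + tau < mu.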
```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.NCDTree (Document, VPTree, ResultHeap, mkVPTree, knnSearch)
-- Create documents using string literals (thanks to IsString instance)
docs :: [Document]
docs =
  [ "hello world"
  , "hello universe"
  , "goodbye world"
  , "hello there"
  ]
-- Build the VP-tree index with a leaf threshold of 4
tree :: VPTree
tree = mkVPTree 4 docs
-- Search for the 2 nearest neighbors of "hello"
query :: Document
query = "hello"
results :: ResultHeap
results = knnSearch 2 query tree

-- results is a max-heap containing the closest documents
```
The public API:

- `mkVPTree :: Int -> [Document] -> VPTree`: Build a VP-tree index from a list of documents. The first argument is the leaf threshold (when to stop splitting nodes).
- `knnSearch :: Int -> Document -> VPTree -> ResultHeap`: Find the k nearest neighbors of a query document. Returns a heap of results ordered by distance (smallest first when extracted).
- `ncd :: BL.ByteString -> BL.ByteString -> Double`: Compute the normalized compression distance between two bytestrings directly.
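For example, assuming `BL` follows the usual convention of referring to `Data.ByteString.Lazy`, calling `ncd` directly looks roughly like this:

```haskell
import qualified Data.ByteString.Lazy.Char8 as BLC

import Data.NCDTree (ncd)

main :: IO ()
main = do
  -- Distances fall roughly in [0, 1]; smaller means more similar.
  print (ncd (BLC.pack "hello world") (BLC.pack "hello universe"))
  print (ncd (BLC.pack "hello world") (BLC.pack "goodbye moon"))
```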
- Universal metric: Works with any data that can be compressed, no feature engineering or model training needed
- Approximate search: gzip-based NCD only approximately satisfies the triangle inequality, so VP-tree pruning yields an approximate (but highly accurate in practice) nearest neighbor search
The library includes a comprehensive property-based test suite covering:
- Core distance and tree construction properties
- Similarity search correctness
- Edge cases and tree structure invariants
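For illustration, a property in this style might look like the following (a hypothetical example, not a property copied from the suite):

```haskell
import Test.QuickCheck
import qualified Data.ByteString.Lazy.Char8 as BLC

import Data.NCDTree (ncd)

-- gzip is order-sensitive (C(xy) /= C(yx) in general), so NCD is only
-- approximately symmetric; the tolerance here is deliberately loose.
prop_ncdNearlySymmetric :: String -> String -> Bool
prop_ncdNearlySymmetric s t =
  abs (ncd x y - ncd y x) < 0.1
  where
    x = BLC.pack s
    y = BLC.pack t

main :: IO ()
main = quickCheck prop_ncdNearlySymmetric
```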
Run tests with:

```
stack test
```

References:

- Jiang, Z., et al. (2023). "“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors". Findings of ACL 2023. https://aclanthology.org/2023.findings-acl.426/
- Cleary, J. G., & Trigg, R. H. (1979). "K-W-nearest neighbor classification for statistically dependent data"
- Yianilos, P. N. (1993). "Data structures and algorithms for nearest neighbor search in general metric spaces"
- Li, M., Chen, X., Li, X., Ma, B., & Vitányi, P. M. (2004). "The similarity metric"
- Cilibrasi, R., & Vitányi, P. M. (2005). "Clustering by compression". IEEE Transactions on Information Theory, 51(4), 1440-1454. https://arxiv.org/abs/cs/0312044