ncd-tree is a Haskell library for similarity-based nearest neighbor search, combining the Normalized Compression Distance (NCD) metric with Vantage-Point (VP) trees. It lets you:
- Define a notion of similarity between documents based on their compressibility with gzip
- Build an index structure (VP-tree) that accelerates k-nearest neighbor searches
- Query the index to find documents most similar to a given query document
The Normalized Compression Distance is a universal distance metric that works for any type of data (text, sequences, etc.) without requiring domain-specific features or models. It is based on the intuition that two objects are similar when one can be described concisely given the other: ideally, by the length of the shortest program that computes one object from the other, which a real compressor such as gzip approximates.
For two documents x and y:

```
NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
```

where C(s) is the compressed size of s using gzip, and xy denotes the concatenation of x and y.
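For concreteness, here is a minimal sketch of this formula in Haskell, using `Codec.Compression.GZip` from the zlib package (the library's own implementation may differ in details such as compression parameters):

```haskell
import           Data.Int (Int64)
import qualified Codec.Compression.GZip as GZip
import qualified Data.ByteString.Lazy as BL

-- Size in bytes of the gzip-compressed input.
compressedSize :: BL.ByteString -> Int64
compressedSize = BL.length . GZip.compress

-- NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
ncd :: BL.ByteString -> BL.ByteString -> Double
ncd x y = fromIntegral (cxy - min cx cy) / fromIntegral (max cx cy)
  where
    cx  = compressedSize x
    cy  = compressedSize y
    cxy = compressedSize (x <> y)  -- C of the concatenation
```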
The idea has resurfaced several times in the research literature: the underlying similarity metric was formalized by Li et al. (2004), applied to clustering by Cilibrasi and Vitányi (2005), and revisited more recently for text classification (Jiang et al. 2023).
VP-trees are a space-partitioning data structure that significantly accelerates nearest neighbor searches by using the triangle inequality to prune unpromising branches during traversal.
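As a rough illustration (a hypothetical sketch; the library's internal representation is not part of its public API), a VP-tree picks a vantage point, computes the median distance mu to it, and splits the remaining elements into "inside" and "outside" halves:

```haskell
-- Minimal VP-tree shape, generic over the element type.
data VPTree a
  = Leaf [a]          -- small buckets are stored flat
  | Node a            -- the vantage point
         Double       -- mu: median distance from the vantage point
         (VPTree a)   -- inside  subtree: elements with d(vp, x) <= mu
         (VPTree a)   -- outside subtree: elements with d(vp, x) >  mu
```

During a k-NN search with current search radius tau, where d is the distance from the query to the vantage point, the triangle inequality allows skipping the inside subtree whenever d - tau > mu, and the outside subtree whenever d + tau < mu.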
```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.NCDTree (Document, VPTree, ResultHeap, mkVPTree, knnSearch)
-- Create documents using string literals (thanks to IsString instance)
docs :: [Document]
docs =
  [ "hello world"
  , "hello universe"
  , "goodbye world"
  , "hello there"
  ]
-- Build the VP-tree index with a leaf threshold of 4
tree :: VPTree
tree = mkVPTree 4 docs
-- Search for the 2 nearest neighbors of "hello"
query :: Document
query = "hello"
results :: ResultHeap
results = knnSearch 2 query tree

-- results is a max-heap containing the closest documents
```
The public API:

- `mkVPTree :: Int -> [Document] -> VPTree`: Build a VP-tree index from a list of documents. The first argument is the leaf threshold (when to stop splitting nodes).
- `knnSearch :: Int -> Document -> VPTree -> ResultHeap`: Find the k nearest neighbors of a query document. Returns a heap of results ordered by distance (smallest first when extracted).
- `ncd :: BL.ByteString -> BL.ByteString -> Double`: Compute the normalized compression distance between two bytestrings directly.
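For example, assuming `BL` follows the usual convention of referring to `Data.ByteString.Lazy`, calling `ncd` directly looks roughly like this:

```haskell
import qualified Data.ByteString.Lazy.Char8 as BLC

import Data.NCDTree (ncd)

main :: IO ()
main = do
  -- Distances fall roughly in [0, 1]; smaller means more similar.
  print (ncd (BLC.pack "hello world") (BLC.pack "hello universe"))
  print (ncd (BLC.pack "hello world") (BLC.pack "goodbye moon"))
```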
- Universal metric: Works with any data that can be compressed, no feature engineering or model training needed
- Approximate search: gzip-based NCD only approximately satisfies the triangle inequality, so VP-tree pruning yields an approximate (but highly accurate in practice) nearest neighbor search
The library includes a comprehensive property-based test suite covering:
- Core distance and tree construction properties
- Similarity search correctness
- Edge cases and tree structure invariants
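For illustration, a property in this style might look like the following (a hypothetical example, not a property copied from the suite):

```haskell
import Test.QuickCheck
import qualified Data.ByteString.Lazy.Char8 as BLC

import Data.NCDTree (ncd)

-- gzip is order-sensitive (C(xy) /= C(yx) in general), so NCD is only
-- approximately symmetric; the tolerance here is deliberately loose.
prop_ncdNearlySymmetric :: String -> String -> Bool
prop_ncdNearlySymmetric s t =
  abs (ncd x y - ncd y x) < 0.1
  where
    x = BLC.pack s
    y = BLC.pack t

main :: IO ()
main = quickCheck prop_ncdNearlySymmetric
```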
Run tests with:

```
stack test
```

References:

- Jiang, Z., et al. (2023). "“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors". Findings of ACL 2023. https://aclanthology.org/2023.findings-acl.426/
- Cleary, J. G., & Trigg, R. H. (1979). "K-W-nearest neighbor classification for statistically dependent data"
- Yianilos, P. N. (1993). "Data structures and algorithms for nearest neighbor search in general metric spaces"
- Li, M., Chen, X., Li, X., Ma, B., & Vitányi, P. M. (2004). "The similarity metric"
- Cilibrasi, R., & Vitányi, P. M. (2005). "Clustering by compression". IEEE Transactions on Information Theory, 51(4), 1440-1454. https://arxiv.org/abs/cs/0312044