Skip to content

Substring extraction from Huffman via wavelet tree #42

@Malkovsky

Description

@Malkovsky

With WT constructed via Huffman tree it can be used as an alternative way of storing Huffman encoded files, as a bonus WT allows extraction of arbitrary substring [l, r] in O(r-L) i.e. only scanning related positions.

Please note the construction and serialization/deserialization of SDSL
https://github.com/simongog/sdsl-lite/blob/master/include/sdsl/wt_pc.hpp
And AI review of the process
https://app.kilo.ai/s/605b1323-63b1-430e-be01-69dda2718f71

It is good present the result in a form of binary utility that work in two modes:

  • Compression: given the input file construct WT with Huffman shape and serialize it.
  • Substring extraction: Extract content from [l, r] of the initial file from a WT archieve.
  • Lines extraction: Extract content of lines [l, r] of the initial file from a WT archieve. This is done the same way as previous with addition that we need additional two selects of '\n'

As of now Pixie doesn't have any CLI tools, usage of argparse is encouraged via CMake FetchContent.

Metadata

Metadata

Labels

FeatureNew functionality

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions