Hawkynt/CompressionWorkbench

CompressionWorkbench


A fully clean-room C# implementation of compression primitives, archive file formats, and analysis tools. Every algorithm is implemented from scratch on the project's own primitives; no external compression source code is used.


Vision

CompressionWorkbench exists to answer two kinds of questions about compressed and packaged data, entirely in managed .NET with no native dependency on zlib, liblzma, libarchive, or any other third-party compression library:

  1. "What is this, and what is inside?" — given an arbitrary blob of bytes, identify the format, slice it into its logical payloads, and recover the original data.
  2. "How does the algorithm work, and how does it compare?" — provide a reference implementation of every major compression primitive, from LZ77 through arithmetic coding to modern neural / context-mixing compressors, so the algorithms can be read, benchmarked, and taught from a single codebase.
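The first question, identifying a blob, usually starts with a signature check. The following is a minimal Python sketch of that idea, not the project's C# detection code: the `SIGNATURES` table and the `identify` function are hypothetical names chosen for illustration, and only a handful of well-known magic numbers are listed.

```python
# Hypothetical signature table: (offset, magic bytes, format name).
# These are the publicly documented magic numbers of each format.
SIGNATURES = [
    (0, b"\x1f\x8b", "gzip"),
    (0, b"PK\x03\x04", "zip"),
    (0, b"BZh", "bzip2"),
    (0, b"\xfd7zXZ\x00", "xz"),
    (0, b"7z\xbc\xaf\x27\x1c", "7z"),
    (0, b"\x28\xb5\x2f\xfd", "zstd"),
    (257, b"ustar", "tar"),
]


def identify(blob: bytes) -> list:
    """Return the names of all formats whose magic matches the blob."""
    return [
        name
        for offset, magic, name in SIGNATURES
        if blob[offset:offset + len(magic)] == magic
    ]


print(identify(b"\x1f\x8b\x08\x00"))
print(identify(b"plain text"))
```

A signature match is only a hint; a real analyser would confirm it by trial decompression or by parsing the structure that follows the magic.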

Concretely that means:

  • Clean-room, from-scratch C#. Every primitive (bit I/O, Huffman, range coding, LZ family, BWT/MTF, PPM, context mixing, modern ANS/FSE) is written from the original specification or from clean-room reverse engineering of the reference algorithm. No line of native compression code is linked in or ported.
  • Every common container, read and written wherever a spec exists to write against honestly. When the writer cannot match an external spec (proprietary element streams, missing on-disk structures), that is documented in the support tables instead of shipping a silent toy.
  • Every multi-payload container treated as an archive. The distinction that matters to a user is "can I list and extract the N things inside?", not "is this called ZIP". That makes PE resource DLLs, multi-page TIFFs, font collections, multi-frame GIFs, PSD layer stacks, and MPEG transport streams all first-class archives — see Archives and Pseudo-archives below.
  • Analysis as a first-class surface. Identification, entropy mapping, trial decompression, chain reconstruction, signature scanning, and cross-validation against external tools are exposed through a library (Compression.Analysis), a CLI (cwb), and a UI visualiser — not as an afterthought.
  • Benchmarking at the primitive level. The benchmark compares the building blocks — raw algorithms without container overhead — so ratio/speed numbers reflect the algorithm, not the envelope.
  • One library, many surfaces. CLI archiver (cwb), UI browser + analyser, Explorer shell integration, self-extracting stubs (Compression.Sfx.*), and a library any .NET consumer can link.
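Entropy mapping, one of the analysis features listed above, can be sketched as a per-block Shannon entropy scan. This is an illustrative Python sketch under assumptions: the function names and the 4096-byte block size are hypothetical and are not taken from Compression.Analysis.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a byte block in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_map(data: bytes, block_size: int = 4096) -> list:
    """Entropy of each consecutive block; block_size is an assumed default."""
    return [
        shannon_entropy(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


# A run of zeros followed by every byte value equally often.
sample = b"\x00" * 4096 + bytes(range(256)) * 16
print(entropy_map(sample))
```

Blocks near 8 bits per byte typically indicate compressed or encrypted data, while low values suggest padding, text, or sparse structures.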

Archives and Pseudo-archives

Any format that packages N discrete, separately-addressable payloads is an archive.

A format earns archive treatment, meaning the IArchiveFormatOperations contract (List / Extract / optional Create), whenever all of the following hold:

  1. Its binary layout contains a directory or index of named or indexed entries.
  2. Each entry can be extracted as an independent blob.
  3. A consumer might plausibly want one entry without the others.

This is true regardless of whether the entries happen to be files, images, pages, frames, tracks, layers, tables, fonts, strings, or other domain objects. The contents of an extracted blob remain domain-specific (a TIFF page is still a TIFF, an RT_ICON resource is still an icon), but that is a property of the payload, not of the container.

Real archives

Formats in the canonical archive sense — ZIP, TAR, 7z, RAR, CAB, CPIO, and their relatives. They were designed as "a bag of files with a directory". These are covered in the Archive Formats, ZIP-Derived Containers, OLE2 Compound File Variants, Compound Formats, and Modern Packaging tables.

Pseudo-archives

Formats that are archives by structure but have never been presented that way in ordinary file managers. CompressionWorkbench slices each one along its natural payload boundary and exposes the same List / Extract surface as ZIP.

| Container | Entries become | Where shipped |
|---|---|---|
| PE resource DLLs/EXEs | one entry per resource: RT_GROUP_ICON.ico, RT_BITMAP.bmp, RT_MANIFEST.xml, RT_STRING.txt, RT_VERSION.rcv, raw RT_RCDATA | FileFormat.PeResources, FileFormat.ResourceDll |
| ICO / CUR / ANI | one entry per ICONDIRENTRY.png / .bmp (cursor adds hotspot) | FileFormat.Ico, FileFormat.PngCrushAdapters.Ani |
| Multi-page TIFF / BigTIFF | one single-page .tif per IFD | FileFormat.PngCrushAdapters.Tiff / BigTiff |
| Multi-frame GIF / MNG / FLI / DCX | one .gif / .png per frame | FileFormat.Gif, PngCrushAdapters.{Mng,Fli,Dcx} |
| Animated PNG (APNG) | one .png per frame with dispose/blend applied against previous frames | FileFormat.PngCrushAdapters.Apng |
| Icon containers (ICNS, MPO) | Apple icon suite / stereoscopic JPEG pair | FileFormat.PngCrushAdapters.{Icns,Mpo} |
| Font collections (TTC / OTC) | one .ttf / .otf per member font | FileFormat.FontCollection |
| Single-font (TTF / OTF) | per-glyph entries (cmap + glyf slicing; CFF/OpenType passes through) | FileFormat.FontCollection.Ttf |
| Gettext MO / PO | one .txt per msgid/msgstr pair | FileFormat.Gettext |
| WAV / FLAC / MP3 | full file + per-channel WAV + ID3v2/RIFF metadata + APIC cover art | FileFormat.Wav, FileFormat.Flac, FileFormat.Mp3 |
| Ogg | per-logical-stream packets + Vorbis/Opus comments | FileFormat.Ogg |
| MP4 / MOV / MKV / WebM | demuxed tracks (H.264 → Annex-B), attachments, chapters | FileFormat.Mp4, FileFormat.Matroska |
| MPEG Transport Stream | per-PID elementary streams (video/audio/data) | FileFormat.MpegTs |
| Blu-ray PGS (SUP) | subtitle segments grouped by epoch | FileFormat.Sup |
| VobSub (DVD) | .idx metadata + per-entry slices of the sibling .sub PES stream | FileFormat.VobSub |
| HLS M3U8 | segment list with per-variant metadata | FileFormat.M3u8 |
| U-Boot uImage, FDT/DTB, UEFI FV | firmware header metadata + decompressed payload or per-FFS/property entries | FileFormat.UImage, FileFormat.Dtb, FileFormat.UefiFv |
| Device executable packers | the packer's metadata.ini (detection evidence) + packed_payload.bin (or in-process decompressed body for UPX) | FileFormat.ExePackers |
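To make the pseudo-archive idea concrete, here is a minimal Python sketch that treats an ICO file as an archive by walking its ICONDIR header and 16-byte ICONDIRENTRY records. It is an illustration only: `IcoEntry`, `list_entries`, and `extract` are hypothetical names, not the project's C# API, and the sketch returns each payload as stored, without the .png / .bmp wrapping the table describes.

```python
import struct
from dataclasses import dataclass


@dataclass
class IcoEntry:
    index: int
    width: int
    height: int
    offset: int
    size: int


def list_entries(ico: bytes) -> list:
    """List the images in an ICO/CUR file from its directory."""
    # ICONDIR: reserved (0), type (1 = icon, 2 = cursor), image count.
    reserved, kind, count = struct.unpack_from("<HHH", ico, 0)
    if reserved != 0 or kind not in (1, 2):
        raise ValueError("not an ICO/CUR file")
    entries = []
    for i in range(count):
        # ICONDIRENTRY: width, height, colours, reserved,
        # planes (or hotspot x), bpp (or hotspot y), byte size, offset.
        w, h, _c, _r, _p, _b, size, offset = struct.unpack_from(
            "<BBBBHHII", ico, 6 + 16 * i
        )
        # A stored width or height of 0 means 256 pixels.
        entries.append(IcoEntry(i, w or 256, h or 256, offset, size))
    return entries


def extract(ico: bytes, entry: IcoEntry) -> bytes:
    """Return one image payload as an independent blob."""
    end = entry.offset + entry.size
    if end > len(ico):
        raise ValueError("entry extends past end of file")
    return ico[entry.offset:end]
```

The three archive criteria map directly onto this format: the ICONDIRENTRY table is the index, each payload is addressable by offset and size, and a consumer may well want only the 256-pixel image.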

Honest failure

Formats that cannot produce multiple addressable entries stay in FormatCategory.Stream rather than falsely advertising themselves as archives. IArchiveFormatOperations.List is free to return a single "whole payload" entry for stream-style containers (and does, for formats like PAQ8 or the audio-stream-as-archive descriptors), but a format that would have to fake an index has no business claiming SupportsMultipleEntries.


Solution Structure

The solution uses the .slnx XML format. All projects sit at the repository root — no src/ or tests/ directories. Solution folders in the IDE group projects logically; the filesystem stays flat so git log --follow works on every file.

CompressionWorkbench.slnx
|
+-- Compression.Core                 Primitives, building blocks, SIMD, partition parsers
+-- Compression.Registry             Interfaces (IFormatDescriptor, IBuildingBlock) + registries
+-- Compression.Registry.Generator   Roslyn source generator for auto-discovery
+-- Compression.Lib                  Umbrella library: detection, archive ops, SFX hosting
+-- Compression.Analysis             Binary analysis engine (signatures, entropy, trial decomp)
+-- Compression.CLI                  `cwb` command-line tool (System.CommandLine v3)
+-- Compression.UI                   WPF browser + analyser + heatmap + wizard
+-- Compression.Shell                Explorer context-menu integration
+-- Compression.Sfx.Cli              Self-extracting archive stub (console)
+-- Compression.Sfx.Ui               Self-extracting archive stub (GUI)
+-- Compression.Tests                NUnit test project (tests)
|
+-- FileFormat.*                     One project per archive / stream / pseudo-archive / packer
+-- FileSystem.*                     One project per filesystem image format
+-- Codec.*                          Standalone audio codecs (PCM/FLAC/A-law/µ-law/GSM/ADPCM/MIDI/MP3/Vorbis/Opus/AAC)

Adding a new format is a three-step process:

  1. Create a FileFormat.<Name>/ or FileSystem.<Name>/ project with a class implementing IFormatDescriptor (and IStreamFormatOperations or IArchiveFormatOperations as appropriate).
  2. Add a <ProjectReference> from Compression.Lib.csproj.
  3. Add the project to CompressionWorkbench.slnx.

The Roslyn source generator (Compression.Registry.Generator) discovers every implementation at compile time and emits the registration table. No reflection, no hand-maintained switch statements, no init hooks.

Technology stack

| Concern | Choice |
|---|---|
| Language | C# 14 / .NET 10 |
| Solution | .slnx (XML solution format) |
| Testing | NUnit |
| GUI | WPF |
| CLI | System.CommandLine v3 |
| Discovery | Roslyn source generator (zero-reflection format/block registration) |
| Bundling | Costura.Fody single-file embedding for CLI/UI/SFX |

Supported Formats

Capability scale

| Level | Meaning |
|---|---|
| Unsupported | No descriptor exists. |
| Read-only | Can list and extract; no creation. |
| WORM | Write-Once-Read-Many: can produce a fresh archive/image, cannot modify one in place. |
| R/W | Can also add/replace/remove entries in an existing archive in place (no formats yet). |

In the tables below, Yes in a Write column means WORM (or better) and - means Read-only. The Reference column links to the authoritative spec or reverse-engineering document the implementation was validated against.

Building blocks

The raw algorithm primitives registered via IBuildingBlock. They operate on ReadOnlySpan<byte> without any container framing — this is the surface the benchmark tool compares. Building blocks live in Compression.Core; they are never wrapped as FileFormat.* projects.

| Id | Name | Family | Description | Reference |
|---|---|---|---|---|
| BB_Deflate | DEFLATE | Dictionary | LZ77 + Huffman, the algorithm inside gzip/zip/png | RFC 1951 |
| BB_Deflate64 | Deflate64 | Dictionary | Enhanced DEFLATE with 64 KB window and extended codes | MS-ZIP spec |
| BB_Lz77 | LZ77 | Dictionary | Sliding-window dictionary with distance/length tokens | Ziv & Lempel 1977 paper |
| BB_Lz78 | LZ78 | Dictionary | Builds phrase dictionary from input, predecessor to LZW | Ziv & Lempel 1978 paper |
| BB_Lzw | LZW | Dictionary | Lempel-Ziv-Welch dictionary coding, used in GIF and Unix compress | Welch 1984 paper |
| BB_Lzo | LZO1X | Dictionary | Extremely fast dictionary compression optimised for decompression speed | oberhumer.com |
| BB_Lzss | LZSS | Dictionary | LZ77 variant with flag-bit encoding | Storer & Szymanski 1982 |
| BB_Lz4 | LZ4 | Dictionary | Extremely fast LZ77-family block compression | LZ4 block format |
| BB_Snappy | Snappy | Dictionary | Fast LZ77-family compression (Google) | Snappy format |
| BB_Brotli | Brotli | Dictionary | Modern LZ77 + Huffman with static dictionary (Google) | RFC 7932 |
| BB_Lzma | LZMA | Dictionary | Lempel-Ziv-Markov chain with range coding | 7-Zip LZMA SDK |
| BB_Lzx | LZX | Dictionary | LZ77 + Huffman used in CAB/CHM/WIM | MS-PATCH LZX spec |
| BB_Xpress | XPRESS Huffman | Dictionary | Windows XPRESS (NTFS/WIM/Hyper-V) | MS-XCA spec |
| BB_Lzh | LZH (LH5) | Dictionary | Lempel-Ziv with adaptive Huffman, used in LHA | LZH format doc |
| BB_Arj | ARJ | Dictionary | Modified LZ77 + Huffman used in ARJ archives | ARJ technical info |
| BB_Lzms | LZMS | Dictionary | LZ + Markov + Shannon with delta matching (Windows WIM/ESD) | MS-XCA LZMS |
| BB_Lzp | LZP | Dictionary | Lempel-Ziv Prediction, context-based match prediction | Bloom 1996 |
| BB_Ace | ACE | Dictionary | LZ77 + Huffman from ACE archive format | unace-nonfree |
| BB_Rar | RAR5 | Dictionary | LZ + Huffman + PPM from RAR5 | rarlab technote |
| BB_Sqx | SQX | Dictionary | LZ + Huffman from the SQX archive format | SqxFormat notes |
| BB_ROLZ | ROLZ | Dictionary | Reduced-Offset LZ with context-based match tables | encode.su discussion |
| BB_PPM | PPM | Context Mixing | Prediction by Partial Matching, order-2 context modelling | Cleary & Witten 1984 |
| BB_CTW | CTW | Context Mixing | Context Tree Weighting, optimal universal compression | Willems 1995 paper |
| BB_LZHAM | LZHAM | Dictionary | LZ77 + Huffman, inspired by Valve's LZHAM codec | LZHAM repo |
| BB_Lzs | LZS | Dictionary | Stac LZS (7/11-bit offset LZSS for networking) | RFC 1967 / RFC 2395 |
| BB_Lzwl | LZWL | Dictionary | LZW with variable-length initial alphabet from digram analysis | LZWL paper |
| BB_RePair | Re-Pair | Dictionary | Recursive Pairing, offline grammar-based compression | Larsson & Moffat 1999 |
| BB_842 | 842 | Dictionary | IBM 842 hardware compression with 2/4/8-byte template matching | Linux crypto/842* |
| BB_Huffman | Huffman | Entropy | Optimal prefix-free entropy coding using symbol frequencies | Huffman 1952 |
| BB_Arithmetic | Arithmetic | Entropy | Order-0 arithmetic coding with frequency table | Witten, Neal & Cleary 1987 |
| BB_ShannonFano | Shannon-Fano | Entropy | Historical predecessor to Huffman, recursive frequency splitting | Shannon 1948 |
| BB_Golomb | Golomb/Rice | Entropy | Optimal coding for geometric distributions | Golomb 1966 |
| BB_Fibonacci | Fibonacci coding | Entropy | Universal code using Zeckendorf representation with 11 terminators | Apostolico & Fraenkel 1987 |
| BB_FSE | FSE/tANS | Entropy | Table-based Asymmetric Numeral Systems, used in Zstd | Duda 2013 paper / Collet's blog |
| BB_BPE | Byte Pair Encoding | Entropy | Iterative most-frequent pair replacement | Gage 1994 |
| BB_RangeCoding | Range coding | Entropy | Byte-oriented arithmetic coding variant with carryless normalisation | Martin 1979 |
| BB_rANS | rANS | Entropy | Range ANS coder, used in AV1 and LZFSE | Duda 2013 paper |
| BB_ExpGolomb | Exp-Golomb | Entropy | Exponential Golomb, used in H.264/H.265 | Teuhola 1978 |
| BB_Unary | Unary | Entropy | Simplest universal code: N ones followed by a zero | |
| BB_EliasGamma | Elias gamma | Entropy | Universal code using unary length prefix | Elias 1975 |
| BB_EliasDelta | Elias delta | Entropy | Gamma-codes the bit length | Elias 1975 |
| BB_Levenshtein | Levenshtein coding | Entropy | Self-delimiting universal code with recursive length prefixing | Levenshtein 1968 |
| BB_Tunstall | Tunstall coding | Entropy | Variable-to-fixed code, dual of Huffman | Tunstall 1967 (PhD thesis) |
| BB_Dmc | DMC | Entropy | Dynamic Markov Compression, bit-level FSM with state cloning | Cormack & Horspool 1987 |
| BB_Bwt | BWT | Transform | Burrows-Wheeler Transform, reorders bytes for better compression | Burrows & Wheeler 1994 |
| BB_Mtf | MTF | Transform | Move-to-Front Transform | Bentley et al. 1986 |
| BB_Delta | Delta | Transform | Delta filter, stores differences between consecutive bytes | |
| BB_Rle | RLE | Transform | Run-Length Encoding | |
| BB_Dpcm | DPCM | Transform | Differential PCM, stores sample-to-sample differences | |
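As a worked example of one of the smaller entropy blocks, the Elias gamma code ("unary length prefix") can be sketched in a few lines of Python. This is a textbook illustration, not the project's BB_EliasGamma implementation: it works on strings of '0'/'1' characters rather than packed bits, and it uses the common convention of a zero-run prefix (the polarity of the unary prefix is a convention and may differ in the C# code).

```python
def elias_gamma_encode(n: int) -> str:
    """Encode a positive integer as an Elias gamma bit string."""
    if n < 1:
        raise ValueError("Elias gamma encodes integers >= 1 only")
    binary = bin(n)[2:]  # binary digits, leading bit is always 1
    # Prefix: one zero per binary digit after the first.
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits: str) -> list:
    """Decode a concatenation of Elias gamma codewords."""
    values, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while pos < len(bits) and bits[pos] == "0":
            zeros += 1
            pos += 1
        if pos + zeros + 1 > len(bits):
            raise ValueError("truncated Elias gamma codeword")
        values.append(int(bits[pos:pos + zeros + 1], 2))
        pos += zeros + 1
    return values


stream = "".join(elias_gamma_encode(n) for n in [1, 5, 9, 17])
print(stream, elias_gamma_decode(stream))
```

Because the zero-run announces how many bits follow, codewords can be concatenated without separators, which is what makes the code self-delimiting.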

The benchmark command (cwb benchmark <file> or the UI's Benchmark Tool) runs every building block over the supplied data, records ratio + compress/decompress times, and ranks the results.
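The shape of such a benchmark loop can be sketched as follows. This Python sketch is an assumption-laden stand-in: it uses the standard-library zlib, bz2, and lzma codecs, which include container framing, whereas the README's benchmark runs the project's own unframed primitives. The `CODECS` table and `benchmark` function are hypothetical names.

```python
import bz2
import lzma
import time
import zlib

# Stand-in codecs from the Python standard library (assumption: these
# only illustrate the loop; they are not the project's building blocks).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data: bytes) -> list:
    """Run every codec over data and rank the results by ratio."""
    if not data:
        raise ValueError("need non-empty input to compute a ratio")
    results = []
    for name, (compress, decompress) in CODECS.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        if restored != data:
            raise RuntimeError(f"{name} failed to round-trip")
        results.append({
            "name": name,
            "ratio": len(packed) / len(data),  # smaller is better
            "compress_s": t1 - t0,
            "decompress_s": t2 - t1,
        })
    return sorted(results, key=lambda r: r["ratio"])


for row in benchmark(b"compression workbench " * 2000):
    print(row)
```

Verifying the round trip inside the loop keeps a broken codec from winning the ranking with a bogus ratio.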

Archive formats

| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| ZIP | .zip | Yes | Yes | APPNOTE.TXT | Store, Deflate, Deflate64, Shrink, Reduce, Implode, BZip2, LZMA, PPMd, Zstd, AES |
| RAR | .rar | Yes | Yes (v4/v5) | rarlab technote | v1-v5 decoders, solid, multi-volume, encryption, recovery |
| 7z | .7z | Yes | Yes | 7-Zip format | LZMA/LZMA2, Deflate, BZip2, PPMd, BCJ/BCJ2, AES-256, multi-volume |
| TAR | .tar | Yes | Yes | POSIX ustar | POSIX/GNU/PAX, multi-volume |
| CAB | .cab | Yes | Yes | MS-CAB | MSZIP, LZX, Quantum |
| LZH/LHA | .lzh,.lha | Yes | Yes | LHA archive format | lh0-lh7, lzs, lh1-lh3 (adaptive Huffman), pm0-pm2 |
| ARJ | .arj | Yes | Yes | ARJ technical | Methods 0-4, garble encryption |
| ARC | .arc | Yes | Yes | ARC format | Methods 0-9 (RLE, LZW, Squeeze, Huffman) |
| ZOO | .zoo | Yes | Yes | zoo format | LZW, LZH |
| ACE | .ace | Yes | Yes | ACE unofficial spec | ACE 1.0/2.0, solid, sound/picture filters, Blowfish, recovery |
| SQX | .sqx | Yes | Yes | SQX disassembly | LZH, multimedia, audio, solid, AES-128, recovery |
| CPIO | .cpio | Yes | Yes | cpio(5) | Binary, odc, newc, CRC |
| AR | .ar | Yes | Yes | ar(5) | Unix archive |
| WIM | .wim | Yes | Yes | Imagex WIM format | LZX, XPRESS |
| RPM | .rpm | Yes | Yes | RPM spec | CPIO payload |
| DEB | .deb | Yes | Yes | deb(5) | AR+TAR with gz/xz/zst/bz2 |
| Shar | .shar | Yes | Yes | GNU sharutils | Shell archive |
| PAK | .pak | Yes | Yes | PAK spec | ARC-compatible |
| HA | .ha | Yes | Yes | HA specification | HSC/ASC arithmetic coding |
| ZPAQ | .zpaq | Yes | Yes | ZPAQ spec PDF | Context mixing, journaling |
| StuffIt | .sit | Yes | Yes | libxad sit.c | Multiple methods |
| StuffIt X | .sitx | Yes | Yes | XADMaster StuffItX | Detection-only; WORM emits a valid StuffIt! envelope (proprietary element-stream writer not implemented) |
| SquashFS | .sqfs | Yes | Yes | SquashFS 4.0 spec | Filesystem image |
| CramFS | .cramfs | Yes | Yes | Linux fs/cramfs/ | Filesystem image |
| NSIS | .exe | Yes | Yes | NSIS wiki | Installer extraction + WORM emits overlay-only data (no PE stub) |
| Inno Setup | .exe | Yes | Yes | innounp | Installer extraction + WORM emits signature header (no PE stub) |
| DMS | .dms | Yes | Yes | xDMS source | Amiga disk archiver |
| LZX (Amiga) | .lzx | Yes | Yes | Amiga LZX format | Amiga LZX |
| Compact Pro | .cpt | Yes | Yes | XADMaster cpt.c | Classic Mac format |
| Spark | .spark | Yes | Yes | RISC OS Spark | RISC OS format |
| LBR | .lbr | Yes | Yes | CP/M LBR | CP/M format |
| UHARC | .uha | Yes | Yes | UHARC docs | LZP compression |
| WAD (Doom) | .wad | Yes | Yes | Doom Wiki WAD | Doom WAD format |
| WAD2/WAD3 | .wad | Yes | Yes | Quake Wiki WAD | Quake/Half-Life texture archive |
| XAR | .xar | Yes | Yes | XAR on-disk format | Apple .pkg (zlib TOC) |
| ALZip | .alz | Yes | Yes | ALZ format | Korean archive (Deflate) |
| VPK | .vpk | Yes | Yes | Valve VPK | Valve game archive |
| BSA/BA2 | .bsa,.ba2 | Yes | Yes | BSA format | Bethesda game archive |
| MPQ | .mpq | Yes | Yes | ZezulaMPQ docs | Blizzard: WORM v1 with stored entries, encrypted hash+block tables, self-referential (listfile) |
| GRP | .grp | Yes | Yes | BUILD Engine docs | BUILD Engine (Duke Nukem 3D) |
| HOG | .hog | Yes | Yes | Descent HOG | Descent game archive |
| BIG | .big | Yes | Yes | EA BIG format | EA Games (C&C, FIFA) |
| Godot PCK | .pck | Yes | Yes | Godot PCK spec | Godot Engine resource pack |
| WARC | .warc | Yes | Yes | ISO 28500 | Web archive: WORM emits one resource record per input file |
| NDS | .nds | Yes | Yes | GBATEK NDS | Nintendo DS ROM: WORM emits valid NitroFS (no ARM9/ARM7 boot code) |
| NSA | .nsa | Yes | Yes | NScripter docs | NScripter: WORM writes stored entries (compression type 0) |
| SAR | .sar | Yes | Yes | NScripter docs | NScripter: uncompressed variant of NSA |
| PackIt | .pit | Yes | - | XADMaster packit.c | Classic Mac format |
| DiskDoubler | .dd | Yes | Yes | XADMaster DD | Classic Mac compression: WORM stores data fork (method 0) |
| MSI | .msi | Yes | Yes | MS-CFB | OLE Compound File: WORM produces a CFB envelope (not a functional Installer DB) |
| PDF | .pdf | Yes | Yes | ISO 32000 | Image extraction + WORM via file attachments (EmbeddedFiles); any file type round-trips |
| TNEF | .tnef,.dat | Yes | Yes | MS-OXTNEF | Outlook winmail.dat |
| Split File | .001 | Yes | Yes | | Multi-part file joining/splitting |
| FreeArc | .arc | Yes | Yes | FreeArc source | FreeArc archive |
| CHM | .chm | Yes | Yes | CHM file format | MS Compiled HTML Help: WORM stores files in section 0 (uncompressed); LZX compression available via options |
| Wrapster | - | Yes | - | XADMaster wrapster.c | MP3 wrapper archive |
| LhF | .lhf | Yes | Yes | XADMaster | Amiga LhFloppy disk (LZH-compressed tracks) |
| ZAP | .zap | Yes | Yes | XADMaster | Amiga disk archiver: WORM writes stored tracks |
| PackDisk | .pdsk | Yes | Yes | XADMaster | Amiga PackDisk: WORM writes stored tracks. Same writer covers DCS / xDisk / xMash via different magics. |
| AMPK | - | Yes | - | XADMaster | Amiga AMPK |
| IFF-CDAF | - | Yes | - | IFF spec | IFF-CDAF archive |
| UMX | .umx | Yes | Yes | Beyond Unreal wiki | Unreal package: WORM emits valid header (detection-only) |

ZIP-derived containers

All delegate to the ZIP reader/writer, except CBR, which delegates to the RAR writer. WORM (Yes) means a fresh container can be produced with the correct internal layout.

| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| JAR | .jar | Yes | Yes | JAR spec | Java archive |
| WAR | .war | Yes | Yes | Java EE WAR | Java web archive |
| EAR | .ear | Yes | Yes | Java EE EAR | Java enterprise archive |
| APK | .apk | Yes | Yes | Android APK | Android package |
| IPA | .ipa | Yes | Yes | Apple IPA bundle | iOS package |
| APPX | .appx,.msix | Yes | Yes | MS-APPXPKG | Windows package |
| XPI | .xpi | Yes | Yes | Mozilla XPI | Firefox extension |
| CRX | .crx | Yes | Yes | Chrome CRX3 | Chrome extension: WORM emits unsigned CRX3 envelope (browser rejects signature) |
| EPUB | .epub | Yes | Yes | EPUB 3 spec | eBook |
| MAFF | .maff | Yes | Yes | MAFF spec | Mozilla Archive Format |
| KMZ | .kmz | Yes | Yes | KML spec | Google Earth |
| NuPkg | .nupkg | Yes | Yes | NuGet spec | NuGet package |
| DOCX | .docx | Yes | Yes | ECMA-376 OOXML | Word |
| XLSX | .xlsx | Yes | Yes | ECMA-376 OOXML | Excel |
| PPTX | .pptx | Yes | Yes | ECMA-376 OOXML | PowerPoint |
| ODT | .odt | Yes | Yes | OASIS ODF | OpenDocument Text |
| ODS | .ods | Yes | Yes | OASIS ODF | OpenDocument Spreadsheet |
| ODP | .odp | Yes | Yes | OASIS ODF | OpenDocument Presentation |
| CBZ | .cbz | Yes | Yes | Comic book archive | Comic book ZIP |
| CBR | .cbr | Yes | Yes | Comic book archive | Comic book RAR: delegates to RarWriter |

OLE2 Compound File variants

Microsoft binary-office formats built on the OLE2 / Compound File Binary (CFB) container. WORM creation produces a structurally valid CFB envelope that round-trips through our reader and other permissive CFB tools such as libgsf / Apache POI, but the result is not a real Word/Excel/PowerPoint/Outlook document: those require generating each application's internal binary stream layout, which is out of scope. Limitations: ~6.8 MB total file size (109 FAT sectors, no DIFAT chain), single root storage, stream names ≤ 31 UTF-16 chars.

| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| DOC | .doc | Yes | Yes | MS-DOC | Word 97-2003 (CFB envelope, not a real Word document) |
| XLS | .xls | Yes | Yes | MS-XLS | Excel 97-2003 (CFB envelope, not a real workbook) |
| PPT | .ppt | Yes | Yes | MS-PPT | PowerPoint 97-2003 (CFB envelope, not a real presentation) |
| MSG | .msg | Yes | Yes | MS-OXMSG | Outlook message (CFB envelope, not real MAPI properties) |
| Thumbs.db | Thumbs.db | Yes | Yes | Forensics docs | Windows thumbnail cache (CFB envelope, not real Catalog layout) |
| MSI | .msi | Yes | Yes | MS-MSI | Windows Installer (CFB envelope, not a functional Installer DB) |
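The ~6.8 MB ceiling quoted for the CFB writer follows from the header layout. The Python sketch below reproduces the arithmetic under two assumptions that the README does not state explicitly: 512-byte sectors (as in CFB version 3) and 4-byte FAT entries. The function name is hypothetical.

```python
HEADER_FAT_SLOTS = 109  # FAT sector locations stored in the CFB header
SECTOR_SIZE = 512       # assumption: version 3 compound file
FAT_ENTRY_SIZE = 4      # each FAT entry is a 32-bit sector number


def cfb_capacity_without_difat(
    fat_slots: int = HEADER_FAT_SLOTS,
    sector_size: int = SECTOR_SIZE,
) -> int:
    """Bytes addressable when only the header's FAT slots are used."""
    entries_per_fat_sector = sector_size // FAT_ENTRY_SIZE  # 128
    addressable_sectors = fat_slots * entries_per_fat_sector
    return addressable_sectors * sector_size


capacity = cfb_capacity_without_difat()
print(capacity, "bytes =", capacity / 2**20, "MiB")
```

The figure counts every sector the FAT can describe, including the FAT and directory sectors themselves, so the usable payload is somewhat smaller.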

Compression stream formats

Single-stream compressors. Compress/Decompress indicate the two halves of the algorithm.

| Format | Extensions | Compress | Decompress | Reference |
|---|---|---|---|---|
| Gzip | .gz | Yes | Yes | RFC 1952 |
| BZip2 | .bz2 | Yes | Yes | bzip2 source |
| XZ | .xz | Yes | Yes | XZ format |
| Zstandard | .zst | Yes | Yes | RFC 8878 |
| LZ4 | .lz4 | Yes | Yes | LZ4 frame format |
| Brotli | .br | Yes | Yes | RFC 7932 |
| Snappy | .sz,.snappy | Yes | Yes | Snappy framing |
| LZOP | .lzo | Yes | Yes | lzop source |
| compress (.Z) | .Z | Yes | Yes | ncompress |
| LZMA | .lzma | Yes | Yes | 7-Zip LZMA SDK |
| Lzip | .lz | Yes | Yes | lzip format |
| Zlib | .zlib | Yes | Yes | RFC 1950 |
| SZDD | .sz_ | Yes | Yes | compress.exe format |
| KWAJ | - | Yes | Yes | MS compress formats |
| RZIP | .rz | Yes | Yes | rzip docs |
| MacBinary | .bin | Yes | Yes | RFC 1740 |
| BinHex | .hqx | Yes | Yes | RFC 1741 |
| Squeeze | .sqz | Yes | Yes | Squeeze format |
| PowerPacker | .pp | Yes | Yes | Amiga PP20 |
| ICE Packer | .ice | Yes | Yes | Atari ST ICE |
| PackBits | .packbits | Yes | Yes | Apple PackBits |
| Yaz0 (SZS) | .yaz0,.szs | Yes | Yes | Nintendo Yaz0 RE |
| BriefLZ | .blz | Yes | Yes | BriefLZ source |
| RNC | .rnc | Yes | Yes | Rob Northen RE |
| RefPack / QFS | .qfs,.refpack | Yes | Yes | RefPack RE |
| aPLib | .aplib | Yes | Yes | aPLib docs |
| LZFSE | .lzfse | Yes | Yes | Apple LZFSE source |
| Freeze | .f,.freeze | Yes | Yes | Unix Freeze |
| uuencoding | .uu,.uue | Yes | Yes | POSIX uuencode |
| yEnc | .yenc | Yes | Yes | yEnc spec |
| Density | .density | Yes | Yes | Density source |
| LZG | .lzg | Yes | Yes | LZG source |
| BCM | .bcm | Yes | Yes | BCM source |
| BSC | .bsc | Yes | Yes | libbsc |
| BALZ | .balz | Yes | Yes | BALZ source |
| CSC | .csc | Yes | Yes | CSC source |
| Zling | .zling | Yes | Yes | libzling |
| Lizard | .lizard | Yes | Yes | Lizard source |
| QuickLZ | .quicklz | Yes | Yes | QuickLZ docs |
| cmix | .cmix | Yes | Yes | cmix source |
| MCM | .mcm | Yes | Yes | MCM source |
| PAQ8 | .paq8 | Yes | Yes | Matt Mahoney PAQ page |
| SWF | .swf | Yes | Yes | SWF 19 spec |
| CP/M Crunch | .cru | Yes | Yes | CP/M CRUNCH |
| PPMd | .pmd | Yes | Yes | Shkarin PPMd |
| LZHAM | .lzham | Yes | Yes | LZHAM source |
| LZS | .lzs | Yes | Yes | RFC 1967 / RFC 2395 |
| FLAC | .flac | Yes | Yes | FLAC format |

Compound formats

tar.gz, tar.bz2, tar.xz, tar.zst, tar.lz4, tar.lz, tar.br — auto-detected, both read and write.
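One plausible way to auto-detect a compound format is to recognise the outer compressor by its magic, decompress only the first 512-byte block, and look for the `ustar` marker at offset 257. The Python sketch below illustrates that idea with standard-library codecs for three of the listed variants; it is an assumption about the approach, not a description of the project's detector, and `detect_compound` is a hypothetical name.

```python
import bz2
import gzip
import io
import lzma
from typing import Optional

# Outer-layer magics mapped to (suffix, stream opener). Only codecs that
# ship with Python are covered in this sketch.
OUTER = {
    b"\x1f\x8b": ("gz", gzip.open),
    b"BZh": ("bz2", bz2.open),
    b"\xfd7zXZ\x00": ("xz", lzma.open),
}


def detect_compound(blob: bytes) -> Optional[str]:
    """Return 'tar.gz', 'tar.bz2', 'tar.xz', a bare suffix, or None."""
    for magic, (suffix, opener) in OUTER.items():
        if blob.startswith(magic):
            with opener(io.BytesIO(blob)) as stream:
                head = stream.read(512)  # first tar header block only
            # POSIX and GNU tar headers both carry "ustar" at offset 257.
            if head[257:262] == b"ustar":
                return "tar." + suffix
            return suffix
    return None
```

Reading only the first block keeps detection cheap even for multi-gigabyte archives, since the inner stream is never fully decompressed.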

Modern packaging

| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| AppImage | .AppImage | Yes | - | AppImage spec | ELF stub + appended SquashFS; offset located by ELF section-end + magic scan |
| Snap | .snap | Yes | - | snapd source | SquashFS with meta/snap.yaml |
| MSIX | .msix,.msixbundle | Yes | - | MSIX spec | Modern Windows app package (mirrors APPX) |
| ESD | .esd | Yes | - | WIM/ESD overview | Windows Update encrypted-LZMS WIM; shares MSWIM\0\0\0 magic, extension-only |
| Split WIM | .swm,.swmN | Yes | - | WIM spec | Multi-part WIM volume |
| WACZ | .wacz | Yes | - | WACZ 1.0.0 | Web Archive Collection Zipped: ZIP around WARC + datapackage.json |
| Python Wheel | .whl | Yes | - | PEP 427 | ZIP with dist-info/METADATA, WHEEL, RECORD |
| Ruby Gem | .gem | Yes | - | gem spec | TAR with metadata.gz, data.tar.gz, checksums.yaml.gz |
| Rust Crate | .crate | Yes | - | cargo spec | TAR.GZ with single name-version/ directory containing Cargo.toml |

Firmware and embedded

| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| U-Boot uImage | .uimg,.img,.bin | Yes | - | U-Boot image.h | 64-byte legacy header + body; reports OS/arch/comp; decompresses payload when possible |
| Device Tree Blob | .dtb,.dtbo | Yes | - | DT spec | FDT v17, walks property tree as pseudo-archive |
| Intel HEX | .hex,.ihex | Yes | - | Intel HEX spec | ASCII firmware records, decoded to flat firmware.bin + metadata |
| Motorola S-Record | .s19,.s28,.s37,.srec,.mot | Yes | - | SREC spec | 16/24/32-bit address records |
| TI-TXT | - | Yes | - | MSP430 programming | MSP430 firmware text, address blocks |
| UEFI Firmware Volume | .fv,.fd,.rom,.bin | Yes | - | UEFI PI vol.3 | _FVH at offset 40, walks FFS files |

Disk-image + forensics

| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| VHDX | .vhdx | Yes | - | MS VHDX spec | Hyper-V modern; surfaces File Type ID + 2 headers + 2 region tables (BAT walk deferred) |
| EWF/E01 | .e01,.ewf,.l01 | Yes | - | libewf docs | EnCase forensic image; section-chain walker, header2 + MD5/SHA1 |
| G64 | .g64 | Yes | - | VICE G64 docs | Commodore GCR-encoded track dump (1541) |
| NIB | .nib | Yes | - | nibtools docs | Commodore raw nibble track dump |

Scientific and ML

| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| NumPy NPY | .npy | Yes | - | NEP 1 / npy-format | Single ndarray header + raw bytes |
| NumPy NPZ | .npz | Yes | - | savez docs | ZIP of NPYs |
| NIfTI-1/2 | .nii,.nii.gz | Yes | - | NIfTI spec | Medical imaging (MRI); 352-byte v1 / 540-byte v2 header + voxel data; transparent gzip |
| HDF4 | .hdf,.hdf4,.h4 | Yes | - | HDF4 reference | DD linked-list walker, tag histogram + per-DD entry |
| ONNX | .onnx | Yes | - | ONNX proto | Pure-C# protobuf reader; surfaces graph initializers as entries |

CAD / 3D

| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| STL | .stl | Yes | - | STL spec | ASCII + binary; triangle count, bounding box, name |
| PLY | .ply | Yes | - | Stanford PLY | ASCII / binary LE/BE, element schema |
| DXF | .dxf | Yes | - | Autodesk DXF ref | AutoCAD ASCII; section list + entity histogram |
| Collada | .dae | Yes | - | Khronos Collada 1.5 | XML 3D interchange |
| 3DS | .3ds | Yes | - | lib3ds docs | Autodesk binary chunks |

Medical imaging

| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| DICOM | .dcm | Yes | - | NEMA DICOM PS3 | Single DICOM image |
| DICOMDIR | .dcmdir,DICOMDIR | Yes | - | DICOM PS3.10 | Multi-study patient/series index referencing sibling DICOM files |

Streaming and subtitle

| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| SUP (PGS) | .sup | Yes | - | PGS RE doc | Blu-ray Presentation Graphic Stream subtitle segments, grouped by epoch |
| VobSub | .idx + .sub | Yes | - | MPlayer vobsub | DVD subtitle pair; parses .idx palette/timestamps + slices sibling .sub PES |
| HLS M3U8 | .m3u8,.m3u | Yes | - | RFC 8216 | HTTP Live Streaming manifest |
| MPEG-TS | .ts,.m2ts,.mts | Yes | - | ITU-T H.222.0 | MPEG-2 Transport Stream demuxed into per-PID elementary streams |

Audio codecs

Standalone audio codecs live under Codec.* projects (separate from container-format descriptors). Each exposes a static Decompress(Stream input, Stream output) producing interleaved little-endian PCM and a ReadStreamInfo for metadata-only access. Encoders are explicitly out of scope for the new codecs — only the legacy ones ship encoders.

| Codec | Project | Encoder | Decoder state | Reference |
|---|---|---|---|---|
| PCM | Codec.Pcm | Yes | Production: raw integer PCM up to 32-bit | |
| FLAC | Codec.Flac | Yes | Production: FIXED + LPC subframes, all sample rates / bit depths | xiph.org/flac |
| A-law | Codec.ALaw | Yes | Production: G.711 | ITU-T G.711 |
| μ-law | Codec.MuLaw | Yes | Production: G.711 | ITU-T G.711 |
| GSM 06.10 | Codec.Gsm610 | Yes | Production: full RPE-LTP | ETSI GSM 06.10 |
| IMA ADPCM | Codec.ImaAdpcm | Yes | Production: Microsoft + Apple variants | IMA ADPCM spec |
| MS ADPCM | Codec.MsAdpcm | Yes | Production: WAV format 0x0002 | MS ADPCM spec |
| MIDI | Codec.Midi | Yes | Production: SMF 0/1/2 with all standard meta + channel events | MIDI 1.0 spec |
| MP3 | Codec.Mp3 | - | Header + framing complete; bit-exact decode unverified. minimp3 port (1469 LOC, scalar) covering MPEG-1/2/2.5 Layer III, MS+intensity stereo, ID3v2 skip, Xing VBR. Layer I/II rejection passes. End-to-end PCM decode against a reference clip is deferred until an MP3 test vector lands in test-corpus/. | ISO/IEC 11172-3 / minimp3 |
| Vorbis | Codec.Vorbis | - | Partial: stb_vorbis structural port (1295 LOC) covering Ogg page reassembly, codebooks (lookup 0/1/2), floor 1, residue 0/1/2, channel coupling, IMDCT. Floor 0 throws NotSupportedException. End-to-end test marked Inconclusive until a test vector lands in test-corpus/. | Vorbis I spec |
| Opus | Codec.Opus | - | Framing only: Ogg page walker + OpusHead/OpusTags + TOC byte + frame packing modes 0/1/2/3 + range decoder (ec_dec) all real. CELT and SILK pipelines are stubs that emit silence at the correct sample count. Hybrid mode throws NotSupportedException. | RFC 6716 |
| AAC-LC | Codec.Aac | - | Framing only: ADTS frame parser + AudioSpecificConfig + element dispatcher + profile gating real. Spectral pipeline + Huffman tables + IMDCT + filterbank scaffolded but spectral data tables are TODO. HE-AAC v1/v2 + Main/SSR/LTP/ER all throw NotSupportedException. | ISO/IEC 14496-3 |
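For a sense of what a "Production" decoder in this table does, G.711 μ-law expansion is small enough to sketch in full. The Python below uses the widely published expansion formula and writes 16-bit little-endian PCM; it is an illustration with hypothetical function names, not the Codec.MuLaw source, and it does not model the Decompress(Stream, Stream) signature.

```python
import struct


def mulaw_decode_sample(code: int) -> int:
    """Expand one 8-bit G.711 mu-law code to a signed 16-bit sample."""
    u = ~code & 0xFF             # codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07   # segment number
    mantissa = u & 0x0F          # position within the segment
    # 0x84 is the bias (132) added before compression.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def mulaw_to_pcm16le(encoded: bytes) -> bytes:
    """Decode a mu-law byte stream to little-endian 16-bit PCM."""
    samples = [mulaw_decode_sample(b) for b in encoded]
    return struct.pack("<%dh" % len(samples), *samples)


print([mulaw_decode_sample(c) for c in (0xFF, 0x80, 0x00)])
```

Because the mapping is per sample, multi-channel input that is interleaved before decoding stays interleaved afterwards.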

Implementation philosophy. The four new audio codecs (MP3 / Vorbis / Opus / AAC-LC) ship under the project's "no toy implementations" rule — partial state is documented openly (in class summaries, in Assert.Ignore messages, and in this table) rather than silently producing wrong PCM. Future work: bit-pack debugging for MP3, real CELT/SILK for Opus, spectral table population for AAC, reference test-vector validation across all four.

Known partial implementations

Code paths that throw NotSupportedException or NotImplementedException rather than silently producing wrong output. Documented here so expectations match behaviour.

| Area | State |
|---|---|
| MP3 / Vorbis / Opus / AAC-LC | Partial decoder state; see the Audio codecs table. MP3 bit-exact needs test vectors; Vorbis floor 0 throws (obsolete since 2004; stb_vorbis doesn't implement it either); Opus CELT/SILK + AAC spectral filterbank are multi-week DSP projects |
| LZFSE V1 / V2 blocks | FSE/tANS backend not implemented; uncompressed (bvxn) + LZVN blocks work. Full LZFSE needs ~1500 LOC new code (Apple reference impl) |
| ZPAQ | Reader requires a ZPAQL virtual machine (not implemented). Multi-week bytecode-VM project |
| StuffIt X writer | Proprietary element-catalog / P2-varint writer not implemented; WORM emits valid StuffIt! envelope shell. No public spec, only reverse-engineering notes |
| UMX writer | Full export table + compact-index music encoding not implemented; WORM emits valid header only |
| OLE2 application streams (DOC/XLS/PPT/MSG/ThumbsDb/MSI) | CFB envelope round-trips through our reader and libgsf/Apache POI, but the internal WordDocument / WorkBook / PowerPoint Document / MAPI / Catalog / Installer-DB streams are not synthesised. Each is a 400+ page MS Open Specification |
| Inno Setup reader | Individual file extraction from Setup.1 not implemented for some installer versions |
| EROFS | Compressed layouts (LZ4/LZMA-compressed inodes) not decompressed; plain-storage inodes work |
| ExtRemover | Indirect-block traversal not implemented for file removal (direct blocks work) |
| F2FS writer | Indirect-block allocation not implemented; per-file max ≈ 3.6 MB (923 direct pointers in inode, no direct_node/indirect_node chain) |
| RAR create | Only v4 and v5 archive creation is implemented |

Fixes landed in this pass (documented here so the list above is what's actually pending):

| Area | Before | After |
|---|---|---|
| CAB LZX | Enum marked "(not implemented)" | Comment was stale: reader (LzxDecompressor) and writer (BB_Lzx) were already wired |
| MPQ bzip2 (method 0x10) | Returned payload raw | Now invokes Bzip2Stream on the payload, falls back to raw on decode failure |
| FAT32 writer | Threw NotSupportedException for images ≥ 65525 clusters | Full FATGEN103-compliant FAT32: extended BPB (BPB_RootClus/BPB_FSInfo/BPB_BkBootSec), FSInfo sector with the three canonical signatures, backup boot sector at sector 6, cluster-2 root directory with EoC marker, 32 reserved sectors, FS-type string FAT32 |
| ProDOS tree storage | Files > 128 KB rejected outright | Writer emits storage-type-3 trees: master index block + up to 256 subordinate index blocks → 32 MB per file. Reader already handled type 3 |

Filesystem images

Snapshot: 41 filesystems, 37 read+write, 4 read-only. Spec = the external document/source the writer was validated against.

Windows / DOS native

FS State Spec Notes
FAT12/16/32 R/W Microsoft FATGEN Full BPB, 0x55 0xAA signature, auto-select FAT12/16/32 by cluster count. FAT32 includes extended BPB, FSInfo sector + backup boot sector + cluster-2 root directory
exFAT R/W Microsoft exFAT spec Full VBR, boot-checksum sector (§3.1.3), Upcase/Bitmap/VolumeLabel root entries
NTFS R/W MS-NT on-disk / TSK docs All 16 system MFT files, USA fixup, LZNT1 compression
DoubleSpace / DriveSpace CVF R/W MS-DOS 6 Technical Reference Full MDBPB with DBLS/DVRS signature, MDFAT + BitFAT, inner FAT12/16 with VFAT LFN. Stored runs only (JM/LZ77 is TODO)
HPFS RO OS/2 Inside Story Read-only descriptor (no writer)
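
The auto-selection by cluster count in the FAT12/16/32 row follows the thresholds given in Microsoft's FAT specification. A minimal sketch in Python (not the project's C# code; the function name is hypothetical):

```python
def fat_type_for_cluster_count(cluster_count: int) -> str:
    """Return the FAT variant implied by the number of data clusters."""
    if cluster_count < 0:
        raise ValueError("cluster count cannot be negative")
    if cluster_count < 4085:
        return "FAT12"
    if cluster_count < 65525:
        return "FAT16"
    # 65525 clusters and above is the range the FAT32 writer covers.
    return "FAT32"
```

The FAT type is a function of the cluster count alone, which is why a writer can pick it after the image size and cluster size are known.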

Unix / Linux

FS State Spec Notes
ext2/3/4 R/W Linux kernel fs/ext2/ext2.h Spec-compliant, random UUID; FS revision 0 GOOD_OLD
XFS v5 R/W Linux fs/xfs/libxfs/xfs_format.h v5 with v3 dinodes, sb_crc CRC-32C, sb_features_*/sb_meta_uuid/sb_pquotino
JFS R/W Linux fs/jfs/jfs_superblock.h pxd_t bit-packing (24-bit length + 40-bit address), inline dtree root, aggregate inode table with FILESYSTEM_I=16
ReiserFS 3.6 R/W Linux fs/reiserfs/reiserfs.h Spec-correct offsets, ReIsEr2Fs @+52, leaf block-head. No block CRC (v3.6 doesn't have them)
F2FS R/W Linux include/linux/f2fs_fs.h Superblock magic at block-offset 0x400, CP + SIT + NAT + SSA + Main, CRC-32C, inline dentries in root inode
Btrfs R/W Btrfs on-disk format Real chunk tree (SYSTEM/METADATA/DATA), sys_chunk_array in superblock, DEV_ITEM, CRC-32C on every block header
ZFS R/W OpenZFS source 4 vdev labels, 128-entry uberblock ring with Fletcher-4, XDR NVList, MOS + DSL dir/dataset + microzap for root, pool version 28
UFS1/FFS R/W FreeBSD sys/ufs/ffs/fs.h fs_magic=0x011954 at sb offset 1372, cg_magic, fs_cs summary block
UBIFS RO Linux fs/ubifs/ Read-only; no writer (LPT/TNC trees are multi-week)
JFFS2 RO Linux fs/jffs2/ Read-only; log-structured node-scanner only
YAFFS2 RO Aleph One YAFFS2 spec Read-only; OOB/ECC layout not emittable
BFS (BeOS/Haiku) RO Haiku OS source Read-only; superblock surfacing only

Apple / classic Mac

FS State Spec Notes
HFS classic R/W Apple "Inside Macintosh: Files" (1992) Real B-tree catalog + extents trees, 102-byte file records, 70-byte dir records, 46-byte thread records with (parent, name) sort
HFS+ R/W Apple TN1150 Catalog file record at spec 248 bytes; dataFork @ 88, resourceFork @ 168
APFS R/W Apple File System Reference NX superblock + container OMAP + APSB volume + FS B-tree, Fletcher-64 checksums, single-container WORM
MFS R/W Inside Macintosh V (1985) Pre-HFS flat FS; drSigWord=0xD2D7

Retro / 8-bit

FS State Spec Notes
Commodore 1541 (.d64) R/W VICE emulator docs 174 848 bytes, 35 tracks, directory at T18S1+
Commodore 1571 (.d71) R/W VICE docs 349 696 bytes, dual-side BAM
Commodore 1581 (.d81) R/W VICE docs 819 200 bytes, 80 × 40 × 256, DOS "3D" signature
C64 tape (.t64) R/W T64 format spec "C64S tape image file" header
Amiga ADF (OFS/FFS) R/W Amiga ROS docs 901 120 (DD) / 1 802 240 (HD), "DOS\1" magic, BSDsum checksums
Amiga DMS R/W xDMS source "DMS!" header with CRC16
Atari ST MSA R/W MSA format spec 0x0E0F BE magic, per-track RLE
Atari 8-bit ATR R/W AtariDOS 2 VTOC 16-byte header + 92 160 sector bytes, VTOC @ sector 360
Apple DOS 3.3 R/W Apple DOS manual 143 360 bytes, catalog at T17S15 chain, 35-byte entries
ProDOS R/W ProDOS TRM 143 360 (5.25") / 819 200 (800K), 39-byte entries
BBC Micro DFS (.ssd) R/W Acorn DFS spec 102 400 (40-track) / 204 800 (80-track), 31×8-byte dir entries
ZX Spectrum SCL R/W TR-DOS .scl spec "SINCLAIR" magic + LE32 trailing sum
ZX Spectrum TR-DOS (.trd) R/W TR-DOS spec 655 360 bytes, 160×16×256
Amstrad CPC DSK R/W CPCEMU disk format "MV - CPCEMU Disk-File" magic
HP LIF (.lif) R/W HP LIF utility manual 256-byte sectors, flat directory, 0x8000 BE magic
CP/M 2.2 R/W DR CP/M 2.2 BDOS reference 256 256 bytes (77×26×128), 64-entry flat directory
DEC RT-11 (.rt11/.rx01) R/W DEC RT-11 Volume + File Formats RX01 8" SSSD ~256 KB, 512-byte blocks, RAD-50 6.3 names
OS-9 RBF (.os9/.rbf) R/W Microware OS-9 Tech Reference CoCo 35-track DSDD ~315 KB, 256-byte sectors, big-endian
Commodore G64 (.g64) RO VICE emulator docs GCR-encoded track dump; raw GCR bytes per track
Commodore NIB (.nib) RO nibtools docs Raw 84-half-track nibble dump
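
Several of the image sizes in the table follow directly from the disk geometry. The Python sketch below recomputes a few of them as a sanity check; the helper names are hypothetical, and the 1541 zone layout (21/19/18/17 sectors per track) is the standard one rather than something taken from this project's source.

```python
def uniform_image_size(tracks: int, sectors_per_track: int, sector_bytes: int) -> int:
    """Size of an image whose tracks all hold the same number of sectors."""
    return tracks * sectors_per_track * sector_bytes


def d64_image_size() -> int:
    """Size of a 35-track 1541 image, which uses four speed zones."""
    # (number of tracks, sectors per track) for tracks 1-17, 18-24, 25-30, 31-35
    zones = [(17, 21), (7, 19), (6, 18), (5, 17)]
    total_sectors = sum(track_count * sectors for track_count, sectors in zones)
    return total_sectors * 256


print(d64_image_size())                  # Commodore 1541 (.d64)
print(uniform_image_size(80, 40, 256))   # Commodore 1581 (.d81)
print(uniform_image_size(77, 26, 128))   # CP/M 2.2 8" SSSD
print(uniform_image_size(160, 16, 256))  # TR-DOS (.trd)
```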

Optical

FS State Spec Notes
ISO 9660 R/W ECMA-119 PVD at sector 16, VDST @ 17, L+M path tables, 2 KB blocks, flat directory (no Rock Ridge/Joliet)
UDF R/W ECMA-167 VRS (BEA01/NSR02/TEA01) @ 16-18, Main VDS @ 32-35, AVDP @ 256. CRC-16-XMODEM + TagChecksum on every tag

Embedded / flash

FS State Spec Notes
SquashFs R/W SquashFs 4.0 spec hsqs magic, zlib + Adler-32, FlagNoFragments
CramFs R/W Linux fs/cramfs/ 0x28CD3D45 magic, CRC-32, zlib blocks
RomFs R/W Linux fs/romfs/ -rom1fs- magic, BE fields, self-correcting checksum
EROFS RO Linux fs/erofs/ Read-only; variable-length encoded inodes
Minix v1/2/3 R/W Linux fs/minix/ Superblock magics 0x137F/0x138F/0x2468/0x2478/0x4D5A
VDFS R/W (writer toy) Gothic-game engine reverse engineering Proprietary, no public spec; writer is flat-no-checksum (left in place until spec available)

Disk- and disc-image containers

Containers holding filesystem payloads; the inner FS is a separate descriptor.

Format State Reference Notes
VHD (Microsoft) R/W VHD format spec "conectix" magic, fixed + dynamic
VMDK (VMware) R/W VMDK spec "KDMV" magic
QCOW2 (QEMU) R/W QCOW2 docs Sparse format; WORM wraps raw disk with L1/L2, v2
VDI (VirtualBox) R/W VBox source Single-disk format
BIN/CUE R/W cue sheet Raw disc image; WORM emits ISO 9660 cooked sectors
MDF R/W Alcohol docs Alcohol 120% — WORM emits ISO 9660
NRG R/W Nero format docs Nero — WORM emits ISO 9660 with NER5 footer
CDI R/W DiscJuggler docs DiscJuggler — WORM emits ISO 9660 with CDI v2 footer
DMG R/W libdmg-hfsplus Apple disk image — WORM emits raw mish blocks per partition (no zlib/bz2/lzfse encoding); read-side handles all four compressions

Executable packers

CompressionWorkbench treats executable packers (UPX, demoscene compressors, classic DOS packers, modern PE protectors) as pseudo-archives — they get the same List / Extract interface as ZIP or TAR. Each descriptor surfaces a metadata.ini with detected evidence (version byte, signature offset, packer-header fields), an mz_header.bin / hunk_header.bin snapshot when applicable, and a packed_payload.bin (or an in-process decompressed body for UPX).

UPX (Ultimate Packer for eXecutables)

Detection is hardened against tampered binaries: the structural fingerprint (BSS-style first section + RWX flags + entry point in last section + payload entropy ≥ 7.5) catches binaries where the UPX! magic and section names have been wiped. A brute-force PackHeader scan validates format / method / uLen / cLen / level / version even when the magic bytes are zeroed.

Capability State Notes
Detection (canonical UPX) Yes Section names UPX0/UPX1/UPX2; UPX! packer-header magic; $Info: This file is packed with the UPX… tooling banner
Detection (tampered) Yes Brute-force PackHeader scan validates every field even with magic wiped. Structural fingerprint requires BSS-style first section (RawSize=0 + VirtualSize>0)
NRV2B_LE32 (method 2) Yes In-process via BB_Nrv2b
NRV2D_LE32 (method 3) Yes In-process via BB_Nrv2d (UCL-spec port)
NRV2E_LE32 (method 8) Yes In-process via BB_Nrv2e (UCL-spec port)
NRV2B_LE16 / LE8 (methods 4, 6) Yes Nrv2bBuildingBlock.DecompressRaw{Le16,Byte}
NRV2D_LE16 / LE8 (methods 5, 7) Yes Nrv2dBuildingBlock.DecompressRaw{Le16,Byte}
NRV2E_LE16 / LE8 (methods 9, 10) Yes Nrv2eBuildingBlock.DecompressRaw{Le16,Byte}
LZMA (method 14) Yes Via BB_Lzma
DEFLATE (method 15) No Rare in UPX output; deferred
PE header reconstruction No Surfaces decompressed_payload.bin as a raw blob; IAT reconstruction + OEP restoration is delegated to upx -d

Detection confidence is exposed as a 3-tier DetectionConfidence enum:

  • None — no evidence; the descriptor's List / Extract calls throw so that FormatDetector falls back to plain PE/ELF resource enumeration.
  • Heuristic — structural fingerprint match (BSS-style first section + RWX flags + entry in last section + high entropy) but no PackHeader.
  • Confirmed — PackHeader found (with or without intact magic), canonical section names present, or tooling banner intact.

The evidence record exposes every contributing signal so users can audit why a binary was flagged.
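
The tier rules above can be read as a small decision function. The following Python sketch is an illustration only: the `Evidence` fields and `classify` are hypothetical names, and the real C# evidence record may be shaped differently.

```python
from dataclasses import dataclass
from enum import Enum


class DetectionConfidence(Enum):
    NONE = 0
    HEURISTIC = 1
    CONFIRMED = 2


@dataclass(frozen=True)
class Evidence:
    # Hypothetical field names; each one mirrors a signal named in the text.
    pack_header_found: bool = False
    canonical_section_names: bool = False
    tooling_banner: bool = False
    bss_style_first_section: bool = False
    rwx_flags: bool = False
    entry_in_last_section: bool = False
    payload_entropy: float = 0.0


def classify(evidence: Evidence) -> DetectionConfidence:
    """Map collected signals to one of the three tiers."""
    if (evidence.pack_header_found
            or evidence.canonical_section_names
            or evidence.tooling_banner):
        return DetectionConfidence.CONFIRMED
    structural_match = (evidence.bss_style_first_section
                        and evidence.rwx_flags
                        and evidence.entry_in_last_section
                        and evidence.payload_entropy >= 7.5)
    return DetectionConfidence.HEURISTIC if structural_match else DetectionConfidence.NONE
```

Keeping the signals as separate fields, instead of collapsing them into a score, is what makes the result auditable.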

Demoscene and historical packers (detection-only)

Full decompression would require the original tool's runtime stub or a bespoke decompressor that has not been ported.

Packer Container Signature Reference
PKLITE DOS .exe PKLITE Copr. / PKlite Copr. in first 1 KB bp/pklite
LZEXE DOS .exe LZ91 / LZ09 signature in first 1 KB Bellard's page
Petite Win32 PE .petite* section name or Petite literal Un4seen Petite
Shrinkler Amiga HUNK HUNK magic (0x000003F3) + Shrinkler literal Blueberry's repo
FSG Win32 PE FSG! magic in first 16 KB x86asm forum
MEW Win32 PE Section name starting with .MEW / MEW Northfox page
MPRESS Win32 PE / Linux ELF .MPRESS1 / .MPRESS2 section or MPRESS / MATCODE literal MATCODE
Crinkler Win32 PE Crinkler / crinkler literal crinkler.net
kkrunchy Win32 PE kkrunchy literal Farbrausch
ASPack Win32 PE .aspack / .adata section or ASPack literal aspack.com
NsPack Win32 PE .nsp* section name or NsPack literal PEiD DB
Yoda's Crypter Win32 PE .yC / yC section or Yoda's literal Yoda's site
ASProtect Win32 PE ASProtect literal aspack.com/asprotect
Themida Win32 PE Themida / WinLicense literal Oreans
VMProtect Win32 PE .vmp* section or VMProtect literal vmpsoft.com

UCL-family building blocks

BB Description Status
BB_Nrv2b UCL NRV2B LE32 — LZ77 + interleaved variable-length integer bit stream Spec-faithful, UPX-compatible decoder
BB_Nrv2d UCL NRV2D LE32 — three-bit-per-iter offset varint + low-bit length tying Spec-faithful, UPX-compatible decoder
BB_Nrv2e UCL NRV2E LE32 — entropy-refined NRV2D variant Spec-faithful, UPX-compatible decoder
BB_Lzma LZMA dictionary compressor Pre-existing

Each UCL BB emits a 4-byte little-endian uncompressed-size header so the building block can round-trip standalone via IBuildingBlock.Compress / Decompress. Nrv2{b,d,e}BuildingBlock.DecompressRaw(compressed, exactOutputSize) helpers are available for callers reading bare streams (UPX payloads, OS/2 drivers, retro-computing collections) without the size header.
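
The framing described above is simple enough to show in full. This Python sketch illustrates only the 4-byte little-endian size header, not the NRV2 bit streams themselves; the function names are hypothetical and do not correspond to the C# API.

```python
import struct


def add_size_header(raw_stream: bytes, uncompressed_size: int) -> bytes:
    """Prefix a bare compressed stream with its uncompressed size (LE32)."""
    return struct.pack("<I", uncompressed_size) + raw_stream


def split_size_header(framed: bytes) -> tuple[int, bytes]:
    """Return (uncompressed_size, bare_stream) from a framed block."""
    if len(framed) < 4:
        raise ValueError("framed block is shorter than the 4-byte size header")
    (uncompressed_size,) = struct.unpack_from("<I", framed, 0)
    return uncompressed_size, framed[4:]
```

A caller that already knows the output size from elsewhere (for example a UPX PackHeader) has no such prefix to strip, which is the case the DecompressRaw helpers address.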


Tools

CompressionWorkbench exposes the same core library through five different surfaces. Pick the one that fits the task.

Compression.CLI — cwb

Universal archive tool with smart conversion, optimal re-encoding, benchmarking, and analysis built in.

Command Alias What it does
list <archive> l List contents of an archive
extract <archive> [files...] x Extract files from an archive
create <archive> <files...> c Create a new archive
test <archive> t Test archive integrity
info <archive> - Show detailed archive information
convert <input> <output> - Convert between archive formats
optimize <input> <output> opt Re-encode with optimal compression
benchmark <file> bench Benchmark all building blocks on the supplied data
analyze <file> - Run binary analysis (detection + entropy + trial decompress)
auto-extract <file> - Recursive nested extraction (see below)
batch <dir> - Scan a directory in parallel and aggregate format stats
suggest <file> - Platform-aware format recommendation
reverse-engineer <tool> reveng Black-box probing of an unknown compression tool
tool (init|list|add|run|remove) - Manage external-tool templates
formats - List all supported formats

Examples:

cwb list archive.zip
cwb extract archive.7z -o ./output
cwb x archive.rar -p mypassword
cwb create output.zip myDir file1.txt *.txt
cwb create output.7z file.txt --method lzma2+
cwb convert input.tar.gz output.tar.xz
cwb optimize input.zip optimized.zip
cwb benchmark largefile.bin
cwb analyze unknown.bin
cwb auto-extract sample.vhd --recursive
cwb suggest big.csv        # "→ consider zstd -19 (columnar/text, moderate entropy)"

3-tier conversion model. cwb convert picks the cheapest strategy that preserves data:

Tier Strategy Example
1 Bitstream transfer (zero decompression) .gz → .zlib, .zip → .gz
2 Container restream (decompress wrapper only) .tar.gz → .tar.xz
3 Full recompress (extract + re-encode) .zip → .7z

Method+ system. Append + to any method name for optimal encoding: deflate+ uses Zopfli, lzma+ uses Best, lz4+ uses HC.

Tool templates. cwb tool registers external CLI tools (7z, binwalk, file, trid, …) in ~/.cwb-tools.json. Templates use {input}, {output}, {outputDir} placeholders and can capture stdout, pipe stdin, or set a timeout. cwb tool init pre-populates templates for common tools.
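
To illustrate how placeholder templates of this kind are typically expanded, here is a Python sketch. It is an assumption about the general technique, not a description of how cwb parses ~/.cwb-tools.json; `expand_template` and its parameters are hypothetical.

```python
import shlex


def expand_template(template: str, *, input_path: str,
                    output_path: str = "", output_dir: str = "") -> list[str]:
    """Turn a command template into an argv list with placeholders filled in."""
    values = {
        "{input}": input_path,
        "{output}": output_path,
        "{outputDir}": output_dir,
    }
    argv = []
    # Split first, substitute second: a path containing spaces then stays
    # inside a single argv element and needs no extra quoting.
    for token in shlex.split(template):
        for placeholder, value in values.items():
            token = token.replace(placeholder, value)
        argv.append(token)
    return argv


print(expand_template("7z x {input} -o{outputDir}",
                      input_path="my file.7z", output_dir="out dir"))
```

The resulting list can be handed to `subprocess.run(argv, capture_output=True, timeout=...)` to capture stdout and enforce a timeout.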

Compression.UI — WPF browser + analyser + heatmap

The archive browser is the conventional half: file list with icons, columns (name, size, compressed, ratio, method, modified), open / extract / create / test flows, preview window (text + hex), properties dialog with compression-ratio visualisation, benchmark tool, and Explorer context-menu integration (Compression.Shell).

The analyser is the interesting half. When you drop an unknown binary on the UI, it never says "unsupported" — it shows you what the bytes look like. The Binary Analysis wizard has a toolbar that walks you through progressively deeper investigation:

  • Scan Results — every registered magic-byte signature that matches, with offsets and confidence.
  • Fingerprints — algorithm identification from byte-distribution and byte-pair statistics.
  • Entropy Map — per-region entropy profile with CUSUM change-point detection and 1D-Canny edge sharpening. Structured data (text, tables) shows low entropy; compressed/encrypted regions show high entropy; boundaries between them are marked.
  • Trial Decompress — runs every registered stream decompressor in parallel with per-trial timeout and early-terminates on a low-entropy output. If any decoder produces plausible output, it is offered for preview.
  • Chain — multi-layer compression reconstruction (e.g. gzip(bzip2(data))). Recursive trial decompression continues until entropy stops dropping.
  • Statistics — full byte distribution, bigram histogram, chi-square randomness test, longest run, run-length distribution.
  • Strings — ASCII / UTF-8 / UTF-16 string search with regex support.
  • Structure — ImHex/010-style .cwbt templates. Built-in templates ship for ZIP, PNG, BMP, ELF, Gzip; you can write your own using u8-u64 / i8-i64 / f16-f64 (LE/BE), char/u8 arrays, BCD, fixed-point, color, date/time, and network types with dynamic length via field references or repeat-to-EOF.
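
The Chain step can be approximated with Python's standard-library decoders. This sketch swaps in a simpler stopping rule than the one described above: it stops when no decoder accepts the data instead of tracking entropy, and it tries only four formats. `peel_layers` is a hypothetical name.

```python
import bz2
import gzip
import lzma
import zlib

# Standard-library decoders standing in for the registered stream decompressors.
DECODERS = {
    "gzip": gzip.decompress,
    "bzip2": bz2.decompress,
    "xz": lzma.decompress,
    "zlib": zlib.decompress,
}


def peel_layers(data: bytes, max_depth: int = 5) -> tuple[list[str], bytes]:
    """Strip nested compression layers; return (layer names, innermost bytes)."""
    layers: list[str] = []
    current = data
    for _ in range(max_depth):
        for name, decode in DECODERS.items():
            try:
                decoded = decode(current)
            except Exception:
                continue  # this decoder rejected the data, try the next one
            if decoded:
                layers.append(name)
                current = decoded
                break
        else:
            break  # no decoder accepted the data: innermost layer reached
    return layers, current


blob = gzip.compress(bz2.compress(b"hello world " * 50))
print(peel_layers(blob)[0])
```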

Heatmap Explorer

The Heatmap Explorer is the visual first pass. A 16×16 colour grid covers the file (or the currently selected region), and each of its 256 cells is a tile representing a proportional slice of that range.

Cell colour Meaning Entropy
Blue Low entropy — zeros, padding, simple headers 0.0–3.0
Green Structured data — tables, records, text 3.0–5.5
Orange Compressed data 5.5–7.5
Red Random / encrypted (incompressible) 7.5–8.0
Purple A known format signature was detected here any

Click any cell to subdivide into another 16×16 grid — it recursively zooms in on a region. Hovering shows offset, size, entropy, unique-byte count, and the detected signature (if any). Extract on a purple cell saves just that region to a file. The explorer only samples each block, so it handles arbitrarily large files without loading them into memory. Accessible from the analyser tab or standalone at Tools → Heatmap Explorer.
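
The colour bands in the table are bands of Shannon entropy in bits per byte. A Python sketch of the computation and the mapping follows; the function names are hypothetical, and treating each band as half-open (lower bound inclusive) is an assumption, since the table does not say which band owns a boundary value.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a byte block, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(block).values())


def cell_colour(entropy: float, has_signature: bool = False) -> str:
    """Map an entropy value to the heatmap colour bands."""
    if has_signature:
        return "purple"  # a known signature overrides the entropy colour
    if entropy < 3.0:
        return "blue"
    if entropy < 5.5:
        return "green"
    if entropy < 7.5:
        return "orange"
    return "red"


print(cell_colour(shannon_entropy(bytes(4096))))       # all zeros
print(cell_colour(shannon_entropy(bytes(range(256)))))  # every byte value once
```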

Compression.Analysis — the analyser as a library

Everything the UI exposes is available as a .NET library under Compression.Analysis:

  • Signature Scanner — magic-byte detection for every registered format (hash-indexed, O(n)).
  • Algorithm Fingerprinting — statistical fingerprinting against known compression-output distributions.
  • Trial Decompression — TryAllAsync runs every registered stream decompressor in parallel with per-trial timeout and early termination.
  • Chain Reconstruction — discovers layered compression.
  • Entropy Mapping — per-region entropy profiling with boundary detection; multi-resolution entropy pyramid (64 KB / 8 KB / 1 KB / 256 B), CUSUM binary segmentation, KL-divergence + chi-square boundary validation, 1D-Canny edge sharpening.
  • String Extraction — ASCII / UTF-8 / UTF-16 with regex.
  • Structure Templates — .cwbt template language.
  • Streaming Analysis — reads the first 64 KB for magic/header; computes entropy in 64 KB chunks; returns per-chunk entropy profiles for arbitrarily large files.
  • Black-box tool integration — ExternalToolRunner, ToolOutputParser, CrossValidator, FallbackDecompressor with auto-discovery of tools on PATH.
  • AutoExtractor — recursive nested extraction: archives inside archives, disk images → partition tables → filesystems → files. Configurable max depth (default 5) and file-size limits.
  • BatchAnalyzer — parallel directory scan with aggregate format statistics.
  • PayloadCarver, StringsExtractor, EntropyHeatmap — standalone helpers.
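
As an illustration of the CUSUM idea behind Entropy Mapping, the Python sketch below finds a single change point in a per-chunk entropy profile. It is deliberately reduced: no binary segmentation, no KL-divergence or chi-square validation, no edge sharpening. `cusum_change_point` and `min_shift` are hypothetical names.

```python
from typing import Optional, Sequence


def cusum_change_point(profile: Sequence[float], min_shift: float = 0.5) -> Optional[int]:
    """Return the index of the first chunk after the most likely change, or None."""
    if len(profile) < 2:
        return None
    mean = sum(profile) / len(profile)
    running = 0.0
    best_index = None
    best_magnitude = 0.0
    # The cumulative sum of deviations from the mean peaks (in absolute value)
    # at the last chunk before the level shifts.
    for index, value in enumerate(profile[:-1]):
        running += value - mean
        if abs(running) > best_magnitude:
            best_index, best_magnitude = index, abs(running)
    if best_index is None or best_magnitude < min_shift:
        return None
    return best_index + 1


# Ten low-entropy chunks (e.g. a header) followed by ten high-entropy chunks.
print(cusum_change_point([1.0] * 10 + [7.9] * 10))
```

Applying the same function recursively to the segments on either side of the returned index is the usual way to extend this to multiple boundaries.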

Detection pipeline. Magic bytes → parallel trial decompression (early-termination on low-entropy output) → extension fallback → deep probe (header parse + structural validation + integrity check).
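
The first stage of that pipeline, magic-byte matching, can be sketched as follows in Python. The signature table holds a handful of well-known magics only, and the scan is a naive per-signature search, not the hash-indexed scanner the library describes; all names are hypothetical.

```python
# A few widely documented magic numbers; the real registry covers every format.
SIGNATURES = {
    "gzip": b"\x1f\x8b",
    "bzip2": b"BZh",
    "xz": b"\xfd7zXZ\x00",
    "zip": b"PK\x03\x04",
    "7z": b"7z\xbc\xaf\x27\x1c",
}


def scan_signatures(data: bytes) -> list[tuple[int, str]]:
    """Return (offset, format name) for every signature occurrence, sorted by offset."""
    hits = []
    for name, magic in SIGNATURES.items():
        offset = data.find(magic)
        while offset != -1:
            hits.append((offset, name))
            offset = data.find(magic, offset + 1)
    return sorted(hits)


blob = b"\x00" * 10 + b"PK\x03\x04" + b"junk" + b"\x1f\x8b\x08"
print(scan_signatures(blob))
```

Short magics such as the two-byte gzip marker produce false positives in random data, which is one reason later pipeline stages validate the candidate before accepting it.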

Partition table support. MbrParser (four primaries at 0x1BE + extended/logical chain) + GptParser (EFI PART at LBA 1) + PartitionTypeDatabase (type-byte / GUID → filesystem name). Recursive descent via --recursive: disk image → partition table → filesystem → archive chain.
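
For readers unfamiliar with the MBR layout, this Python sketch reads the four primary entries at 0x1BE. It covers the classic 16-byte entry format only and ignores the extended/logical chain; `parse_mbr` is a hypothetical name and not the MbrParser API.

```python
import struct


def parse_mbr(sector: bytes) -> list[tuple[int, int, int, int]]:
    """Return (slot, type byte, first LBA, sector count) for each used primary entry."""
    if len(sector) < 512 or sector[510:512] != b"\x55\xAA":
        raise ValueError("not an MBR sector: missing 0x55 0xAA signature")
    partitions = []
    for slot in range(4):
        entry = sector[0x1BE + 16 * slot: 0x1BE + 16 * (slot + 1)]
        # Layout: status (1), CHS start (3), type (1), CHS end (3),
        # first LBA (4, little-endian), sector count (4, little-endian).
        _status, type_byte, first_lba, sector_count = struct.unpack("<B3xB3xII", entry)
        if type_byte != 0:  # type 0x00 marks an unused slot
            partitions.append((slot, type_byte, first_lba, sector_count))
    return partitions
```

The type byte is what a lookup such as PartitionTypeDatabase would translate into a filesystem name (0x0C, for instance, is FAT32 with LBA addressing).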

Reverse Engineering

Two complementary flows for reverse-engineering unknown compression tools and file formats.

Black-box tool probing runs the target tool with ~40 controlled probe inputs (empty, single byte, incrementing patterns, text, random data, various sizes 0–64 KB), cross-correlates all outputs, and reports: magic bytes, size-field offsets (LE/BE, 2/4/8 byte), the compression algorithm (trial decompression against all 49 building blocks), filename storage (UTF-8 / UTF-16), determinism, payload entropy.

cwb reverse-engineer MyTool.exe "{input} {output}"
cwb reverse-engineer packer.exe "--pack {input} --out {output}" --timeout 10000
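
One of the reported items, size-field offsets, comes down to searching the tool's output for the known input length in each width and byte order. A Python sketch of that single step, with a hypothetical function name:

```python
import struct


def find_size_fields(output: bytes, input_size: int) -> list[tuple[int, int, str]]:
    """Return (offset, width in bytes, endianness) where input_size appears in output."""
    hits = []
    for width, code in ((2, "H"), (4, "I"), (8, "Q")):
        if input_size >= 1 << (8 * width):
            continue  # value does not fit in this field width
        for endian, prefix in (("LE", "<"), ("BE", ">")):
            needle = struct.pack(prefix + code, input_size)
            offset = output.find(needle)
            while offset != -1:
                hits.append((offset, width, endian))
                offset = output.find(needle, offset + 1)
    return sorted(hits)


# A made-up output: 4-byte magic, LE32 original size, then payload bytes.
sample = b"MAGC" + struct.pack("<I", 1000) + b"payload"
print(find_size_fields(sample, 1000))
```

A single probe yields ambiguous hits (here the 2-byte and 4-byte readings overlap at the same offset), so candidates are only meaningful once they recur at the same offset across probes of different sizes.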

The GUI offers the same via Tools → Reverse Engineer Format as a step-by-step wizard with progress reporting.

Static analysis mode works when you have archive files with known original content but no tool to run. StaticFormatAnalyzer accepts pairs of (original, archived) and locates where the content appears inside the archive — verbatim or compressed with any known building block — then infers header/footer structure, size fields, and compression algorithm without ever executing an external tool.

Compression.Shell

Explorer context-menu integration. Right-click any file to invoke cwb commands directly: list, extract, test, optimize.

Compression.Sfx.Cli / Compression.Sfx.Ui

Self-extracting archive stubs for console and GUI use. The stub is a normal cwb-style reader prepended to an archive overlay; running the resulting exe extracts in place. Used for single-file distributions via Costura.Fody.


External tool validation

The test suite includes three tiers of external validation beyond the standard self-round-trip tests.

Self round-trip. All formats that support both create and extract are tested by creating an archive, extracting it, and verifying the output matches the original. Runs as part of the normal dotnet test.

External tool interop (Category=EndToEnd). Verifies our output is readable by external tools and vice versa. Dynamic tool discovery via PATH and common install locations; gracefully skips when tools are unavailable. Covered: 7z, gzip, bzip2, xz, zstd, lz4, tar. Both directions are tested: create with our library → read with external tool, and vice versa.

dotnet test --filter "Category=EndToEnd"

.NET BCL interop. Verifies interoperability with System.IO.Compression (GZipStream, DeflateStream, BrotliStream, ZipArchive).

OS integration (Category=OsIntegration). Platform-specific tooling:

  • Windows — PowerShell Compress-Archive / Expand-Archive, Windows tar, certutil, Mount-DiskImage, DISM
  • Linux — mtools (FAT), genisoimage (ISO), qemu-img (virtual disks), debugfs (ext4), cpio

Platform detection + Assert.Ignore means tests never fail due to missing prerequisites.

dotnet test --filter "Category=OsIntegration"

Filesystem validation matrix. Compression.Tests/ExternalFsInteropTests.cs wires 18 filesystem-image tests against the tools below:

Tool Present? Validates
7-Zip (portable) Bundled NTFS, FAT, exFAT, ext, HFS, HFS+, ISO 9660, UDF, SquashFS, CramFS (list/extract)
qemu-img Optional — install from https://qemu.weilnetz.de/w64/ VHD, VMDK, QCOW2, VDI (info + check)
DISM Windows built-in WIM, VHD, ISO
chkdsk Windows built-in (admin + mounted volume) FAT, exFAT, NTFS
mtools Optional — install from Cygwin FAT (non-admin)
WSL + mkfs.* / fsck.* Optional — wsl --install as admin + reboot ext / XFS / Btrfs / F2FS / JFS / ReiserFS / UDF / UFS
DOSBox-X + MS-DOS 6.0/6.2 Opt-in — set CWB_MSDOS_DBLSPACE_BOOT_IMG DBLSPACE CVF (DBLSPACE /CHKDSK D:) — see Compression.Tests/Support/MsDosImageStaging.md
DOSBox-X + MS-DOS 6.22 Opt-in — set CWB_MSDOS_DRVSPACE_BOOT_IMG DRVSPACE CVF (DRVSPACE /CHKDSK D:) — see Compression.Tests/Support/MsDosImageStaging.md
DOSBox-X + FreeDOS LiveCD Auto (hash-pinned download) FAT (CHKDSK D: from FreeDOS) — gate is [Explicit] because the LiveCD welcome screen races the autoexec

Tests skip cleanly when the tool is missing; they never fail the suite on a tool-deficient machine.


Architecture

Principles:

  1. No external compression code. Every algorithm is implemented from scratch in C#.
  2. Composable primitives. Compression.Core provides the building blocks; FileFormat.* / FileSystem.* projects compose them. Compression.Core never implements format interfaces — it is pure algorithm.
  3. Stream-oriented. All compression / decompression operates on System.IO.Stream.
  4. Immutable headers. File-format header structures are immutable record types.
  5. Testability. Every component is independently testable; NUnit tests cover primitives, format round-trips, and external interop.
  6. .NET 10 / C# 14. Latest language features, nullable reference types, warnings-as-errors.

Registry. The source generator (Compression.Registry.Generator) emits a RegisterFormats() method listing every IFormatDescriptor and a FormatDetector.Format enum with one entry per format — zero reflection, zero hand-maintained lists. The same mechanism discovers IBuildingBlock implementations in Compression.Core.


Building

dotnet build CompressionWorkbench.slnx

Testing

dotnet test

References to learn from
