A fully clean-room C# implementation of compression primitives, archive file formats, and analysis tools. Every algorithm is implemented from scratch using no external compression source code - only our own primitives.
CompressionWorkbench exists to answer two kinds of questions about compressed and packaged data, entirely in managed .NET with no native dependency on zlib, liblzma, libarchive, or any other third-party compression library:
- "What is this, and what is inside?" — given an arbitrary blob of bytes, identify the format, slice it into its logical payloads, and recover the original data.
- "How does the algorithm work, and how does it compare?" — provide a reference implementation of every major compression primitive, from LZ77 through arithmetic coding to modern neural / context-mixing compressors, so the algorithms can be read, benchmarked, and taught from a single codebase.
Concretely that means:
- Clean-room, from-scratch C#. Every primitive — bit I/O, Huffman, range coding, LZ family, BWT/MTF, PPM, context mixing, modern ANS/FSE — is written from the original specification or from a clean reverse of the reference algorithm. No line of native compression code is linked in or ported.
- Every common container, read and written wherever a spec exists to write against honestly. When the writer cannot match an external spec (proprietary element streams, missing on-disk structures), that is documented in the support tables instead of shipping a silent toy.
- Every multi-payload container treated as an archive. The distinction that matters to a user is "can I list and extract the N things inside?", not "is this called ZIP". That makes PE resource DLLs, multi-page TIFFs, font collections, multi-frame GIFs, PSD layer stacks, and MPEG transport streams all first-class archives — see Archives and Pseudo-archives below.
- Analysis as a first-class surface. Identification, entropy mapping, trial decompression, chain reconstruction, signature scanning, and cross-validation against external tools are exposed through a library (`Compression.Analysis`), a CLI (`cwb`), and a UI visualiser, not as an afterthought.
- Benchmarking at the primitive level. The benchmark compares the building blocks (raw algorithms without container overhead), so ratio/speed numbers reflect the algorithm, not the envelope.
- One library, many surfaces. CLI archiver (`cwb`), UI browser + analyser, Explorer shell integration, self-extracting stubs (`Compression.Sfx.*`), and a library any .NET consumer can link.
Any format that packages N discrete, separately-addressable payloads is an archive.
A format earns archive treatment (the IArchiveFormatOperations contract: List / Extract / optional Create) whenever all of the following hold:
- Its binary layout contains a directory or index of named or indexed entries.
- Each entry can be extracted as an independent blob.
- A consumer might plausibly want one entry without the others.
This is true regardless of whether the entries happen to be files, images, pages, frames, tracks, layers, tables, fonts, strings, or other domain objects. The contents of an extracted blob remain domain-specific (a TIFF page is still a TIFF, an RT_ICON resource is still an icon), but that is a property of the payload, not of the container.
Formats in the canonical archive sense — ZIP, TAR, 7z, RAR, CAB, CPIO, and their relatives. They were designed as "a bag of files with a directory". These are covered in the Archive Formats, ZIP-Derived Containers, OLE2 Compound File Variants, Compound Formats, and Modern Packaging tables.
Formats that are archives by structure but have never been presented that way in ordinary file managers. CompressionWorkbench slices each one along its natural payload boundary and exposes the same List / Extract surface as ZIP.
| Container | Entries become | Where shipped |
|---|---|---|
| PE resource DLLs/EXEs | one entry per resource: RT_GROUP_ICON → .ico, RT_BITMAP → .bmp, RT_MANIFEST → .xml, RT_STRING → .txt, RT_VERSION → .rcv, raw RT_RCDATA | FileFormat.PeResources, FileFormat.ResourceDll |
| ICO / CUR / ANI | one entry per ICONDIRENTRY → .png / .bmp (cursor adds hotspot) | FileFormat.Ico, FileFormat.PngCrushAdapters.Ani |
| Multi-page TIFF / BigTIFF | one single-page .tif per IFD | FileFormat.PngCrushAdapters.Tiff / BigTiff |
| Multi-frame GIF / MNG / FLI / DCX | one .gif / .png per frame | FileFormat.Gif, PngCrushAdapters.{Mng,Fli,Dcx} |
| Animated PNG (APNG) | one .png per frame with dispose/blend applied against previous frames | FileFormat.PngCrushAdapters.Apng |
| Icon containers (ICNS, MPO) | Apple icon suite / stereoscopic JPEG pair | FileFormat.PngCrushAdapters.{Icns,Mpo} |
| Font collections (TTC / OTC) | one .ttf / .otf per member font | FileFormat.FontCollection |
| Single-font (TTF / OTF) | per-glyph entries (cmap + glyf slicing; CFF/OpenType passes through) | FileFormat.FontCollection.Ttf |
| Gettext MO / PO | one .txt per msgid/msgstr pair | FileFormat.Gettext |
| WAV / FLAC / MP3 | full file + per-channel WAV + ID3v2/RIFF metadata + APIC cover art | FileFormat.Wav, FileFormat.Flac, FileFormat.Mp3 |
| Ogg | per-logical-stream packets + Vorbis/Opus comments | FileFormat.Ogg |
| MP4 / MOV / MKV / WebM | demuxed tracks (H.264 → Annex-B), attachments, chapters | FileFormat.Mp4, FileFormat.Matroska |
| MPEG Transport Stream | per-PID elementary streams (video/audio/data) | FileFormat.MpegTs |
| Blu-ray PGS (SUP) | subtitle segments grouped by epoch | FileFormat.Sup |
| VobSub (DVD) | .idx metadata + per-entry slices of the sibling .sub PES stream | FileFormat.VobSub |
| HLS M3U8 | segment list with per-variant metadata | FileFormat.M3u8 |
| U-Boot uImage, FDT/DTB, UEFI FV | firmware header metadata + decompressed payload or per-FFS/property entries | FileFormat.UImage, FileFormat.Dtb, FileFormat.UefiFv |
| Device executable packers | the packer's metadata.ini (detection evidence) + packed_payload.bin (or in-process decompressed body for UPX) | FileFormat.ExePackers |
Formats that cannot produce multiple addressable entries stay in FormatCategory.Stream rather than falsely advertising themselves as archives. IArchiveFormatOperations.List is free to return a single "whole payload" entry for stream-style containers (and does, for formats like PAQ8 or the audio-stream-as-archive descriptors), but a format that would have to fake an index has no business claiming SupportsMultipleEntries.
The solution uses the .slnx XML format. All projects sit at the repository root — no src/ or tests/ directories. Solution folders in the IDE group projects logically; the filesystem stays flat so git log --follow works on every file.
CompressionWorkbench.slnx
|
+-- Compression.Core Primitives, building blocks, SIMD, partition parsers
+-- Compression.Registry Interfaces (IFormatDescriptor, IBuildingBlock) + registries
+-- Compression.Registry.Generator Roslyn source generator for auto-discovery
+-- Compression.Lib Umbrella library: detection, archive ops, SFX hosting
+-- Compression.Analysis Binary analysis engine (signatures, entropy, trial decomp)
+-- Compression.CLI `cwb` command-line tool (System.CommandLine v3)
+-- Compression.UI WPF browser + analyser + heatmap + wizard
+-- Compression.Shell Explorer context-menu integration
+-- Compression.Sfx.Cli Self-extracting archive stub (console)
+-- Compression.Sfx.Ui Self-extracting archive stub (GUI)
+-- Compression.Tests NUnit test project (tests)
|
+-- FileFormat.* One project per archive / stream / pseudo-archive / packer
+-- FileSystem.* One project per filesystem image format
+-- Codec.* Standalone audio codecs (PCM/FLAC/A-law/µ-law/GSM/ADPCM/MIDI/MP3/Vorbis/Opus/AAC)
Adding a new format is a three-step process:
- Create a `FileFormat.<Name>/` or `FileSystem.<Name>/` project with a class implementing `IFormatDescriptor` (and `IStreamFormatOperations` or `IArchiveFormatOperations` as appropriate).
- Add a `<ProjectReference>` from `Compression.Lib.csproj`.
- Add the project to `CompressionWorkbench.slnx`.
The Roslyn source generator (Compression.Registry.Generator) discovers every implementation at compile time and emits the registration table. No reflection, no hand-maintained switch statements, no init hooks.
| Concern | Choice |
|---|---|
| Language | C# 14 / .NET 10 |
| Solution | .slnx (XML solution format) |
| Testing | NUnit |
| GUI | WPF |
| CLI | System.CommandLine v3 |
| Discovery | Roslyn source generator (zero-reflection format/block registration) |
| Bundling | Costura.Fody single-file embedding for CLI/UI/SFX |
| Level | Meaning |
|---|---|
| Unsupported | No descriptor exists. |
| Read-only | Can list and extract; no creation. |
| WORM | Write-Once-Read-Many — can produce a fresh archive/image, cannot modify one in place. |
| R/W | Can also add/replace/remove entries in an existing archive in place (no formats yet). |
In tables below, Yes = WORM (or better), - = Read-only. A Reference column links to the authoritative spec or reverse-engineering document the implementation was validated against.
The raw algorithm primitives registered via IBuildingBlock. They operate on ReadOnlySpan<byte> without any container framing — this is the surface the benchmark tool compares. Building blocks live in Compression.Core; they are never wrapped as FileFormat.* projects.
| Id | Name | Family | Description | Reference |
|---|---|---|---|---|
| BB_Deflate | DEFLATE | Dictionary | LZ77 + Huffman, the algorithm inside gzip/zip/png | RFC 1951 |
| BB_Deflate64 | Deflate64 | Dictionary | Enhanced DEFLATE with 64 KB window and extended codes | MS-ZIP spec |
| BB_Lz77 | LZ77 | Dictionary | Sliding-window dictionary with distance/length tokens | Ziv & Lempel 1977 paper |
| BB_Lz78 | LZ78 | Dictionary | Builds phrase dictionary from input, predecessor to LZW | Ziv & Lempel 1978 paper |
| BB_Lzw | LZW | Dictionary | Lempel-Ziv-Welch dictionary coding, used in GIF and Unix compress | Welch 1984 paper |
| BB_Lzo | LZO1X | Dictionary | Extremely fast dictionary compression optimised for decompression speed | oberhumer.com |
| BB_Lzss | LZSS | Dictionary | LZ77 variant with flag-bit encoding | Storer & Szymanski 1982 |
| BB_Lz4 | LZ4 | Dictionary | Extremely fast LZ77-family block compression | LZ4 block format |
| BB_Snappy | Snappy | Dictionary | Fast LZ77-family compression (Google) | Snappy format |
| BB_Brotli | Brotli | Dictionary | Modern LZ77 + Huffman with static dictionary (Google) | RFC 7932 |
| BB_Lzma | LZMA | Dictionary | Lempel-Ziv-Markov chain with range coding | 7-Zip LZMA SDK |
| BB_Lzx | LZX | Dictionary | LZ77 + Huffman used in CAB/CHM/WIM | MS-PATCH LZX spec |
| BB_Xpress | XPRESS Huffman | Dictionary | Windows XPRESS (NTFS/WIM/Hyper-V) | MS-XCA spec |
| BB_Lzh | LZH (LH5) | Dictionary | Lempel-Ziv with adaptive Huffman, used in LHA | LZH format doc |
| BB_Arj | ARJ | Dictionary | Modified LZ77 + Huffman used in ARJ archives | ARJ technical info |
| BB_Lzms | LZMS | Dictionary | LZ + Markov + Shannon with delta matching (Windows WIM/ESD) | MS-XCA LZMS |
| BB_Lzp | LZP | Dictionary | Lempel-Ziv Prediction, context-based match prediction | Bloom 1996 |
| BB_Ace | ACE | Dictionary | LZ77 + Huffman from ACE archive format | unace-nonfree |
| BB_Rar | RAR5 | Dictionary | LZ + Huffman + PPM from RAR5 | rarlab technote |
| BB_Sqx | SQX | Dictionary | LZ + Huffman from the SQX archive format | SqxFormat notes |
| BB_ROLZ | ROLZ | Dictionary | Reduced-Offset LZ with context-based match tables | encode.su discussion |
| BB_PPM | PPM | Context Mixing | Prediction by Partial Matching, order-2 context modelling | Cleary & Witten 1984 |
| BB_CTW | CTW | Context Mixing | Context Tree Weighting — optimal universal compression | Willems 1995 paper |
| BB_LZHAM | LZHAM | Dictionary | LZ77 + Huffman, inspired by Valve's LZHAM codec | LZHAM repo |
| BB_Lzs | LZS | Dictionary | Stac LZS (7/11-bit offset LZSS for networking) | RFC 1967 / RFC 2395 |
| BB_Lzwl | LZWL | Dictionary | LZW with variable-length initial alphabet from digram analysis | LZWL paper |
| BB_RePair | Re-Pair | Dictionary | Recursive Pairing, offline grammar-based compression | Larsson & Moffat 1999 |
| BB_842 | 842 | Dictionary | IBM 842 hardware compression with 2/4/8-byte template matching | Linux crypto/842* |
| BB_Huffman | Huffman | Entropy | Optimal prefix-free entropy coding using symbol frequencies | Huffman 1952 |
| BB_Arithmetic | Arithmetic | Entropy | Order-0 arithmetic coding with frequency table | Witten, Neal & Cleary 1987 |
| BB_ShannonFano | Shannon-Fano | Entropy | Historical predecessor to Huffman, recursive frequency splitting | Shannon 1948 |
| BB_Golomb | Golomb/Rice | Entropy | Optimal coding for geometric distributions | Golomb 1966 |
| BB_Fibonacci | Fibonacci coding | Entropy | Universal code using Zeckendorf representation with `11` terminators | Apostolico & Fraenkel 1987 |
| BB_FSE | FSE/tANS | Entropy | Table-based Asymmetric Numeral Systems, used in Zstd | Duda 2013 paper / Collet's blog |
| BB_BPE | Byte Pair Encoding | Entropy | Iterative most-frequent pair replacement | Gage 1994 |
| BB_RangeCoding | Range coding | Entropy | Byte-oriented arithmetic coding variant with carryless normalisation | Martin 1979 |
| BB_rANS | rANS | Entropy | Range ANS coder, used in AV1 and LZFSE | Duda 2013 paper |
| BB_ExpGolomb | Exp-Golomb | Entropy | Exponential Golomb, used in H.264/H.265 | Teuhola 1978 |
| BB_Unary | Unary | Entropy | Simplest universal code: N ones followed by a zero | — |
| BB_EliasGamma | Elias gamma | Entropy | Universal code using unary length prefix | Elias 1975 |
| BB_EliasDelta | Elias delta | Entropy | Gamma-codes the bit length | Elias 1975 |
| BB_Levenshtein | Levenshtein coding | Entropy | Self-delimiting universal code with recursive length prefixing | Levenshtein 1968 |
| BB_Tunstall | Tunstall coding | Entropy | Variable-to-fixed code, dual of Huffman | Tunstall 1967 (PhD thesis) |
| BB_Dmc | DMC | Entropy | Dynamic Markov Compression, bit-level FSM with state cloning | Cormack & Horspool 1987 |
| BB_Bwt | BWT | Transform | Burrows-Wheeler Transform, reorders bytes for better compression | Burrows & Wheeler 1994 |
| BB_Mtf | MTF | Transform | Move-to-Front Transform | Bentley et al. 1986 |
| BB_Delta | Delta | Transform | Delta filter, stores differences between consecutive bytes | — |
| BB_Rle | RLE | Transform | Run-Length Encoding | — |
| BB_Dpcm | DPCM | Transform | Differential PCM, stores sample-to-sample differences | — |
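
To make one of the universal codes in the table concrete, here is a minimal sketch of Elias gamma coding ("unary length prefix" followed by the value in binary). It is not the project's `BB_EliasGamma` implementation: the function names `EliasGammaEncode` / `EliasGammaDecode` are hypothetical, and bits are represented as a `'0'`/`'1'` string purely for readability, where a real coder would pack them into bytes.

```csharp
using System;
using System.Text;

// Elias gamma: for n >= 1, emit floor(log2 n) zero bits, then n in binary
// (which always starts with a 1 bit, so the decoder knows where the prefix ends).
static string EliasGammaEncode(uint n)
{
    if (n == 0) throw new ArgumentOutOfRangeException(nameof(n), "Elias gamma encodes n >= 1.");
    int bits = 32 - System.Numerics.BitOperations.LeadingZeroCount(n); // binary length of n
    var sb = new StringBuilder();
    sb.Append('0', bits - 1);                      // length prefix: (bits - 1) zeros
    for (int i = bits - 1; i >= 0; i--)            // n itself, most significant bit first
        sb.Append(((n >> i) & 1) == 1 ? '1' : '0');
    return sb.ToString();
}

// Decodes one value starting at code[pos] and advances pos past it.
static uint EliasGammaDecode(string code, ref int pos)
{
    int zeros = 0;
    while (code[pos] == '0') { zeros++; pos++; }   // count the zero prefix
    uint value = 0;
    for (int i = 0; i <= zeros; i++)               // then read zeros + 1 value bits
        value = (value << 1) | (uint)(code[pos++] - '0');
    return value;
}

Console.WriteLine(EliasGammaEncode(1));  // 1
Console.WriteLine(EliasGammaEncode(5));  // 00101
Console.WriteLine(EliasGammaEncode(9));  // 0001001
```

Because every codeword is self-delimiting, codewords can be concatenated without separators and decoded back in order by calling `EliasGammaDecode` repeatedly with the same `pos`.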
The benchmark command (cwb benchmark <file> or the UI's Benchmark Tool) runs every building block over the supplied data, records ratio + compress/decompress times, and ranks the results.
| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| ZIP | .zip | Yes | Yes | APPNOTE.TXT | Store, Deflate, Deflate64, Shrink, Reduce, Implode, BZip2, LZMA, PPMd, Zstd, AES |
| RAR | .rar | Yes | Yes (v4/v5) | rarlab technote | v1-v5 decoders, solid, multi-volume, encryption, recovery |
| 7z | .7z | Yes | Yes | 7-Zip format | LZMA/LZMA2, Deflate, BZip2, PPMd, BCJ/BCJ2, AES-256, multi-volume |
| TAR | .tar | Yes | Yes | POSIX ustar | POSIX/GNU/PAX, multi-volume |
| CAB | .cab | Yes | Yes | MS-CAB | MSZIP, LZX, Quantum |
| LZH/LHA | .lzh,.lha | Yes | Yes | LHA archive format | lh0-lh7, lzs, lh1-lh3 (adaptive Huffman), pm0-pm2 |
| ARJ | .arj | Yes | Yes | ARJ technical | Methods 0-4, garble encryption |
| ARC | .arc | Yes | Yes | ARC format | Methods 0-9 (RLE, LZW, Squeeze, Huffman) |
| ZOO | .zoo | Yes | Yes | zoo format | LZW, LZH |
| ACE | .ace | Yes | Yes | ACE unofficial spec | ACE 1.0/2.0, solid, sound/picture filters, Blowfish, recovery |
| SQX | .sqx | Yes | Yes | SQX disassembly | LZH, multimedia, audio, solid, AES-128, recovery |
| CPIO | .cpio | Yes | Yes | cpio(5) | Binary, odc, newc, CRC |
| AR | .ar | Yes | Yes | ar(5) | Unix archive |
| WIM | .wim | Yes | Yes | Imagex WIM format | LZX, XPRESS |
| RPM | .rpm | Yes | Yes | RPM spec | CPIO payload |
| DEB | .deb | Yes | Yes | deb(5) | AR+TAR with gz/xz/zst/bz2 |
| Shar | .shar | Yes | Yes | GNU sharutils | Shell archive |
| PAK | .pak | Yes | Yes | PAK spec | ARC-compatible |
| HA | .ha | Yes | Yes | HA specification | HSC/ASC arithmetic coding |
| ZPAQ | .zpaq | Yes | Yes | ZPAQ spec PDF | Context mixing, journaling |
| StuffIt | .sit | Yes | Yes | libxad sit.c | Multiple methods |
| StuffIt X | .sitx | Yes | Yes | XADMaster StuffItX | Detection-only; WORM emits a valid StuffIt! envelope (proprietary element-stream writer not implemented) |
| SquashFS | .sqfs | Yes | Yes | SquashFS 4.0 spec | Filesystem image |
| CramFS | .cramfs | Yes | Yes | Linux fs/cramfs/ | Filesystem image |
| NSIS | .exe | Yes | Yes | NSIS wiki | Installer extraction + WORM emits overlay-only data (no PE stub) |
| Inno Setup | .exe | Yes | Yes | innounp | Installer extraction + WORM emits signature header (no PE stub) |
| DMS | .dms | Yes | Yes | xDMS source | Amiga disk archiver |
| LZX (Amiga) | .lzx | Yes | Yes | Amiga LZX format | Amiga LZX |
| Compact Pro | .cpt | Yes | Yes | XADMaster cpt.c | Classic Mac format |
| Spark | .spark | Yes | Yes | RISC OS Spark | RISC OS format |
| LBR | .lbr | Yes | Yes | CP/M LBR | CP/M format |
| UHARC | .uha | Yes | Yes | UHARC docs | LZP compression |
| WAD (Doom) | .wad | Yes | Yes | Doom Wiki WAD | Doom WAD format |
| WAD2/WAD3 | .wad | Yes | Yes | Quake Wiki WAD | Quake/Half-Life texture archive |
| XAR | .xar | Yes | Yes | XAR on-disk format | Apple .pkg (zlib TOC) |
| ALZip | .alz | Yes | Yes | ALZ format | Korean archive (Deflate) |
| VPK | .vpk | Yes | Yes | Valve VPK | Valve game archive |
| BSA/BA2 | .bsa,.ba2 | Yes | Yes | BSA format | Bethesda game archive |
| MPQ | .mpq | Yes | Yes | ZezulaMPQ docs | Blizzard; WORM v1 with stored entries, encrypted hash+block tables, self-referential (listfile) |
| GRP | .grp | Yes | Yes | BUILD Engine docs | BUILD Engine (Duke Nukem 3D) |
| HOG | .hog | Yes | Yes | Descent HOG | Descent game archive |
| BIG | .big | Yes | Yes | EA BIG format | EA Games (C&C, FIFA) |
| Godot PCK | .pck | Yes | Yes | Godot PCK spec | Godot Engine resource pack |
| WARC | .warc | Yes | Yes | ISO 28500 | Web archive; WORM emits one resource record per input file |
| NDS | .nds | Yes | Yes | GBATEK NDS | Nintendo DS ROM; WORM emits valid NitroFS (no ARM9/ARM7 boot code) |
| NSA | .nsa | Yes | Yes | NScripter docs | NScripter; WORM writes stored entries (compression type 0) |
| SAR | .sar | Yes | Yes | NScripter docs | NScripter; uncompressed variant of NSA |
| PackIt | .pit | Yes | - | XADMaster packit.c | Classic Mac format |
| DiskDoubler | .dd | Yes | Yes | XADMaster DD | Classic Mac compression; WORM stores data fork (method 0) |
| MSI | .msi | Yes | Yes | MS-CFB | OLE Compound File; WORM produces a CFB envelope (not a functional Installer DB) |
| PDF | .pdf | Yes | Yes | ISO 32000 | Image extraction + WORM via file attachments (EmbeddedFiles); any file type round-trips |
| TNEF | .tnef,.dat | Yes | Yes | MS-OXTNEF | Outlook winmail.dat |
| Split File | .001 | Yes | Yes | - | Multi-part file joining/splitting |
| FreeArc | .arc | Yes | Yes | FreeArc source | FreeArc archive |
| CHM | .chm | Yes | Yes | CHM file format | MS Compiled HTML Help. WORM stores files in section 0 (uncompressed); LZX compression available via options |
| Wrapster | - | Yes | - | XADMaster wrapster.c | MP3 wrapper archive |
| LhF | .lhf | Yes | Yes | XADMaster | Amiga LhFloppy disk (LZH-compressed tracks) |
| ZAP | .zap | Yes | Yes | XADMaster | Amiga disk archiver; WORM writes stored tracks |
| PackDisk | .pdsk | Yes | Yes | XADMaster | Amiga PackDisk; WORM writes stored tracks. Same writer covers DCS / xDisk / xMash via different magics. |
| AMPK | - | Yes | - | XADMaster | Amiga AMPK |
| IFF-CDAF | - | Yes | - | IFF spec | IFF-CDAF archive |
| UMX | .umx | Yes | Yes | Beyond Unreal wiki | Unreal package; WORM emits valid header (detection-only) |
All delegate to the ZIP reader/writer. WORM (Yes) means a fresh container can be produced with the correct internal layout.
| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| JAR | .jar | Yes | Yes | JAR spec | Java archive |
| WAR | .war | Yes | Yes | Java EE WAR | Java web archive |
| EAR | .ear | Yes | Yes | Java EE EAR | Java enterprise archive |
| APK | .apk | Yes | Yes | Android APK | Android package |
| IPA | .ipa | Yes | Yes | Apple IPA bundle | iOS package |
| APPX | .appx,.msix | Yes | Yes | MS-APPXPKG | Windows package |
| XPI | .xpi | Yes | Yes | Mozilla XPI | Firefox extension |
| CRX | .crx | Yes | Yes | Chrome CRX3 | Chrome extension; WORM emits unsigned CRX3 envelope (browser rejects signature) |
| EPUB | .epub | Yes | Yes | EPUB 3 spec | eBook |
| MAFF | .maff | Yes | Yes | MAFF spec | Mozilla Archive Format |
| KMZ | .kmz | Yes | Yes | KML spec | Google Earth |
| NuPkg | .nupkg | Yes | Yes | NuGet spec | NuGet package |
| DOCX | .docx | Yes | Yes | ECMA-376 | OOXML Word |
| XLSX | .xlsx | Yes | Yes | ECMA-376 | OOXML Excel |
| PPTX | .pptx | Yes | Yes | ECMA-376 | OOXML PowerPoint |
| ODT | .odt | Yes | Yes | OASIS ODF | OpenDocument Text |
| ODS | .ods | Yes | Yes | OASIS ODF | OpenDocument Spreadsheet |
| ODP | .odp | Yes | Yes | OASIS ODF | OpenDocument Presentation |
| CBZ | .cbz | Yes | Yes | Comic book archive | Comic book ZIP |
| CBR | .cbr | Yes | Yes | Comic book archive | Comic book RAR; delegates to RarWriter |
Microsoft binary-office formats built on the OLE2 / Compound File Binary (CFB) container. WORM creation produces a structurally-valid CFB envelope (that round-trips through our reader and other permissive CFB tools like libgsf / Apache POI) but is not a real Word/Excel/PowerPoint/Outlook document — those require generating each application's internal binary stream layout, which is out of scope. Limitations: ~6.8 MB total file size (109 FAT sectors, no DIFAT chain), single root storage, stream names ≤ 31 UTF-16 chars.
| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| DOC | .doc | Yes | Yes | MS-DOC | Word 97-2003 (CFB envelope, not a real Word document) |
| XLS | .xls | Yes | Yes | MS-XLS | Excel 97-2003 (CFB envelope, not a real workbook) |
| PPT | .ppt | Yes | Yes | MS-PPT | PowerPoint 97-2003 (CFB envelope, not a real presentation) |
| MSG | .msg | Yes | Yes | MS-OXMSG | Outlook message (CFB envelope, not real MAPI properties) |
| Thumbs.db | Thumbs.db | Yes | Yes | Forensics docs | Windows thumbnail cache (CFB envelope, not real Catalog layout) |
| MSI | .msi | Yes | Yes | MS-MSI | Windows Installer (CFB envelope, not a functional Installer DB) |
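
The "~6.8 MB" limit quoted for the CFB writer can be reproduced with a few lines of arithmetic. The sketch below assumes CFB version 3 (512-byte sectors, 4-byte FAT entries), which the text does not state explicitly; the constant names are hypothetical.

```csharp
using System;

// Assumptions: CFB version 3 with 512-byte sectors and 4-byte FAT entries.
// The file header has room for 109 FAT sector locations; going beyond that
// needs a DIFAT chain, which the writer described above does not emit.
const int SectorSize = 512;
const int FatEntrySize = 4;
const int HeaderFatSlots = 109;

int entriesPerFatSector = SectorSize / FatEntrySize;                   // 128
long addressableSectors = (long)HeaderFatSlots * entriesPerFatSector;  // 13,952
long addressableBytes = addressableSectors * SectorSize;               // 7,143,424

// The FAT sectors themselves also occupy addressable sectors, so the space
// left for directory and stream data is slightly below this ceiling.
Console.WriteLine($"{addressableBytes} bytes = {addressableBytes / (1024.0 * 1024.0):F2} MiB");
```

The result, about 6.8 MiB, matches the stated limit under these assumptions.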
Single-stream compressors. Compress/Decompress indicate the two halves of the algorithm.
| Format | Extensions | Compress | Decompress | Reference |
|---|---|---|---|---|
| Gzip | .gz | Yes | Yes | RFC 1952 |
| BZip2 | .bz2 | Yes | Yes | bzip2 source |
| XZ | .xz | Yes | Yes | XZ format |
| Zstandard | .zst | Yes | Yes | RFC 8878 |
| LZ4 | .lz4 | Yes | Yes | LZ4 frame format |
| Brotli | .br | Yes | Yes | RFC 7932 |
| Snappy | .sz,.snappy | Yes | Yes | Snappy framing |
| LZOP | .lzo | Yes | Yes | lzop source |
| compress (.Z) | .Z | Yes | Yes | ncompress |
| LZMA | .lzma | Yes | Yes | 7-Zip LZMA SDK |
| Lzip | .lz | Yes | Yes | lzip format |
| Zlib | .zlib | Yes | Yes | RFC 1950 |
| SZDD | .sz_ | Yes | Yes | compress.exe format |
| KWAJ | - | Yes | Yes | MS compress formats |
| RZIP | .rz | Yes | Yes | rzip docs |
| MacBinary | .bin | Yes | Yes | RFC 1740 |
| BinHex | .hqx | Yes | Yes | RFC 1741 |
| Squeeze | .sqz | Yes | Yes | Squeeze format |
| PowerPacker | .pp | Yes | Yes | Amiga PP20 |
| ICE Packer | .ice | Yes | Yes | Atari ST ICE |
| PackBits | .packbits | Yes | Yes | Apple PackBits |
| Yaz0 (SZS) | .yaz0,.szs | Yes | Yes | Nintendo Yaz0 RE |
| BriefLZ | .blz | Yes | Yes | BriefLZ source |
| RNC | .rnc | Yes | Yes | Rob Northen RE |
| RefPack / QFS | .qfs,.refpack | Yes | Yes | RefPack RE |
| aPLib | .aplib | Yes | Yes | aPLib docs |
| LZFSE | .lzfse | Yes | Yes | Apple LZFSE source |
| Freeze | .f,.freeze | Yes | Yes | Unix Freeze |
| uuencoding | .uu,.uue | Yes | Yes | POSIX uuencode |
| yEnc | .yenc | Yes | Yes | yEnc spec |
| Density | .density | Yes | Yes | Density source |
| LZG | .lzg | Yes | Yes | LZG source |
| BCM | .bcm | Yes | Yes | BCM source |
| BSC | .bsc | Yes | Yes | libbsc |
| BALZ | .balz | Yes | Yes | BALZ source |
| CSC | .csc | Yes | Yes | CSC source |
| Zling | .zling | Yes | Yes | libzling |
| Lizard | .lizard | Yes | Yes | Lizard source |
| QuickLZ | .quicklz | Yes | Yes | QuickLZ docs |
| cmix | .cmix | Yes | Yes | cmix source |
| MCM | .mcm | Yes | Yes | MCM source |
| PAQ8 | .paq8 | Yes | Yes | Matt Mahoney PAQ page |
| SWF | .swf | Yes | Yes | SWF 19 spec |
| CP/M Crunch | .cru | Yes | Yes | CP/M CRUNCH |
| PPMd | .pmd | Yes | Yes | Shkarin PPMd |
| LZHAM | .lzham | Yes | Yes | LZHAM source |
| LZS | .lzs | Yes | Yes | RFC 1967 / RFC 2395 |
| FLAC | .flac | Yes | Yes | FLAC format |
tar.gz, tar.bz2, tar.xz, tar.zst, tar.lz4, tar.lz, tar.br — auto-detected, both read and write.
| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| AppImage | .AppImage | Yes | - | AppImage spec | ELF stub + appended SquashFS; offset located by ELF section-end + magic scan |
| Snap | .snap | Yes | - | snapd source | SquashFS with meta/snap.yaml |
| MSIX | .msix,.msixbundle | Yes | - | MSIX spec | Modern Windows app package (mirrors APPX) |
| ESD | .esd | Yes | - | WIM/ESD overview | Windows Update encrypted-LZMS WIM; shares MSWIM\0\0\0 magic, extension-only |
| Split WIM | .swm,.swmN | Yes | - | WIM spec | Multi-part WIM volume |
| WACZ | .wacz | Yes | - | WACZ 1.0.0 | Web Archive Collection Zipped: ZIP around WARC + datapackage.json |
| Python Wheel | .whl | Yes | - | PEP 427 | ZIP with dist-info/METADATA, WHEEL, RECORD |
| Ruby Gem | .gem | Yes | - | gem spec | TAR with metadata.gz, data.tar.gz, checksums.yaml.gz |
| Rust Crate | .crate | Yes | - | cargo spec | TAR.GZ with single name-version/ directory containing Cargo.toml |
| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| U-Boot uImage | .uimg,.img,.bin | Yes | - | U-Boot image.h | 64-byte legacy header + body; reports OS/arch/comp; decompresses payload when possible |
| Device Tree Blob | .dtb,.dtbo | Yes | - | DT spec | FDT v17, walks property tree as pseudo-archive |
| Intel HEX | .hex,.ihex | Yes | - | Intel HEX spec | ASCII firmware records, decoded to flat firmware.bin + metadata |
| Motorola S-Record | .s19,.s28,.s37,.srec,.mot | Yes | - | SREC spec | 16/24/32-bit address records |
| TI-TXT | - | Yes | - | MSP430 programming | MSP430 firmware text, address blocks |
| UEFI Firmware Volume | .fv,.fd,.rom,.bin | Yes | - | UEFI PI vol.3 | _FVH at offset 40, walks FFS files |
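
As an illustration of what "64-byte legacy header" means for the uImage row, the following is a minimal sketch of a header parser. It is not the project's `FileFormat.UImage` code; `ParseUImageHeader` and the tuple field names are hypothetical. The field layout follows the legacy image header in U-Boot's `image.h`, and the header/data CRC checks are left out for brevity.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

// Parses the fixed 64-byte legacy uImage header. All multi-byte fields are big-endian.
static (uint Size, uint Load, uint Entry, byte Os, byte Arch, byte Type, byte Comp, string Name)
    ParseUImageHeader(ReadOnlySpan<byte> header)
{
    if (header.Length < 64) throw new ArgumentException("Need at least 64 bytes.");
    if (BinaryPrimitives.ReadUInt32BigEndian(header) != 0x27051956u)
        throw new FormatException("Not a legacy uImage (bad magic).");

    // Offsets: 4 header CRC, 8 timestamp, 12 data size, 16 load address,
    // 20 entry point, 24 data CRC, 28 os, 29 arch, 30 type, 31 comp, 32..63 name.
    uint size = BinaryPrimitives.ReadUInt32BigEndian(header.Slice(12));
    uint load = BinaryPrimitives.ReadUInt32BigEndian(header.Slice(16));
    uint entry = BinaryPrimitives.ReadUInt32BigEndian(header.Slice(20));

    // The image name is NUL-padded ASCII; cut it at the first NUL byte.
    ReadOnlySpan<byte> nameBytes = header.Slice(32, 32);
    int nul = nameBytes.IndexOf((byte)0);
    if (nul >= 0) nameBytes = nameBytes.Slice(0, nul);
    string name = Encoding.ASCII.GetString(nameBytes);

    return (size, load, entry, header[28], header[29], header[30], header[31], name);
}

// Build a synthetic header for demonstration (CRC fields left as zero).
byte[] demo = new byte[64];
BinaryPrimitives.WriteUInt32BigEndian(demo, 0x27051956u);
BinaryPrimitives.WriteUInt32BigEndian(demo.AsSpan(12), 1024);
demo[31] = 1; // compression field: 1 means gzip in U-Boot's image.h
Encoding.ASCII.GetBytes("demo-kernel").CopyTo(demo, 32);

var info = ParseUImageHeader(demo);
Console.WriteLine($"{info.Name}: {info.Size} bytes, comp={info.Comp}");
```

A production reader would also verify both CRC32 fields before trusting the size and compression values.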
| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| VHDX | .vhdx | Yes | - | MS VHDX spec | Hyper-V modern; surfaces File Type ID + 2 headers + 2 region tables (BAT walk deferred) |
| EWF/E01 | .e01,.ewf,.l01 | Yes | - | libewf docs | EnCase forensic image; section-chain walker, header2 + MD5/SHA1 |
| G64 | .g64 | Yes | - | VICE G64 docs | Commodore GCR-encoded track dump (1541) |
| NIB | .nib | Yes | - | nibtools docs | Commodore raw nibble track dump |
| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| NumPy NPY | .npy | Yes | - | NEP 1 / npy-format | Single ndarray header + raw bytes |
| NumPy NPZ | .npz | Yes | - | savez docs | ZIP of NPYs |
| NIfTI-1/2 | .nii,.nii.gz | Yes | - | NIfTI spec | Medical imaging (MRI); 352-byte v1 / 540-byte v2 header + voxel data; transparent gzip |
| HDF4 | .hdf,.hdf4,.h4 | Yes | - | HDF4 reference | DD linked-list walker, tag histogram + per-DD entry |
| ONNX | .onnx | Yes | - | ONNX proto | Pure-C# protobuf reader; surfaces graph initializers as entries |
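
For the NPY row, "single ndarray header + raw bytes" can be sketched as follows. This is an illustrative reader written against the published NPY layout (magic, version, little-endian header length, text dictionary), not the project's descriptor; `ReadNpyHeader` and its tuple fields are hypothetical names.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.Text;

// Reads the NPY preamble and returns the header dictionary text plus the
// offset at which the raw array bytes begin.
static (int Major, int Minor, string Dict, int DataOffset) ReadNpyHeader(ReadOnlySpan<byte> npy)
{
    ReadOnlySpan<byte> magic = new byte[] { 0x93, (byte)'N', (byte)'U', (byte)'M', (byte)'P', (byte)'Y' };
    if (npy.Length < 8 || !npy.Slice(0, 6).SequenceEqual(magic))
        throw new FormatException("Not an NPY file.");

    int major = npy[6], minor = npy[7];
    int lenSize = major == 1 ? 2 : 4;   // v1 stores a 16-bit length, later versions 32-bit
    if (npy.Length < 8 + lenSize) throw new FormatException("Truncated NPY preamble.");

    long headerLen = lenSize == 2
        ? BinaryPrimitives.ReadUInt16LittleEndian(npy.Slice(8))
        : BinaryPrimitives.ReadUInt32LittleEndian(npy.Slice(8));
    if (headerLen > npy.Length - 8 - lenSize) throw new FormatException("Truncated NPY header.");

    // UTF-8 decoding also covers the ASCII-only headers of versions 1 and 2.
    string dict = Encoding.UTF8.GetString(npy.Slice(8 + lenSize, (int)headerLen)).TrimEnd();
    return (major, minor, dict, 8 + lenSize + (int)headerLen);
}

// Synthetic v1.0 file: real writers pad the header for alignment, this demo does not.
byte[] dictBytes = Encoding.ASCII.GetBytes("{'descr': '<f4', 'fortran_order': False, 'shape': (2,), }\n");
var file = new List<byte> { 0x93, (byte)'N', (byte)'U', (byte)'M', (byte)'P', (byte)'Y', 1, 0 };
file.Add((byte)(dictBytes.Length & 0xFF));
file.Add((byte)(dictBytes.Length >> 8));
file.AddRange(dictBytes);
file.AddRange(new byte[8]);   // two float32 zeros as the array payload

var h = ReadNpyHeader(file.ToArray());
Console.WriteLine($"v{h.Major}.{h.Minor}, data at byte {h.DataOffset}: {h.Dict}");
```

Everything from `DataOffset` to the end of the file is the array payload, whose element type and shape are described by the dictionary text.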
| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| STL | .stl | Yes | - | STL spec | ASCII + binary; triangle count, bounding box, name |
| PLY | .ply | Yes | - | Stanford PLY | ASCII / binary LE/BE, element schema |
| DXF | .dxf | Yes | - | Autodesk DXF ref | AutoCAD ASCII; section list + entity histogram |
| Collada | .dae | Yes | - | Khronos Collada 1.5 | XML 3D interchange |
| 3DS | .3ds | Yes | - | lib3ds docs | Autodesk binary chunks |
| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| DICOM | .dcm | Yes | - | NEMA DICOM PS3 | Single DICOM image |
| DICOMDIR | .dcmdir,DICOMDIR | Yes | - | DICOM PS3.10 | Multi-study patient/series index referencing sibling DICOM files |
| Format | Extensions | Read | Write | Reference | Notes |
|---|---|---|---|---|---|
| SUP (PGS) | .sup | Yes | - | PGS RE doc | Blu-ray Presentation Graphic Stream subtitle segments, grouped by epoch |
| VobSub | .idx + .sub | Yes | - | MPlayer vobsub | DVD subtitle pair; parses .idx palette/timestamps + slices sibling .sub PES |
| HLS M3U8 | .m3u8,.m3u | Yes | - | RFC 8216 | HTTP Live Streaming manifest |
| MPEG-TS | .ts,.m2ts,.mts | Yes | - | ITU-T H.222.0 | MPEG-2 Transport Stream demuxed into per-PID elementary streams |
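
The first step of "demuxed into per-PID elementary streams" is grouping the 188-byte transport packets by their 13-bit PID. The sketch below shows only that grouping step (no adaptation-field handling or PES reassembly) and is not the project's `FileFormat.MpegTs` code; `CountPacketsPerPid` and `MakePacket` are hypothetical helper names.

```csharp
using System;
using System.Collections.Generic;

// Counts packets per PID in a buffer made of whole 188-byte transport packets.
static Dictionary<int, int> CountPacketsPerPid(ReadOnlySpan<byte> ts)
{
    const int PacketSize = 188;
    var counts = new Dictionary<int, int>();
    for (int off = 0; off + PacketSize <= ts.Length; off += PacketSize)
    {
        if (ts[off] != 0x47) throw new FormatException($"Lost sync at offset {off}.");
        // PID: low 5 bits of byte 1 followed by all 8 bits of byte 2.
        int pid = ((ts[off + 1] & 0x1F) << 8) | ts[off + 2];
        counts[pid] = counts.TryGetValue(pid, out int c) ? c + 1 : 1;
    }
    return counts;
}

// Builds an otherwise empty packet carrying the given PID (demo data only).
static byte[] MakePacket(int pid)
{
    var p = new byte[188];
    p[0] = 0x47;                        // sync byte
    p[1] = (byte)((pid >> 8) & 0x1F);
    p[2] = (byte)(pid & 0xFF);
    p[3] = 0x10;                        // payload only, continuity counter 0
    return p;
}

var stream = new List<byte>();
foreach (int demoPid in new[] { 0x000, 0x100, 0x100 })
    stream.AddRange(MakePacket(demoPid));

foreach (var (id, n) in CountPacketsPerPid(stream.ToArray()))
    Console.WriteLine($"PID 0x{id:X4}: {n} packet(s)");
```

A full demuxer would additionally skip the adaptation field when present and concatenate each PID's payload bytes into its elementary stream.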
Standalone audio codecs live under Codec.* projects (separate from container-format descriptors). Each exposes a static Decompress(Stream input, Stream output) producing interleaved little-endian PCM and a ReadStreamInfo for metadata-only access. Encoders are explicitly out of scope for the new codecs — only the legacy ones ship encoders.
| Codec | Project | Encoder | Decoder state | Reference |
|---|---|---|---|---|
| PCM | Codec.Pcm | Yes | Production: raw integer PCM up to 32-bit | - |
| FLAC | Codec.Flac | Yes | Production: FIXED + LPC subframes, all sample rates / bit depths | xiph.org/flac |
| A-law | Codec.ALaw | Yes | Production: G.711 | ITU-T G.711 |
| μ-law | Codec.MuLaw | Yes | Production: G.711 | ITU-T G.711 |
| GSM 06.10 | Codec.Gsm610 | Yes | Production: full RPE-LTP | ETSI GSM 06.10 |
| IMA ADPCM | Codec.ImaAdpcm | Yes | Production: Microsoft + Apple variants | IMA ADPCM spec |
| MS ADPCM | Codec.MsAdpcm | Yes | Production: WAV format 0x0002 | MS ADPCM spec |
| MIDI | Codec.Midi | Yes | Production: SMF 0/1/2 with all standard meta + channel events | MIDI 1.0 spec |
| MP3 | Codec.Mp3 | - | Header + framing complete; bit-exact decode unverified. minimp3 port (1469 LOC, scalar) covering MPEG-1/2/2.5 Layer III, MS+intensity stereo, ID3v2 skip, Xing VBR. Layer I/II rejection passes. End-to-end PCM decode against a reference clip is deferred until an MP3 test vector lands in test-corpus/. | ISO/IEC 11172-3 / minimp3 |
| Vorbis | Codec.Vorbis | - | Partial: stb_vorbis structural port (1295 LOC) covering Ogg page reassembly, codebooks (lookup 0/1/2), floor 1, residue 0/1/2, channel coupling, IMDCT. Floor 0 throws NotSupportedException. End-to-end test marked Inconclusive until a test vector lands in test-corpus/. | Vorbis I spec |
| Opus | Codec.Opus | - | Framing only: Ogg page walker + OpusHead/OpusTags + TOC byte + frame packing modes 0/1/2/3 + range decoder (ec_dec) all real. CELT and SILK pipelines are stubs that emit silence at the correct sample count. Hybrid mode throws NotSupportedException. | RFC 6716 |
| AAC-LC | Codec.Aac | - | Framing only: ADTS frame parser + AudioSpecificConfig + element dispatcher + profile gating real. Spectral pipeline + Huffman tables + IMDCT + filterbank scaffolded but spectral data tables are TODO. HE-AAC v1/v2 + Main/SSR/LTP/ER all throw NotSupportedException. | ISO/IEC 14496-3 |
Implementation philosophy. The four new audio codecs (MP3 / Vorbis / Opus / AAC-LC) ship under the project's "no toy implementations" rule — partial state is documented openly (in class summaries, in Assert.Ignore messages, and in this table) rather than silently producing wrong PCM. Future work: bit-pack debugging for MP3, real CELT/SILK for Opus, spectral table population for AAC, reference test-vector validation across all four.
Code paths that throw NotSupportedException or NotImplementedException rather than silently producing wrong output. Documented here so expectations match behaviour.
| Area | State |
|---|---|
| MP3 / Vorbis / Opus / AAC-LC | Partial decoder state — see Audio Codecs table. MP3 bit-exact needs test vectors; Vorbis floor 0 throws (obsolete since 2004 — stb_vorbis doesn't implement it either); Opus CELT/SILK + AAC spectral filterbank are multi-week DSP projects |
| LZFSE V1 / V2 blocks | FSE/tANS backend not implemented — uncompressed (bvxn) + LZVN blocks work. Full LZFSE needs ~1500 LOC new code (Apple reference impl) |
| ZPAQ | Reader requires a ZPAQL virtual machine (not implemented). Multi-week bytecode-VM project |
| StuffIt X writer | Proprietary element-catalog / P2-varint writer not implemented — WORM emits valid StuffIt! envelope shell. No public spec, only reverse-engineering notes |
| UMX writer | Full export table + compact-index music encoding not implemented — WORM emits valid header only |
| OLE2 application streams (DOC/XLS/PPT/MSG/ThumbsDb/MSI) | CFB envelope round-trips through our reader and libgsf/Apache POI, but the internal WordDocument / WorkBook / PowerPoint Document / MAPI / Catalog / Installer-DB streams are not synthesised. Each is a 400+ page MS Open Specification |
| Inno Setup reader | Individual file extraction from Setup.1 not implemented for some installer versions |
| EROFS | Compressed layouts (LZ4/LZMA-compressed inodes) not decompressed — plain-storage inodes work |
| ExtRemover | Indirect-block traversal not implemented for file removal (direct blocks work) |
| F2FS writer | Indirect-block allocation not implemented — per-file max ≈ 3.6 MB (923 direct pointers in inode, no direct_node/indirect_node chain) |
| RAR create | Only v4 and v5 archive creation are implemented |
Fixes landed in this pass (documented here so the list above is what's actually pending):
| Area | Before | After |
|---|---|---|
| CAB LZX | Enum marked "(not implemented)" | Comment was stale — reader (LzxDecompressor) and writer (BB_Lzx) were already wired |
| MPQ bzip2 (method 0x10) | Returned payload raw | Now invokes Bzip2Stream on the payload, falls back to raw on decode failure |
| FAT32 writer | Threw NotSupportedException for images ≥ 65525 clusters | Full FATGEN103-compliant FAT32: extended BPB (BPB_RootClus/BPB_FSInfo/BPB_BkBootSec), FSInfo sector with the three canonical signatures, backup boot sector at sector 6, cluster-2 root directory with EoC marker, 32 reserved sectors, FS-type string FAT32 |
| ProDOS tree storage | Files > 128 KB rejected outright | Writer emits storage-type-3 trees: master index block + up to 256 subordinate index blocks → 32 MB per file. Reader already handled type 3 |
Snapshot: 41 filesystems, 37 read+write, 4 read-only. Spec = the external document/source the writer was validated against.
| FS | State | Spec | Notes |
|---|---|---|---|
| FAT12/16/32 | R/W | Microsoft FATGEN | Full BPB, 0x55 0xAA signature, auto-select FAT12/16/32 by cluster count. FAT32 includes extended BPB, FSInfo sector + backup boot sector + cluster-2 root directory |
| exFAT | R/W | Microsoft exFAT spec | Full VBR, boot-checksum sector (§3.1.3), Upcase/Bitmap/VolumeLabel root entries |
| NTFS | R/W | MS-NT on-disk / TSK docs | All 16 system MFT files, USA fixup, LZNT1 compression |
| DoubleSpace / DriveSpace CVF | R/W | MS-DOS 6 Technical Reference | Full MDBPB with DBLS/DVRS signature, MDFAT + BitFAT, inner FAT12/16 with VFAT LFN. Stored runs only (JM/LZ77 is TODO) |
| HPFS | RO | OS/2 Inside Story | Read-only descriptor (no writer) |
| FS | State | Spec | Notes |
|---|---|---|---|
| ext2/3/4 | R/W | Linux kernel fs/ext2/ext2.h | Spec-compliant, random UUID; FS revision 0 GOOD_OLD |
| XFS v5 | R/W | Linux fs/xfs/libxfs/xfs_format.h | v5 with v3 dinodes, sb_crc CRC-32C, sb_features_*/sb_meta_uuid/sb_pquotino |
| JFS | R/W | Linux fs/jfs/jfs_superblock.h | pxd_t bit-packing (24-bit length + 40-bit address), inline dtree root, aggregate inode table with FILESYSTEM_I=16 |
| ReiserFS 3.6 | R/W | Linux fs/reiserfs/reiserfs.h | Spec-correct offsets, ReIsEr2Fs @+52, leaf block-head. No block CRC (v3.6 doesn't have them) |
| F2FS | R/W | Linux include/linux/f2fs_fs.h | Superblock magic at block-offset 0x400, CP + SIT + NAT + SSA + Main, CRC-32C, inline dentries in root inode |
| Btrfs | R/W | Btrfs on-disk format | Real chunk tree (SYSTEM/METADATA/DATA), sys_chunk_array in superblock, DEV_ITEM, CRC-32C on every block header |
| ZFS | R/W | OpenZFS source | 4 vdev labels, 128-entry uberblock ring with Fletcher-4, XDR NVList, MOS + DSL dir/dataset + microzap for root, pool version 28 |
| UFS1/FFS | R/W | FreeBSD sys/ufs/ffs/fs.h | fs_magic=0x011954 at sb offset 1372, cg_magic, fs_cs summary block |
| UBIFS | RO | Linux fs/ubifs/ | Read-only; no writer (LPT/TNC trees are multi-week) |
| JFFS2 | RO | Linux fs/jffs2/ | Read-only; log-structured node-scanner only |
| YAFFS2 | RO | Aleph One YAFFS2 spec | Read-only; OOB/ECC layout not emittable |
| BFS (BeOS/Haiku) | RO | Haiku OS source | Read-only; superblock surfacing only |
| FS | State | Spec | Notes |
|---|---|---|---|
| HFS classic | R/W | Apple "Inside Macintosh: Files" (1992) | Real B-tree catalog + extents trees, 102-byte file records, 70-byte dir records, 46-byte thread records with (parent, name) sort |
| HFS+ | R/W | Apple TN1150 | Catalog file record at spec 248 bytes; dataFork @ 88, resourceFork @ 168 |
| APFS | R/W | Apple File System Reference | NX superblock + container OMAP + APSB volume + FS B-tree, Fletcher-64 checksums, single-container WORM |
| MFS | R/W | Inside Macintosh V (1985) | Pre-HFS flat FS; drSigWord=0xD2D7 |
| FS | State | Spec | Notes |
|---|---|---|---|
| Commodore 1541 (.d64) | R/W | VICE emulator docs | 174 848 bytes, 35 tracks, directory at T18S1+ |
| Commodore 1571 (.d71) | R/W | VICE docs | 349 696 bytes, dual-side BAM |
| Commodore 1581 (.d81) | R/W | VICE docs | 819 200 bytes, 80 × 40 × 256, DOS "3D" signature |
| C64 tape (.t64) | R/W | T64 format spec | "C64S tape image file" header |
| Amiga ADF (OFS/FFS) | R/W | Amiga ROS docs | 901 120 (DD) / 1 802 240 (HD), "DOS\1" magic, BSDsum checksums |
| Amiga DMS | R/W | xDMS source | "DMS!" header with CRC16 |
| Atari ST MSA | R/W | MSA format spec | 0x0E0F BE magic, per-track RLE |
| Atari 8-bit ATR | R/W | AtariDOS 2 VTOC | 16-byte header + 92 160 sector bytes, VTOC @ sector 360 |
| Apple DOS 3.3 | R/W | Apple DOS manual | 143 360 bytes, catalog at T17S15 chain, 35-byte entries |
| ProDOS | R/W | ProDOS TRM | 143 360 (5.25") / 819 200 (800K), 39-byte entries |
| BBC Micro DFS (.ssd) | R/W | Acorn DFS spec | 102 400 (40-track) / 204 800 (80-track), 31×8-byte dir entries |
| ZX Spectrum SCL | R/W | TR-DOS .scl spec | "SINCLAIR" magic + LE32 trailing sum |
| ZX Spectrum TR-DOS (.trd) | R/W | TR-DOS spec | 655 360 bytes, 160×16×256 |
| Amstrad CPC DSK | R/W | CPCEMU disk format | "MV - CPCEMU Disk-File" magic |
| HP LIF (.lif) | R/W | HP LIF utility manual | 256-byte sectors, flat directory, 0x8000 BE magic |
| CP/M 2.2 | R/W | DR CP/M 2.2 BDOS reference | 256 256 bytes (77×26×128), 64-entry flat directory |
| DEC RT-11 (.rt11/.rx01) | R/W | DEC RT-11 Volume + File Formats | RX01 8" SSSD ~256 KB, 512-byte blocks, RAD-50 6.3 names |
| OS-9 RBF (.os9/.rbf) | R/W | Microware OS-9 Tech Reference | CoCo 35-track DSDD ~315 KB, 256-byte sectors, big-endian |
| Commodore G64 (.g64) | RO | VICE emulator docs | GCR-encoded track dump; raw GCR bytes per track |
| Commodore NIB (.nib) | RO | nibtools docs | Raw 84-half-track nibble dump |
| FS | State | Spec | Notes |
|---|---|---|---|
| ISO 9660 | R/W | ECMA-119 | PVD at sector 16, VDST @ 17, L+M path tables, 2 KB blocks, flat directory (no Rock Ridge/Joliet) |
| UDF | R/W | ECMA-167 | VRS (BEA01/NSR02/TEA01) @ 16-18, Main VDS @ 32-35, AVDP @ 256. CRC-16-XMODEM + TagChecksum on every tag |
| FS | State | Spec | Notes |
|---|---|---|---|
| SquashFs | R/W | SquashFs 4.0 spec | hsqs magic, zlib + Adler-32, FlagNoFragments |
| CramFs | R/W | Linux fs/cramfs/ | 0x28CD3D45 magic, CRC-32, zlib blocks |
| RomFs | R/W | Linux fs/romfs/ | -rom1fs- magic, BE fields, self-correcting checksum |
| EROFS | RO | Linux fs/erofs/ | Read-only; variable-length encoded inodes |
| Minix v1/2/3 | R/W | Linux fs/minix/ | Superblock magics 0x137F/0x138F/0x2468/0x2478/0x4D5A |
| VDFS | R/W (writer toy) | Gothic-game engine reverse engineering | Proprietary, no public spec; writer is flat-no-checksum (left in place until spec available) |
Containers holding filesystem payloads; the inner FS is a separate descriptor.
| Format | State | Reference | Notes |
|---|---|---|---|
| VHD (Microsoft) | R/W | VHD format spec | "conectix" magic, fixed + dynamic |
| VMDK (VMware) | R/W | VMDK spec | "KDMV" magic |
| QCOW2 (QEMU) | R/W | QCOW2 docs | Sparse format; WORM wraps raw disk with L1/L2, v2 |
| VDI (VirtualBox) | R/W | VBox source | Single-disk format |
| BIN/CUE | R/W | cue sheet | Raw disc image; WORM emits ISO 9660 cooked sectors |
| MDF | R/W | Alcohol docs | Alcohol 120% — WORM emits ISO 9660 |
| NRG | R/W | Nero format docs | Nero — WORM emits ISO 9660 with NER5 footer |
| CDI | R/W | DiscJuggler docs | DiscJuggler — WORM emits ISO 9660 with CDI v2 footer |
| DMG | R/W | libdmg-hfsplus | Apple disk image — WORM emits raw mish blocks per partition (no zlib/bz2/lzfse encoding); read-side handles all four compressions |
CompressionWorkbench treats executable packers (UPX, demoscene compressors, classic DOS packers, modern PE protectors) as pseudo-archives — they get the same List / Extract interface as ZIP or TAR. Each descriptor surfaces a metadata.ini with detected evidence (version byte, signature offset, packer-header fields), an mz_header.bin / hunk_header.bin snapshot when applicable, and a packed_payload.bin (or an in-process decompressed body for UPX).
Detection is hardened against tampered binaries: the structural fingerprint (BSS-style first section + RWX flags + entry point in last section + payload entropy ≥ 7.5) catches binaries where the UPX! magic and section names have been wiped. A brute-force PackHeader scan validates format / method / uLen / cLen / level / version even when the magic bytes are zeroed.
| Capability | State | Notes |
|---|---|---|
| Detection (canonical UPX) | Yes | Section names UPX0/UPX1/UPX2; UPX! packer-header magic; $Info: This file is packed with the UPX… tooling banner |
| Detection (tampered) | Yes | Brute-force PackHeader scan validates every field even with magic wiped. Structural fingerprint requires BSS-style first section (RawSize=0 + VirtualSize>0) |
| NRV2B_LE32 (method 2) | Yes | In-process via BB_Nrv2b |
| NRV2D_LE32 (method 5) | Yes | In-process via BB_Nrv2d (UCL-spec port) |
| NRV2E_LE32 (method 8) | Yes | In-process via BB_Nrv2e (UCL-spec port) |
| NRV2B_LE16 / LE8 (methods 4, 3) | Yes | Nrv2bBuildingBlock.DecompressRaw{Le16,Byte} |
| NRV2D_LE16 / LE8 (methods 7, 6) | Yes | Nrv2dBuildingBlock.DecompressRaw{Le16,Byte} |
| NRV2E_LE16 / LE8 (methods 10, 9) | Yes | Nrv2eBuildingBlock.DecompressRaw{Le16,Byte} |
| LZMA (method 14) | Yes | Via BB_Lzma |
| DEFLATE (method 15) | No | Rare in UPX output; deferred |
| PE header reconstruction | No | Surfaces decompressed_payload.bin as a raw blob; IAT reconstruction + OEP restoration is delegated to upx -d |
Detection confidence is exposed as a 3-tier DetectionConfidence enum:
- None — no evidence; descriptor's `List`/`Extract` throws so `FormatDetector` falls back to plain PE/ELF resource enumeration.
- Heuristic — structural fingerprint match (BSS-style first section + RWX flags + entry in last section + high entropy) but no PackHeader.
- Confirmed — PackHeader found (with or without intact magic), canonical section names present, or tooling banner intact.
The evidence record exposes every contributing signal so users can audit why a binary was flagged.
Full decompression would require the original tool's runtime stub or a bespoke decompressor that has not been ported.
| Packer | Container | Signature | Reference |
|---|---|---|---|
| PKLITE | DOS .exe | PKLITE Copr. / PKlite Copr. in first 1 KB | bp/pklite |
| LZEXE | DOS .exe | LZ91 / LZ09 signature in first 1 KB | Bellard's page |
| Petite | Win32 PE | .petite* section name or Petite literal | Un4seen Petite |
| Shrinkler | Amiga HUNK | HUNK magic (0x000003F3) + Shrinkler literal | Blueberry's repo |
| FSG | Win32 PE | FSG! magic in first 16 KB | x86asm forum |
| MEW | Win32 PE | Section name starting with .MEW / MEW | Northfox page |
| MPRESS | Win32 PE / Linux ELF | .MPRESS1 / .MPRESS2 section or MPRESS / MATCODE literal | MATCODE |
| Crinkler | Win32 PE | Crinkler / crinkler literal | crinkler.net |
| kkrunchy | Win32 PE | kkrunchy literal | Farbrausch |
| ASPack | Win32 PE | .aspack / .adata section or ASPack literal | aspack.com |
| NsPack | Win32 PE | .nsp* section name or NsPack literal | PEiD DB |
| Yoda's Crypter | Win32 PE | .yC / yC section or Yoda's literal | Yoda's site |
| ASProtect | Win32 PE | ASProtect literal | aspack.com/asprotect |
| Themida | Win32 PE | Themida / WinLicense literal | Oreans |
| VMProtect | Win32 PE | .vmp* section or VMProtect literal | vmpsoft.com |
| BB | Description | Status |
|---|---|---|
| BB_Nrv2b | UCL NRV2B LE32 — LZ77 + interleaved variable-length integer bit stream | Spec-faithful, UPX-compatible decoder |
| BB_Nrv2d | UCL NRV2D LE32 — three-bit-per-iter offset varint + low-bit length tying | Spec-faithful, UPX-compatible decoder |
| BB_Nrv2e | UCL NRV2E LE32 — entropy-refined NRV2D variant | Spec-faithful, UPX-compatible decoder |
| BB_Lzma | LZMA dictionary compressor | Pre-existing |
Each UCL BB emits a 4-byte little-endian uncompressed-size header so the building block can round-trip standalone via IBuildingBlock.Compress / Decompress. Nrv2{b,d,e}BuildingBlock.DecompressRaw(compressed, exactOutputSize) helpers are available for callers reading bare streams (UPX payloads, OS/2 drivers, retro-computing collections) without the size header.
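As a rough illustration of the framing described above, the following sketch prefixes a bare stream with the 4-byte little-endian uncompressed size and strips it again. `Frame` and `Unframe` are hypothetical helpers written for this example, not the project's `IBuildingBlock` API; only the header layout is taken from the text.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper: prepend the 4-byte little-endian uncompressed size
// to a bare compressed stream.
static byte[] Frame(ReadOnlySpan<byte> bareStream, uint uncompressedSize)
{
    var block = new byte[4 + bareStream.Length];
    BinaryPrimitives.WriteUInt32LittleEndian(block, uncompressedSize);
    bareStream.CopyTo(block.AsSpan(4));
    return block;
}

// Hypothetical helper: split a framed block into the declared size and the bare stream.
static (uint UncompressedSize, byte[] BareStream) Unframe(ReadOnlySpan<byte> block)
{
    if (block.Length < 4)
        throw new ArgumentException("Block is shorter than the 4-byte size header.", nameof(block));
    uint declaredSize = BinaryPrimitives.ReadUInt32LittleEndian(block);
    return (declaredSize, block[4..].ToArray());
}

// Usage: the bytes here are placeholders, not a real NRV2 stream.
byte[] sampleStream = { 0xDE, 0xAD, 0xBE, 0xEF };
byte[] framedSample = Frame(sampleStream, 1024);
var (expectedSize, recovered) = Unframe(framedSample);
Console.WriteLine($"declared size {expectedSize}, bare stream of {recovered.Length} bytes");
```

A caller holding a bare stream (for example a UPX payload) would skip the header entirely and pass the exact output size to the `DecompressRaw` helpers instead.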
CompressionWorkbench exposes the same core library through five different surfaces. Pick the one that fits the task.
Universal archive tool with smart conversion, optimal re-encoding, benchmarking, and analysis built in.
| Command | Alias | What it does |
|---|---|---|
| `list <archive>` | `l` | List contents of an archive |
| `extract <archive> [files...]` | `x` | Extract files from an archive |
| `create <archive> <files...>` | `c` | Create a new archive |
| `test <archive>` | `t` | Test archive integrity |
| `info <archive>` | - | Show detailed archive information |
| `convert <input> <output>` | - | Convert between archive formats |
| `optimize <input> <output>` | `opt` | Re-encode with optimal compression |
| `benchmark <file>` | `bench` | Benchmark all building blocks on the supplied data |
| `analyze <file>` | - | Run binary analysis (detection + entropy + trial decompress) |
| `auto-extract <file>` | - | Recursive nested extraction (see below) |
| `batch <dir>` | - | Scan a directory in parallel and aggregate format stats |
| `suggest <file>` | - | Platform-aware format recommendation |
| `reverse-engineer <tool>` | `reveng` | Black-box probing of an unknown compression tool |
| `tool (init\|list\|add\|run\|remove)` | - | Manage external-tool templates |
| `formats` | - | List all supported formats |
Examples:

    cwb list archive.zip
    cwb extract archive.7z -o ./output
    cwb x archive.rar -p mypassword
    cwb create output.zip myDir file1.txt *.txt
    cwb create output.7z file.txt --method lzma2+
    cwb convert input.tar.gz output.tar.xz
    cwb optimize input.zip optimized.zip
    cwb benchmark largefile.bin
    cwb analyze unknown.bin
    cwb auto-extract sample.vhd --recursive
    cwb suggest big.csv # "→ consider zstd -19 (columnar/text, moderate entropy)"

3-tier conversion model. cwb convert picks the cheapest strategy that preserves data:
| Tier | Strategy | Example |
|---|---|---|
| 1 | Bitstream transfer (zero decompression) | .gz ↔ .zlib, .zip ↔ .gz |
| 2 | Container restream (decompress wrapper only) | .tar.gz → .tar.xz |
| 3 | Full recompress (extract + re-encode) | .zip → .7z |
Method+ system. Append + to any method name for optimal encoding: deflate+ uses Zopfli, lzma+ uses Best, lz4+ uses HC.
Tool templates. cwb tool registers external CLI tools (7z, binwalk, file, trid, …) in ~/.cwb-tools.json. Templates use {input}, {output}, {outputDir} placeholders and can capture stdout, pipe stdin, or set a timeout. cwb tool init pre-populates templates for common tools.
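A minimal sketch of how placeholder substitution in a tool template could work. `ExpandTemplate` and the sample template string are illustrative assumptions; the actual template schema in ~/.cwb-tools.json is not shown here, and quoting or escaping of paths is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: replace each {name} placeholder with its value in a single pass.
// Unknown placeholders are left untouched so a typo stays visible in the final command.
static string ExpandTemplate(string template, IReadOnlyDictionary<string, string> values)
{
    return Regex.Replace(template, @"\{(\w+)\}", match =>
        values.TryGetValue(match.Groups[1].Value, out string? replacement)
            ? replacement
            : match.Value);
}

// Usage with an illustrative argument template.
var placeholderValues = new Dictionary<string, string>
{
    ["input"] = "sample.bin",
    ["outputDir"] = "out",
};
Console.WriteLine(ExpandTemplate("x {input} -o{outputDir}", placeholderValues));
```

Doing the substitution in one regex pass, rather than chaining `string.Replace` calls, avoids re-expanding a placeholder-like sequence that happens to appear inside a substituted path.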
The archive browser is the conventional half: file list with icons, columns (name, size, compressed, ratio, method, modified), open / extract / create / test flows, preview window (text + hex), properties dialog with compression-ratio visualisation, benchmark tool, and Explorer context-menu integration (Compression.Shell).
The analyser is the interesting half. When you drop an unknown binary on the UI, it never says "unsupported" — it shows you what the bytes look like. The Binary Analysis wizard has a toolbar that walks you through progressively deeper investigation:
- Scan Results — every registered magic-byte signature that matches, with offsets and confidence.
- Fingerprints — algorithm identification from byte-distribution and byte-pair statistics.
- Entropy Map — per-region entropy profile with CUSUM change-point detection and 1D-Canny edge sharpening. Structured data (text, tables) shows low entropy; compressed/encrypted regions show high entropy; boundaries between them are marked.
- Trial Decompress — runs every registered stream decompressor in parallel with per-trial timeout and early-terminates on a low-entropy output. If any decoder produces plausible output, it is offered for preview.
- Chain — multi-layer compression reconstruction (e.g. `gzip(bzip2(data))`). Recursive trial decompression continues until entropy stops dropping.
- Statistics — full byte distribution, bigram histogram, chi-square randomness test, longest run, run-length distribution.
- Strings — ASCII / UTF-8 / UTF-16 string search with regex support.
- Structure — ImHex/010-style `.cwbt` templates. Built-in templates ship for ZIP, PNG, BMP, ELF, Gzip; you can write your own using u8-u64 / i8-i64 / f16-f64 (LE/BE), char/u8 arrays, BCD, fixed-point, color, date/time, and network types with dynamic length via field references or repeat-to-EOF.
The Heatmap Explorer is the visual first pass. It renders the file as a 16×16 colour grid in which each of the 256 cells covers a proportional region of the file.
| Cell colour | Meaning | Entropy |
|---|---|---|
| Blue | Low entropy — zeros, padding, simple headers | 0.0–3.0 |
| Green | Structured data — tables, records, text | 3.0–5.5 |
| Orange | Compressed data | 5.5–7.5 |
| Red | Random / encrypted (incompressible) | 7.5–8.0 |
| Purple | A known format signature was detected here | any |
Click any cell to subdivide into another 16×16 grid — it recursively zooms in on a region. Hovering shows offset, size, entropy, unique-byte count, and the detected signature (if any). Extract on a purple cell saves just that region to a file. The explorer only samples each block, so it handles arbitrarily large files without loading them into memory. Accessible from the analyser tab or standalone at Tools → Heatmap Explorer.
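The per-cell colouring can be sketched as a Shannon-entropy calculation followed by a lookup against the bands in the table above. This is an illustration rather than the explorer's actual code: `ShannonEntropy` and `ColourFor` are hypothetical names, and sending boundary values (3.0, 5.5, 7.5) to the higher band is an assumption, since the table ranges share their endpoints.

```csharp
using System;

// Shannon entropy of a byte block in bits per byte (0.0 to 8.0).
static double ShannonEntropy(ReadOnlySpan<byte> block)
{
    if (block.IsEmpty) return 0.0;

    Span<int> counts = stackalloc int[256];
    counts.Clear();
    foreach (byte value in block) counts[value]++;

    double entropy = 0.0;
    foreach (int count in counts)
    {
        if (count == 0) continue;
        double probability = (double)count / block.Length;
        entropy -= probability * Math.Log2(probability);
    }
    return entropy;
}

// Map an entropy value to a cell colour. A detected signature wins over entropy.
// Assumption: boundary values fall into the higher band.
static string ColourFor(double entropy, bool signatureDetected = false)
{
    if (signatureDetected) return "Purple";
    if (entropy < 3.0) return "Blue";
    if (entropy < 5.5) return "Green";
    if (entropy < 7.5) return "Orange";
    return "Red";
}

// Usage: an all-zero block versus a block containing every byte value once.
byte[] zeros = new byte[4096];
byte[] ramp = new byte[256];
for (int index = 0; index < ramp.Length; index++) ramp[index] = (byte)index;

Console.WriteLine($"{ShannonEntropy(zeros):F2} -> {ColourFor(ShannonEntropy(zeros))}");
Console.WriteLine($"{ShannonEntropy(ramp):F2} -> {ColourFor(ShannonEntropy(ramp))}");
```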
Everything the UI exposes is available as a .NET library under Compression.Analysis:
- Signature Scanner — magic-byte detection for every registered format (hash-indexed, O(n)).
- Algorithm Fingerprinting — statistical fingerprinting against known compression-output distributions.
- Trial Decompression — `TryAllAsync` runs every registered stream decompressor in parallel with per-trial timeout and early termination.
- Chain Reconstruction — discovers layered compression.
- Entropy Mapping — per-region entropy profiling with boundary detection; multi-resolution entropy pyramid (64 KB / 8 KB / 1 KB / 256 B), CUSUM binary segmentation, KL-divergence + chi-square boundary validation, 1D-Canny edge sharpening.
- String Extraction — ASCII / UTF-8 / UTF-16 with regex.
- Structure Templates — `.cwbt` template language.
- Streaming Analysis — reads the first 64 KB for magic/header; computes entropy in 64 KB chunks; returns per-chunk entropy profiles for arbitrarily large files.
- Black-box tool integration — `ExternalToolRunner`, `ToolOutputParser`, `CrossValidator`, `FallbackDecompressor` with auto-discovery of tools on `PATH`.
- AutoExtractor — recursive nested extraction: archives inside archives, disk images → partition tables → filesystems → files. Configurable max depth (default 5) and file-size limits.
- BatchAnalyzer — parallel directory scan with aggregate format statistics.
- PayloadCarver, StringsExtractor, EntropyHeatmap — standalone helpers.
Detection pipeline. Magic bytes → parallel trial decompression (early-termination on low-entropy output) → extension fallback → deep probe (header parse + structural validation + integrity check).
Partition table support. MbrParser (four primaries at 0x1BE + extended/logical chain) + GptParser (EFI PART at LBA 1) + PartitionTypeDatabase (type-byte / GUID → filesystem name). Recursive descent via --recursive: disk image → partition table → filesystem → archive chain.
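To make the on-disk layout concrete, here is a small sketch that reads the four primary MBR entries at 0x1BE and checks for the GPT signature at LBA 1. It is not the project's `MbrParser` or `GptParser`; the function names are hypothetical, a 512-byte sector size is assumed, and the extended/logical chain is not followed.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Read the four 16-byte primary entries that start at offset 0x1BE of sector 0.
static List<(byte Type, uint FirstLba, uint SectorCount)> ReadPrimaryPartitions(ReadOnlySpan<byte> sector0)
{
    if (sector0.Length < 512 || sector0[0x1FE] != 0x55 || sector0[0x1FF] != 0xAA)
        throw new ArgumentException("Missing 0x55 0xAA boot signature.", nameof(sector0));

    var partitions = new List<(byte Type, uint FirstLba, uint SectorCount)>();
    for (int slot = 0; slot < 4; slot++)
    {
        ReadOnlySpan<byte> entry = sector0.Slice(0x1BE + slot * 16, 16);
        byte type = entry[4];
        if (type == 0x00) continue; // unused slot
        uint firstLba = BinaryPrimitives.ReadUInt32LittleEndian(entry.Slice(8, 4));
        uint sectorCount = BinaryPrimitives.ReadUInt32LittleEndian(entry.Slice(12, 4));
        partitions.Add((type, firstLba, sectorCount));
    }
    return partitions;
}

// A GPT header starts with the ASCII signature "EFI PART" at LBA 1.
static bool HasGptHeader(ReadOnlySpan<byte> image, int sectorSize = 512) =>
    image.Length >= sectorSize + 8 && image.Slice(sectorSize, 8).SequenceEqual("EFI PART"u8);

// Usage: a synthetic two-sector image with a protective-MBR entry (type 0xEE).
byte[] disk = new byte[1024];
disk[0x1FE] = 0x55;
disk[0x1FF] = 0xAA;
disk[0x1BE + 4] = 0xEE;
BinaryPrimitives.WriteUInt32LittleEndian(disk.AsSpan(0x1BE + 8), 1);
"EFI PART"u8.CopyTo(disk.AsSpan(512));

foreach (var partition in ReadPrimaryPartitions(disk))
    Console.WriteLine($"type 0x{partition.Type:X2}, first LBA {partition.FirstLba}");
Console.WriteLine($"GPT header present: {HasGptHeader(disk)}");
```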
Two complementary flows for reverse-engineering unknown compression tools and file formats.
Black-box tool probing runs the target tool with ~40 controlled probe inputs (empty, single byte, incrementing patterns, text, random data, various sizes 0–64 KB), cross-correlates all outputs, and reports: magic bytes, size-field offsets (LE/BE, 2/4/8 byte), the compression algorithm (trial decompression against all 49 building blocks), filename storage (UTF-8 / UTF-16), determinism, payload entropy.

    cwb reverse-engineer MyTool.exe "{input} {output}"
    cwb reverse-engineer packer.exe "--pack {input} --out {output}" --timeout 10000

The GUI offers the same via Tools → Reverse Engineer Format as a step-by-step wizard with progress reporting.
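One step of black-box probing, the search for size fields, can be sketched as scanning a tool's output for the known input length encoded as a 2-, 4- or 8-byte integer in either byte order. `FindSizeFields` is a hypothetical name for this illustration and is not claimed to match the project's implementation.

```csharp
using System;
using System.Collections.Generic;

// Report every offset where inputLength appears as a 2/4/8-byte LE or BE integer.
static List<(int Offset, int Width, bool BigEndian)> FindSizeFields(ReadOnlySpan<byte> toolOutput, ulong inputLength)
{
    var hits = new List<(int Offset, int Width, bool BigEndian)>();
    Span<byte> buffer = stackalloc byte[8];

    foreach (int width in new[] { 2, 4, 8 })
    {
        // Skip widths too narrow to hold the value.
        if (width < 8 && (inputLength >> (width * 8)) != 0) continue;

        foreach (bool bigEndian in new[] { false, true })
        {
            // Encode the length in the candidate width and byte order.
            Span<byte> encoded = buffer[..width];
            for (int i = 0; i < width; i++)
            {
                int shift = (bigEndian ? width - 1 - i : i) * 8;
                encoded[i] = (byte)(inputLength >> shift);
            }

            ReadOnlySpan<byte> pattern = encoded;
            for (int offset = 0; offset + width <= toolOutput.Length; offset++)
            {
                if (toolOutput.Slice(offset, width).SequenceEqual(pattern))
                    hits.Add((offset, width, bigEndian));
            }
        }
    }
    return hits;
}

// Usage: a made-up 7-byte output for a 300-byte probe input (300 = 0x012C).
byte[] probeOutput = { (byte)'P', (byte)'K', 0x00, 0x00, 0x01, 0x2C, 0xAA };
foreach (var hit in FindSizeFields(probeOutput, 300))
    Console.WriteLine($"offset {hit.Offset}, {hit.Width} bytes, {(hit.BigEndian ? "BE" : "LE")}");
```

A single probe produces false positives, because small values match at several widths and short patterns can occur by chance. Intersecting the hits across probes of different sizes, as the cross-correlation step describes, is what narrows them down.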
Static analysis mode works when you have archive files with known original content but no tool to run. StaticFormatAnalyzer accepts pairs of (original, archived) and locates where the content appears inside the archive — verbatim or compressed with any known building block — then infers header/footer structure, size fields, and compression algorithm without ever executing an external tool.
Explorer context-menu integration. Right-click any file to invoke cwb commands directly: list, extract, test, optimise.
Self-extracting archive stubs for console and GUI use. The stub is a normal cwb-style reader prepended to an archive overlay; running the resulting exe extracts in place. Used for single-file distributions via Costura.Fody.
The test suite includes three tiers of external validation beyond the standard self-round-trip tests.
Self round-trip. All formats that support both create and extract are tested by creating an archive, extracting it, and verifying the output matches the original. Runs as part of the normal dotnet test.
External tool interop (Category=EndToEnd). Verifies our output is readable by external tools and vice versa. Dynamic tool discovery via PATH and common install locations; gracefully skips when tools are unavailable. Covered: 7z, gzip, bzip2, xz, zstd, lz4, tar. Both directions are tested: create with our library → read with external tool, and vice versa.

    dotnet test --filter "Category=EndToEnd"

.NET BCL interop. Verifies interoperability with System.IO.Compression (GZipStream, DeflateStream, BrotliStream, ZipArchive).
OS integration (Category=OsIntegration). Platform-specific tooling:
- Windows — PowerShell `Compress-Archive`/`Expand-Archive`, Windows `tar`, `certutil`, `Mount-DiskImage`, DISM
- Linux — `mtools` (FAT), `genisoimage` (ISO), `qemu-img` (virtual disks), `debugfs` (ext4), `cpio`
Platform detection + Assert.Ignore means tests never fail due to missing prerequisites.

    dotnet test --filter "Category=OsIntegration"

Filesystem validation matrix. Compression.Tests/ExternalFsInteropTests.cs wires 18 filesystem-image tests against the tools below:
| Tool | Present? | Validates |
|---|---|---|
| 7-Zip (portable) | Bundled | NTFS, FAT, exFAT, ext, HFS, HFS+, ISO 9660, UDF, SquashFS, CramFS (list/extract) |
| qemu-img | Optional — install from https://qemu.weilnetz.de/w64/ | VHD, VMDK, QCOW2, VDI (info + check) |
| DISM | Windows built-in | WIM, VHD, ISO |
| chkdsk | Windows built-in (admin + mounted volume) | FAT, exFAT, NTFS |
| mtools | Optional — install from Cygwin | FAT (non-admin) |
| WSL + mkfs.* / fsck.* | Optional — wsl --install as admin + reboot | ext / XFS / Btrfs / F2FS / JFS / ReiserFS / UDF / UFS |
| DOSBox-X + MS-DOS 6.0/6.2 | Opt-in — set CWB_MSDOS_DBLSPACE_BOOT_IMG | DBLSPACE CVF (DBLSPACE /CHKDSK D:) — see Compression.Tests/Support/MsDosImageStaging.md |
| DOSBox-X + MS-DOS 6.22 | Opt-in — set CWB_MSDOS_DRVSPACE_BOOT_IMG | DRVSPACE CVF (DRVSPACE /CHKDSK D:) — see Compression.Tests/Support/MsDosImageStaging.md |
| DOSBox-X + FreeDOS LiveCD | Auto (hash-pinned download) | FAT (CHKDSK D: from FreeDOS) — gate is [Explicit] because the LiveCD welcome screen races the autoexec |
Tests skip cleanly when the tool is missing; they never fail the suite on a tool-deficient machine.
Principles:
- No external compression code. Every algorithm is implemented from scratch in C#.
- Composable primitives. `Compression.Core` provides the building blocks; `FileFormat.*` / `FileSystem.*` projects compose them. `Compression.Core` never implements format interfaces — it is pure algorithm.
- Stream-oriented. All compression / decompression operates on `System.IO.Stream`.
- Immutable headers. File-format header structures are immutable record types.
- Testability. Every component is independently testable; NUnit tests cover primitives, format round-trips, and external interop.
- .NET 10 / C# 14. Latest language features, nullable reference types, warnings-as-errors.
Registry. The source generator (Compression.Registry.Generator) emits a RegisterFormats() method listing every IFormatDescriptor and a FormatDetector.Format enum with one entry per format — zero reflection, zero hand-maintained lists. The same mechanism discovers IBuildingBlock implementations in Compression.Core.

    dotnet build CompressionWorkbench.slnx
    dotnet test

- RFCs: RFC 1951 (Deflate), RFC 1952 (Gzip), RFC 1950 (Zlib), RFC 7932 (Brotli), RFC 8878 (Zstandard)
- libxad — the external archive decompressor, format reference
- XADMaster / The Unarchiver — modern continuation of libxad
- libarchive — multi-format reference
- Wikipedia list of archive formats
- ArchiveTeam Just Solve The File Format Problem — compression format documentation
- 7-Zip — multi-archiver reference
- Matt Mahoney's data-compression page — context-mixing compressors + corpus