Data Compression Explained (2012)

Author: Matt Mahoney
Copyright: (C) 2010-2012, Dell, Inc.

[!IMPORTANT] Licensing Terms: You are welcome to copy and distribute the contents of this work provided that:

The original license is included.

The material remains unmodified.

No fees or considerations are charged for the copies or derivative works.

Note: Standard "fair use" (cited quotes under one page) is exempt from these restrictions.

You can find the original text at http://mattmahoney.net/dc/dce.html.

📖 About this Book

This guide is designed for individuals aiming to grasp the mechanics of data compression or those intending to develop their own compression software.

Prerequisites for the reader:

Proficiency in programming.
Foundational mathematical knowledge.

🗺️ Roadmap of Content

Information Theory
- The impossibility of ~~universal compression~~.
- The boundaries of coding.
- The non-computable nature of modeling.
- Compression as an AI challenge.
Coding Techniques
- Huffman and arithmetic coding.
- Asymmetric codes (e.g., Rice, Golomb, unary, binary, numeric, extra bit).
- Archive formats (including encryption and error detection).
Modeling Strategies
- Fixed order: Bitwise, bytewise, and indirect.
- Variable order: PPM, CTW, and DMC.
- Context mixing: PAQ, ZPAQ, Crinkler, SSE, indirect SSE, linear/logistic mixing, and match.
Transforms
- RLE and the LZ family (LZ77, LZSS, deflate, LZMA, LZX, ROLZ, LZP, snappy, and deduplication).
- Dictionary encoding and LZW.
- BWT (including bzip2, BBB, DivSufSort, MSufSort v2/v3, Itoh-Tanaka, and Bijective).
- Symbol ranking and Predictive filtering (linear filtering, delta coding, color transforms).
- Specialized tools (precomp, E8E9) and Huffman pre-coding.
Lossy Compression
- Image formats (JPEG, PNG, GIF, BMP, TIFF) and JPEG recompression (WinZIP PackJPG, PAQ, Stuffit).
- Video (MPEG, NTSC) and Audio (Vorbis, Dolby, AAC, MP3, CD).

🧮 Information Theory

At its core, data compression is the practice of minimizing the bits required to transmit or store information.

Lossless vs. Lossy

Lossless: The original data can be reconstructed perfectly bit-for-bit.
Lossy: "Unimportant" data is permanently discarded. This is typically used for media where the human brain cannot perceive the missing details.

Example: The NTSC Broadcast Standard (1953–2009) The human eye is more sensitive to brightness (luminance) than to the fine differences between colors of similar brightness (chrominance). Consequently, NTSC transmitted color signals with lower resolution and a narrower frequency band than the monochrome signal.

The Lossy Pipeline:

The Mechanics of Compression

Every compression system relies on two primary components:

The Model: Predicts the probability distribution of the data (e.g., "In English, 'E' appears more often than 'Z'").
The Coder: Converts those probabilities into actual bits.

While coding has optimal, efficient solutions, modeling is an AI problem—it is an art of prediction.

Theoretical Limits

Information theory dictates that there are hard boundaries to lossless compression:

No Universal Compressor: It is mathematically impossible to create an algorithm that compresses every possible input.
Randomness: Truly random data cannot be compressed.
Recursion: You cannot compress data recursively; once compressed, the data appears random to the compressor.

The Optimal Bitrate: If a symbol has a probability $p$ , the theoretical minimum number of bits required to code it is: $\log_2 \frac{1}{p}$

Kolmogorov Complexity: Every string $x$ has a probability roughly equal to $2^{-|M|}$ , where $|M|$ is the length of the shortest possible description $M$ of that string. However, finding $M$ or calculating $|M|$ is uncomputable.

🚫 Why Universal Compression is Impossible

This is proven via the counting argument.

Suppose a program could compress all strings of length $n$ bits. To be reversible (lossless), every unique input must map to a unique output. However:

There are $2^n$ possible strings of length $n$ .
There are only $2^{n-1}$ possible strings shorter than $n$ bits.

Mathematically, the fraction of strings that can be compressed from $n$ bits down to $m$ bits is: $2^{m-n}$

Example: Fewer than $0.4\%$ of strings can be reduced by even a single byte. Therefore, any algorithm that compresses some strings must expand others. To prevent infinite expansion, one can add a single bit to the header to indicate if the data is stored uncompressed.

📉 Coding is Bounded

To illustrate the limits of coding, consider compressing the digits of $\pi$ . If we assume each digit (0-9) occurs with a probability of $0.1$ , we can compare different coding schemes:

Digit	BCD (Binary Coded Decimal)	Huffman	Binary (Unary-based)
0	`0000`	`000`	`0`
1	`0001`	`001`	`10`
2	`0010`	`010`	`110`
3	`0011`	`011`	`1110`
4	`0100`	`100`	`11110`
5	`0101`	`101`	...

setstats