Team Bucket

Project in Information Theoretic Modeling
Jani Rahkola, Paul Saikko

Implementation

Both Jani and Paul participated in exploring the data, generating ideas and finally implementing them.

The compressor and decompressor were both implemented in Python 3. A single byte at the beginning of each compressed file indicates which compression method was used to create it. Where appropriate, we used the lzma and bzip2 compressors in addition to domain-specific methods. Finally, we compressed the decompressor script itself to make up for not using a compiled language and to keep the script somewhat readable. The decompress.sh script first decompresses the decompressor program and then uses it to decompress the data.
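
As a rough illustration of the container format, the decompressor can dispatch on the leading method byte roughly as follows (the tag values here are made up for the example, not the ones our script actually uses):

    # Minimal sketch of the decompressor dispatch; tag values are illustrative.
    import bz2
    import lzma

    def decompress_file(path):
        with open(path, "rb") as f:
            method = f.read(1)[0]      # first byte identifies the compression method
            payload = f.read()
        if method == 0:
            return lzma.decompress(payload)
        if method == 1:
            return bz2.decompress(payload)
        # remaining tags would select one of the domain-specific decoders
        raise ValueError("unknown compression method %d" % method)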

Compressors used

LZMA

The poorly documented LZMA is an LZ77-based compression algorithm. It combines the dictionary-based Lempel-Ziv approach with range encoding, typically resulting in high compression ratios. We used the xz implementation instead of 7zip because it provides an interface similar to those of gzip and bzip2.

A Lempel-Ziv parsing of a string S is a sequence of tuples (d0,l0), (d1,l1), ... , (dm,lm) where each tuple (di,li) defines a substring of S starting at index k, where k is the length of the string encoded by (d0,l0), ... , (di-1,li-1). If some prefix pk of S[k..n) begins in S[0..k), it can be represented efficiently in the Lempel-Ziv parsing by letting di be the distance from k back to the index where pk begins and li be the length of pk. This essentially uses the already parsed prefix S[0..k) as the dictionary for encoding the suffix S[k..n). In practice the dictionary is usually limited to some fixed size M using a sliding window, so the substring S[k-M..k) is used as the dictionary. If no such pk exists (as is always the case for i=0), the tuple encodes only the single new symbol, (di,li) = (0,S[k]). Side note: (0,S[0]), (0,S[1]), ... , (0,S[n-1]) is a valid (though inefficient) parsing. As a short example, the string ananassana could have an LZ parsing of (0,'a'), (0,'n'), (2,3), (0,'s'), (1,1), (5,3).
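
A minimal greedy parser along these lines (unbounded dictionary, longest match wins) is sketched below. It is not how LZMA actually searches for matches, and its tie-breaking may pick a different but equally valid distance than the example above:

    # Greedy LZ77-style parsing: emit (distance, length) for matches and
    # (0, symbol) for literals, exactly as in the description above.
    def lz_parse(s):
        tuples, k = [], 0
        while k < len(s):
            best_d, best_l = 0, 0
            for start in range(k):                       # candidate match starts
                l = 0
                while k + l < len(s) and s[start + l] == s[k + l]:
                    l += 1                               # matches may run past k (overlap)
                if l > best_l:
                    best_d, best_l = k - start, l
            if best_l > 0:
                tuples.append((best_d, best_l))
                k += best_l
            else:
                tuples.append((0, s[k]))                 # literal: a new symbol
                k += 1
        return tuples

    def lz_decode(tuples):
        out = []
        for d, l in tuples:
            if d == 0:
                out.append(l)                            # l holds the literal symbol here
            else:
                for _ in range(l):
                    out.append(out[-d])                  # copy from the implicit dictionary
        return "".join(out)

    assert lz_decode(lz_parse("ananassana")) == "ananassana"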

LZMA stores these LZ tuples as one of seven different packet types. In LZMA a tuple that carries a symbol instead of a length is called a literal packet. An ordinary tuple with both a length and a distance component is called a match packet. In addition to these two, LZMA has five specialized versions of the match packet. A so-called short repetition is a match where the length is one and the distance equals the distance of the previous match, so neither length nor distance needs to be encoded. There are also four separate "long repetition" packets where the distance is equal to the distance in one of the four previous matches. A prefix code is used to tell the packet types apart.

LZMA uses a range encoder with a highly complex model to compress the packets as a bit stream. The type prefix, literal value, distance, and length are each compressed with a separate model.

For finding matches, LZMA maintains an array of binary search trees containing every substring in the current dictionary (which is some suffix of the already compressed data). This array is indexed by taking a hash of the first few (2, 3, or 4 depending on settings) characters of a substring. The binary search trees are based on lexicographical ordering, with the additional constraint that substrings with a higher index in the dictionary are at the top of the tree. This is useful because LZMA uses a sliding window to keep its dictionary size constant, which can be implemented without rebalancing the binary trees as it involves removing only leaf nodes. LZMA updates these data structures after every new byte read from the data to be compressed.
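
A much simplified match finder in the same spirit, using plain lists of positions instead of binary search trees, might look like this (the 3-byte hash key and the window size are illustrative choices):

    # Simplified match finder: positions are indexed by their first three bytes;
    # LZMA instead keeps a binary search tree per hash bucket and a sliding window.
    from collections import defaultdict

    def find_match(data, pos, index, max_dist=1 << 16):
        key = data[pos:pos + 3]
        best_d, best_l = 0, 0
        for cand in reversed(index.get(key, [])):        # most recent candidates first
            if pos - cand > max_dist:
                break                                    # older candidates are even farther away
            l = 0
            while pos + l < len(data) and data[cand + l] == data[pos + l]:
                l += 1
            if l > best_l:
                best_d, best_l = pos - cand, l
        return best_d, best_l

    def update_index(data, pos, index):
        index[data[pos:pos + 3]].append(pos)             # done after every byte consumed

    # usage: index = defaultdict(list), then call update_index at every position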

A greedy algorithm that maximizes the match length is optimal when considering only fixed-length encodings of the matches, but this does not hold for other encodings in general. Specifically, finding an optimal parsing when using Huffman codes (as in gzip) or range encoding (as in lzma) is much more difficult. The use of the short and long repetition packets in LZMA makes finding a globally optimal parsing even more infeasible. Instead, LZMA uses a dynamic programming algorithm combined with specialized heuristics to explore some subset of possible future parsings and selects a good match based on them. Unfortunately it seems that a complete description of this process exists only in the source code.

Bzip2

Bzip2 is a compression algorithm based on the Burrows-Wheeler transform. Like many compression algorithms, it splits the data to be compressed into smaller (at most 900 kB) blocks to lower compression time and memory requirements. Each block is then compressed using the following steps: an initial run-length encoding, the Burrows-Wheeler transform, a move-to-front transform, a second run-length encoding of the resulting runs of zeros, and finally Huffman coding.
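
The first transform stages can be illustrated with a naive (quadratic-time) Python sketch; the run-length and Huffman coding stages are omitted here:

    # Toy Burrows-Wheeler transform and move-to-front encoding. bzip2 applies
    # these per block, together with the run-length and Huffman coding stages.
    def bwt(s, sentinel="\x00"):
        s += sentinel                                    # unique end marker fixes the row order
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(row[-1] for row in rotations)

    def move_to_front(s):
        alphabet = sorted(set(s))
        out = []
        for c in s:
            i = alphabet.index(c)
            out.append(i)                                # recently seen symbols get small indices
            alphabet.insert(0, alphabet.pop(i))
        return out

    # Repetitive input becomes long runs of small numbers, which the later
    # run-length and Huffman stages compress well.
    print(move_to_front(bwt("ananassana")))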

Datasets

curve1.dat

The file contains 2000 decimal values in ASCII format, separated by newlines. When the values are plotted sequentially, the data appears to be sampled from some sinusoidal function.

Given the nature of the data, our first idea was to guess the generating function and use a fitting tool (the curve_fit function of scipy) to find suitable parameters. After a short while it became clear that the micro level dominates the macro level: the differences between successive data points were approximately as large as the random noise around the underlying function. Thus even with a perfect macro-level fit, the residuals would have the same range as the values obtained by simply delta encoding the original data. Our compressor maps the differences between successive values to positive integers and encodes each one as two bytes. This was slightly improved by storing the least significant byte of each value first, without any compression, and then using lzma to compress the most significant bytes.
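
In outline, the encoder for this dataset looks like the sketch below; the zigzag-style mapping and two-byte framing shown here are a simplification of what our script actually does, and the mapped deltas are assumed to fit in two bytes:

    # Sketch of the curve1.dat encoder: delta encode, map deltas to small
    # non-negative integers, store the low bytes raw and lzma-compress the high bytes.
    import lzma

    def encode(values):                                  # values: integer samples
        deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
        mapped = [2 * d if d >= 0 else -2 * d - 1 for d in deltas]   # zigzag mapping
        low = bytes(m & 0xFF for m in mapped)            # noise-like, left uncompressed
        high = bytes(m >> 8 for m in mapped)             # structured, compresses well
        return low + lzma.compress(high)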

paleo.csv

This large spreadsheet of 4129 lines contains anonymized real-world paleontological data with (nearly) duplicate columns, aggregate columns, missing values, and even division-by-zero errors from Excel. The imperfect nature of the data meant that some compression ideas were made impractical by the need to replicate the same errors and omitted values in the decoding process. The data contained only one column consisting of text, so we started by separating it from the numerical columns and compressing it with lzma.

For the numerical columns we had two strategies. Many of the values in the aggregate columns labeled "ww/dm", "ww/wh" and "uw/dm" could be calculated from the respective "ww", "dm", "wh", and "uw" columns. However, some values in the aggregate columns did not follow the same pattern, so we started by storing the difference between the value in the data and the value obtained by recalculating the aggregate. The problem with this was that the differences had the same 9 decimals of accuracy as the original values. It therefore turned out to be better to replace the aggregate values with a special symbol "-" whenever they could be recalculated and to leave them unchanged otherwise. The spreadsheet also contained two columns labeled "max dm" with identical integer values on many rows, so the second of these columns was stored as a difference to the first.
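
A check along the following lines captures the idea; the exact decimal formatting and the handling of missing values are simplified compared to our actual script:

    # If the stored aggregate (e.g. "ww/dm") can be reproduced from its source
    # columns at the precision used in the file, replace it with "-" so the
    # decompressor recomputes it; otherwise keep the original text untouched.
    def reduce_aggregate(numerator, denominator, stored):    # all CSV fields as strings
        try:
            recomputed = "{:.9f}".format(float(numerator) / float(denominator))
        except (ValueError, ZeroDivisionError):              # missing values, Excel errors
            return stored
        return "-" if recomputed == stored else stored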

The last step was to compress the data with bzip2. Using ";" as a separator seemed natural, but " " gave better compression. We believe this is caused by the Burrows-Wheeler transform in bzip2, as " ", unlike ";", comes before the digits in ASCII order. We were also able to improve the compression ratio slightly by changing the order of the columns.

Note that we compressed the ASCII representation of the data, not the byte representation of the floating point numbers. We tried storing some of the smaller values as integers in a single byte, but after bzip2 this gave worse compression. In ASCII the data has only 14 unique symbols and thus compresses well. (Back-of-the-envelope calculation: with 14 symbols, Huffman coding gives codes of roughly 4 bits, so 7 digits or fewer beat a 32-bit float.)

ty.txt

The file contains 10000 decimal values in ASCII format. When plotted, the data appeared to contain more densely sampled values from the same generating function as curve1.dat. The values appear to have roughly the same amount of noise as well, so we ended up using the same approach as with the first dataset. The data has some significant outliers that we hardcoded into the decompressor. After this, the delta-encoded values were small enough that we could assign a 15-bit codeword to each one.

As with the first dataset, we were able to improve slightly on the fixed length encoding by compressing the 8 most significant bits of each value with lzma and storing the 7 least significant bits as is. This works because any regularity present in the values exists in the most significant bits (lzma was able to compress them by roughly 700 bytes), while the least significant bits consist almost entirely of noise and thus cannot be compressed.

caravan.dat

This data consisted of 5822 rows and 85 columns of small integer values. Side data was also available but was not used due to the small amount of information it provided (5822 bits, 728 bytes).

We noted that 5 of the columns contained only binary values and merged them into 2 columns of data. In addition, there were 6 columns whose smallest value was 1 instead of 0, so the values in those columns were decreased by one to reduce the entropy somewhat. As an intermediate step, we reordered the columns of the data. Each value was then stored in a single byte and the separators and newlines were removed. This data was then packed with lzma as a final step.
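
One possible way to perform the merge is shown below; how the five bits were actually grouped into the two columns is an implementation detail, so the grouping here is only illustrative:

    # Pack five binary columns into two small-integer columns (3 bits + 2 bits).
    def merge_binary_columns(row, binary_cols):
        bits = [row[c] for c in binary_cols]             # five 0/1 values
        packed_a = bits[0] * 4 + bits[1] * 2 + bits[2]   # value in 0..7
        packed_b = bits[3] * 2 + bits[4]                 # value in 0..3
        return packed_a, packed_b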

We used a greedy algorithm to find a permutation of columns which produced a good compression ratio with lzma. This resulted in a significant improvement of several thousand bytes in the final result.
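
The search can be written as a straightforward greedy loop; the sketch below recompresses the partial serialization with lzma at every step, which is slow but acceptable as an offline preprocessing step. The exact objective and serialization we used may differ slightly:

    # Greedily build a column order: at each step append the column whose
    # addition yields the smallest lzma-compressed size of the serialized data.
    import lzma

    def serialize(rows, order):
        # one byte per value, columns in the given order, no separators
        return bytes(row[i] for row in rows for i in order)

    def greedy_column_order(rows):
        remaining = list(range(len(rows[0])))
        order = []
        while remaining:
            best = min(remaining,
                       key=lambda i: len(lzma.compress(serialize(rows, order + [i]))))
            order.append(best)
            remaining.remove(best)
        return order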

final.dat

This dataset appears to consist of 20000 (x,y,z) coordinate triples, with the z values as the data to be compressed and the x and y values given as side information. Plotting the data, we see that it forms a surface with a prominent fold in it. The points are seemingly not given in any nice order, so delta encoding would not produce good results here. We decided to try fitting once more.

In order to fit a reasonably simple function to the data, we first had to 'unfold' it. We searched for a plane that separated the upper and lower halves of the data. Then we simply mirrored the y coordinates of the points under the plane, with a constant offset, resulting in a surface that was uniform enough. We then fitted a function of the form a + b*x + c*y + d*x**2 + e*y**2 + f*x*y + g*x**3 + h*y**3 + i*(x+y)**2 + k*(x-y)**3 + l*(x+y)**3 + m*2**(x-y) + j*3**(x-y) to the data. This gave us a close enough fit to represent each difference between the actual and modeled value with 13 bits.
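
The fitting step itself is a direct application of scipy's curve_fit to the model above. The sketch below assumes the y coordinates have already been unfolded and that x, y, z are numpy arrays:

    # Fit the surface model to the unfolded data and compute the residuals
    # that are then stored (13 bits each).
    from scipy.optimize import curve_fit

    def model(xy, a, b, c, d, e, f, g, h, i, k, l, m, j):
        x, y = xy
        return (a + b*x + c*y + d*x**2 + e*y**2 + f*x*y + g*x**3 + h*y**3
                + i*(x+y)**2 + k*(x-y)**3 + l*(x+y)**3 + m*2**(x-y) + j*3**(x-y))

    # x, y, z: numpy arrays of the (unfolded) coordinates from final.dat
    # params, _ = curve_fit(model, (x, y), z)
    # residuals = z - model((x, y), *params)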

Just storing the differences between the data and the fitted function was not enough: our method of straightening the surface required the z coordinate, which the decompressor does not have. We had to use an extra bit for each stored value to tell the decompressor whether or not to mirror the y coordinate. The extra bit was added to the 13-bit encoding for a total of 14 bits per value. Instead of encoding the 14-bit values sequentially, we compressed the most significant byte of each value with lzma and stored the remaining 6 least significant bits separately. This gave a better final result than simply compressing the 14-bit values with lzma, because it both aligns the data given to lzma on byte boundaries and does not attempt to compress the most random bits of each value.

Comparisons

We compared the compressed size of each data file produced by our compressor (Bucket) to the sizes produced by the general-purpose compressors we used. All sizes are in bytes.

               Bucket    Bzip2     LZMA
curve1.dat       3905     4798     5643
paleo.csv       65161   102430   108971
ty.txt          18025    22131    24815
caravan.dat     46305    59083    58339
final.dat       32628    42660    51555