We used bzip2 and lzma to compress everything (data and decompressor) after applying the "tricks" described below. Bzip2 and lzma give better compression than zip and gzip, so that is what we used. The choice between lzma and bzip2 for individual files was made by comparing the sizes of the compressed files: whichever gave the smaller file was used.
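A minimal sketch of the per-file choice, assuming Python's standard bz2 and lzma modules (the actual scripts may differ; the file name is just an example):

    import bz2
    import lzma

    def smallest_compression(data: bytes):
        """Compress with both bzip2 and lzma and keep whichever is smaller."""
        candidates = {
            "bz2": bz2.compress(data, compresslevel=9),
            "lzma": lzma.compress(data, preset=9),
        }
        # Pick the method that produced the fewest bytes.
        method = min(candidates, key=lambda name: len(candidates[name]))
        return method, candidates[method]

    if __name__ == "__main__":
        with open("curve1.dat", "rb") as f:   # hypothetical input file
            method, blob = smallest_compression(f.read())
        print(method, len(blob))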
Bzip2
Bzip2 compression applies a stack of nine techniques one after the other, including run-length encoding, the Burrows-Wheeler transform, a move-to-front transform and Huffman coding.
Lzma
Lzma (Lempel-Ziv-Markov chain algorithm) is an improved version of the Lempel-Ziv (LZ77) algorithm. It uses range coding (which is essentially the same thing as arithmetic coding, except that it operates on ranges of integers rather than the interval [0,1)) to encode LZ77 sequences.
The LZ77 algorithm compresses data by finding parts that have occurred earlier in the data and replacing them with references to data already passed through the encoder. These references are so-called length-offset pairs. When the decoder encounters such a pair, it knows that the next length characters of the original data can be generated by going back offset characters in the already decompressed output (for each character separately, i.e. the character at position i is found by looking at the character at position i - offset).
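A minimal sketch of that decoding step (the token format here is only for illustration and is not lzma's actual bitstream):

    def lz77_decode(tokens):
        """Decode a list of (offset, length, literal) tokens into text.

        offset/length == 0 means "no back-reference, just emit the literal".
        Copying one character at a time is what makes overlapping
        references (offset < length) work.
        """
        out = []
        for offset, length, literal in tokens:
            for _ in range(length):
                out.append(out[-offset])   # character at position i - offset
            if literal is not None:
                out.append(literal)
        return "".join(out)

    # "abcabcabcd": the last token copies 6 characters starting 3 back
    # (an overlapping reference).
    print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (3, 6, "d")]))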
Both curves were compressed by first fitting them as well as we could, using the side data as x-coordinates and the data as y-coordinates. Curve 1 was fitted in two parts: when x was greater than 0.8 we used a linear function, and otherwise a fifth-order polynomial. Curve 2 was fitted in three parts: when x was greater than 0.2 we used a fourth-order polynomial, and otherwise two different second-order polynomials.
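A sketch of the piecewise fit for curve 1 under the split described above (numpy is an assumption here; the actual fitting code may look different):

    import numpy as np

    def fit_curve1(x, y):
        """Fit curve 1 in two parts: linear for x > 0.8, degree 5 elsewhere."""
        hi = x > 0.8
        coef_hi = np.polyfit(x[hi], y[hi], deg=1)    # linear part
        coef_lo = np.polyfit(x[~hi], y[~hi], deg=5)  # fifth-order part
        return coef_hi, coef_lo

    def predict_curve1(x, coef_hi, coef_lo):
        """Evaluate the piecewise model at the side-data x coordinates."""
        return np.where(x > 0.8,
                        np.polyval(coef_hi, x),
                        np.polyval(coef_lo, x))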
After fitting the curves, we calculated the difference between the fitted curve and each actual data point. This resulted in much smaller numbers. We then multiplied these numbers by ten and removed the unnecessary '.0' from the end to shorten them further. The result was saved as the compressed file and also bzipped to make it even smaller. Decompression was easy: the process was simply reversed.
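A sketch of the residual encoding and its reversal (this assumes the scaled residuals really are whole numbers, i.e. one decimal of precision in the data, as the removed '.0' suggests):

    import bz2
    import numpy as np

    def encode_residuals(y, y_fit):
        """Scale residuals by 10, store them as integer text, then bzip2."""
        residuals = np.round((y - y_fit) * 10).astype(int)
        text = "\n".join(str(r) for r in residuals)   # no trailing '.0'
        return bz2.compress(text.encode("ascii"))

    def decode_residuals(blob, y_fit):
        """Reverse the process: parse the integers, divide by 10, add the fit."""
        lines = bz2.decompress(blob).decode("ascii").split("\n")
        return y_fit + np.array([int(r) for r in lines]) / 10.0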
The mushroom data was probably the easiest to compress. We were able to find a few simple rules that classified the mushrooms solely on the basis of the side data. We achieved this by simply printing the conditional distributions (i.e. the breakdown of different property values for both poisonous (p) and edible (e) mushrooms) for each property. This very quickly revealed some properties that separated a good part of the cases quite nicely. Those samples were then removed from the data and the process was repeated a couple of times; on each round, a significant amount of samples could be dropped, until there were none left.
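The printout itself is a one-liner per property; a sketch assuming the side data sits in a CSV with a 'class' column (p/e) and one column per property (file and column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("mushroom_side_data.csv")   # hypothetical layout

    for prop in df.columns.drop("class"):
        # Rows: property values, columns: e/p counts. If one column is all
        # zeros for some value, that value alone decides edibility.
        print(prop)
        print(pd.crosstab(df[prop], df["class"]))
        print()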
The compression process was thus trivial: print out 0 bytes. Decompression was done simply by applying the found rule to deduce the values from the side data. The exact rule can be found in the source code.
Our very first attempt was to use the same approach as with the mushroom data. This soon proved to be a dead end: the conditional distributions overlapped too much, yielding rather useless rules (some clear outliers were found, but those sets were too small to be of real use). We then tried clustering the letters, so that we would have a smaller set of 16-dimensional parameter vectors that could be used to recognize the letters with a nearest-neighbour algorithm. This was nowhere near accurate enough, since many letters were mixed up (i looks like j and l, etc.), so we dropped this approach.
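A sketch of that clustering idea (scikit-learn is an assumption; X is the 16-column feature matrix, y the letter labels, and the cluster count is illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    def letter_prototypes(X, y, clusters_per_letter=4):
        """Build a small set of 16-dimensional prototype vectors per letter."""
        prototypes, labels = [], []
        for letter in np.unique(y):
            km = KMeans(n_clusters=clusters_per_letter, n_init=10).fit(X[y == letter])
            prototypes.append(km.cluster_centers_)
            labels.extend([letter] * clusters_per_letter)
        return np.vstack(prototypes), np.array(labels)

    def predict_nearest(X, prototypes, labels):
        """Classify each sample by its nearest prototype."""
        dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        return labels[np.argmin(dists, axis=1)]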
A better approach, which seemed like it should work, was classification. We tried almost every classification algorithm available, but still were not able to get any good results out of them. Most classification algorithms, for example the Fisher discriminant, nearest neighbour and the perceptron, simply did not give anywhere near good enough accuracy, and coding the exceptions would have taken too much space to be useful. Support vector machines, on the other hand, were very accurate when used with a Gaussian kernel, but the problem was severe overfitting: the overfitted model would have taken much more space to store than the original file. We ran into the same overfitting problem with tree classification. A hyperpipe classifier was also attempted; it constructed a lightweight model, but since many of the letters were alike, its accuracy was very poor.
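A sketch of how such a comparison can be run, reusing the X and y above (scikit-learn again; the hyperpipe classifier is omitted since it has no scikit-learn counterpart):

    from sklearn.model_selection import cross_val_score
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import Perceptron
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    classifiers = {
        "Fisher/LDA": LinearDiscriminantAnalysis(),
        "1-NN": KNeighborsClassifier(n_neighbors=1),
        "perceptron": Perceptron(),
        "SVM (Gaussian kernel)": SVC(kernel="rbf"),
        "decision tree": DecisionTreeClassifier(),
    }

    for name, clf in classifiers.items():
        # Cross-validated accuracy; training accuracy alone hides the
        # overfitting that sank the SVM and tree models here.
        acc = cross_val_score(clf, X, y, cv=5).mean()
        print(f"{name}: {acc:.3f}")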
As a last resort, we used a classification tree to obtain a rule that would get rid of at least some of the letters and hopefully make the distribution a bit less uniform, in order to achieve better bzip2 compression. We took the biggest classes the tree produced. The tree was constructed in an exhaustive manner: every leaf consisted of a single class, but the tree was also very big, so using the whole tree or even a significant part of it was not an option; some letters were just too close to each other. Taking the few distinctive big classes from the tree gave us a rule that shaved off about 15-20% of the letters, so it was mostly a symbolic gesture rather than good compression. Again, the exact rule can be found in the source code.
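A sketch of extracting the biggest pure leaves (scikit-learn again; the "top 5" cut-off is arbitrary, and the 15-20% figure above comes from our actual rule, not from this sketch):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Grow the tree exhaustively so that leaves end up pure (a single letter).
    tree = DecisionTreeClassifier().fit(X, y)
    leaf_of_sample = tree.apply(X)              # leaf index for each sample

    # Rank leaves by how many samples they capture and keep only the biggest;
    # each kept leaf corresponds to one simple rule on the side data.
    leaves, counts = np.unique(leaf_of_sample, return_counts=True)
    biggest = leaves[np.argsort(counts)[::-1][:5]]
    covered = np.isin(leaf_of_sample, biggest).mean()
    print(f"largest leaves cover {covered:.0%} of the letters")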
We just used lzma to compress these files, since it gave us very good compression. The compressed file size was so small that we directed our efforts to the bigger files, where more gains could be made.
There were some attempts to use the same techniques we had tried with the letter data, but none of them seemed to work very well. We also tried using the binary parameters of the side data to filter the possibilities before applying classifiers, but we got nowhere with that.
Luckily, this data had fewer distinct symbols and a less uniform distribution than the letter data, so it compressed reasonably well with bzip2.
We assumed these were tricky enough and didn't even bother.
Most of our team's time was spent trying to find a good compression method for the letter data. Because of this, we were not able to put much effort into the forest data (although solving the problems we had with the letters would probably have helped us compress the forest data better, too). We could also have tried arithmetic (or range) coding as our "final" compression method instead of lzma/bzip2 to squeeze out some extra bytes.