We used bzip2 and lzma to compress everything (data and decompressor) after applying the "tricks" described below. Bzip2 and lzma give better compression than zip and gzip, so that is what we used. The choice between lzma and bzip2 for individual files was made by comparing the sizes of the compressed files: whichever gave the smaller file was used.
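A minimal sketch of the per-file choice, assuming Python's standard bz2 and lzma modules (the actual scripts may differ; the file name is just an example):

    import bz2
    import lzma

    def smallest_compression(data: bytes):
        """Compress with both bzip2 and lzma and keep whichever is smaller."""
        candidates = {
            "bz2": bz2.compress(data, compresslevel=9),
            "lzma": lzma.compress(data, preset=9),
        }
        # Pick the method that produced the fewest bytes.
        method = min(candidates, key=lambda name: len(candidates[name]))
        return method, candidates[method]

    if __name__ == "__main__":
        with open("curve1.dat", "rb") as f:   # hypothetical input file
            method, blob = smallest_compression(f.read())
        print(method, len(blob))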
Bzip2
Bzip2 compression applies a stack of nine techniques one after the other, including run-length encoding, the Burrows-Wheeler transform, a move-to-front transform and Huffman coding.
Lzma
Lzma (Lempel-Ziv-Markov chain algorithm) is an improved version of the Lempel-Ziv (LZ77) algorithm. It uses range coding (which is essentially the same thing as arithmetic coding, except that it operates on ranges of integers rather than the interval [0,1)) to encode LZ77 sequences.
The LZ77 algorithm compresses data by finding parts that have occurred earlier in the data and replacing them with references to data already passed through the encoder. These references are so-called length-offset pairs. When the decoder encounters such a pair, it knows that the next length characters of the original data can be generated by going back offset characters in the already decompressed output (for each character separately, i.e. the character at position i is found by looking at the character at position i - offset).
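A minimal sketch of that decoding step (the token format here is only for illustration and is not lzma's actual bitstream):

    def lz77_decode(tokens):
        """Decode a list of (offset, length, literal) tokens into text.

        offset/length == 0 means "no back-reference, just emit the literal".
        Copying one character at a time is what makes overlapping
        references (offset < length) work.
        """
        out = []
        for offset, length, literal in tokens:
            for _ in range(length):
                out.append(out[-offset])   # character at position i - offset
            if literal is not None:
                out.append(literal)
        return "".join(out)

    # "abcabcabcd": the last token copies 6 characters starting 3 back
    # (an overlapping reference).
    print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (3, 6, "d")]))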
Both curves were compressed by first fitting them as well as we could, using the side data as x-coordinates and the data as y-coordinates. Curve 1 was fitted in two parts: when x was greater than 0.8 we used a linear function, and otherwise a fifth-order polynomial. Curve 2 was fitted in three parts: when x was greater than 0.2 we used a fourth-order polynomial, and otherwise two different second-order polynomials.
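A sketch of the piecewise fit for curve 1 under the split described above (numpy is an assumption here; the actual fitting code may look different):

    import numpy as np

    def fit_curve1(x, y):
        """Fit curve 1 in two parts: linear for x > 0.8, degree 5 elsewhere."""
        hi = x > 0.8
        coef_hi = np.polyfit(x[hi], y[hi], deg=1)    # linear part
        coef_lo = np.polyfit(x[~hi], y[~hi], deg=5)  # fifth-order part
        return coef_hi, coef_lo

    def predict_curve1(x, coef_hi, coef_lo):
        """Evaluate the piecewise model at the side-data x coordinates."""
        return np.where(x > 0.8,
                        np.polyval(coef_hi, x),
                        np.polyval(coef_lo, x))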
After fitting the curves, we calculated the difference between the fitted curve and each actual data point. This resulted in much smaller numbers. We then multiplied these numbers by ten and removed the unnecessary '.0' from the end to shorten them further. The result was saved as the compressed file and also bzipped to make it even smaller. Decompression was easy: the process was simply reversed.
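A sketch of the residual encoding and its reversal (this assumes the scaled residuals really are whole numbers, i.e. one decimal of precision in the data, as the removed '.0' suggests):

    import bz2
    import numpy as np

    def encode_residuals(y, y_fit):
        """Scale residuals by 10, store them as integer text, then bzip2."""
        residuals = np.round((y - y_fit) * 10).astype(int)
        text = "\n".join(str(r) for r in residuals)   # no trailing '.0'
        return bz2.compress(text.encode("ascii"))

    def decode_residuals(blob, y_fit):
        """Reverse the process: parse the integers, divide by 10, add the fit."""
        lines = bz2.decompress(blob).decode("ascii").split("\n")
        return y_fit + np.array([int(r) for r in lines]) / 10.0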
The mushroom data was probably the easiest to compress. We were able to find a few simple rules that classified the mushrooms solely on the basis of the side data. We achieved this by simply printing the conditional distributions (i.e. the breakdown of different property values for both poisonous (p) and edible (e) mushrooms) for each property. This very quickly revealed some properties that separated a good part of the cases quite nicely. Those samples were then removed from the data and the process was repeated a couple of times; on each round, a significant amount of samples could be dropped, until there were none left.
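The printout itself is a one-liner per property; a sketch assuming the side data sits in a CSV with a 'class' column (p/e) and one column per property (file and column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("mushroom_side_data.csv")   # hypothetical layout

    for prop in df.columns.drop("class"):
        # Rows: property values, columns: e/p counts. If one column is all
        # zeros for some value, that value alone decides edibility.
        print(prop)
        print(pd.crosstab(df[prop], df["class"]))
        print()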
The compression process was thus trivial: print out 0 bytes. Decompression was done simply by applying the found rule to deduce the values from the side data. The exact rule can be found in the source code.
Our very first attempt was to use the same approach as with the mushroom data. This soon proved to be a dead end: the conditional distributions overlapped too much, yielding rather useless rules (some clear outliers were found, but those sets were too small to be of real use). We then tried clustering the letters, so that we would have a smaller set of 16-dimensional parameter vectors that could be used to recognize the letters with a nearest-neighbour algorithm. This was nowhere near accurate enough, since many letters were mixed up (i looks like j and l, etc.), so we dropped this approach.
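A sketch of that clustering idea (scikit-learn is an assumption; X is the 16-column feature matrix, y the letter labels, and the cluster count is illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    def letter_prototypes(X, y, clusters_per_letter=4):
        """Build a small set of 16-dimensional prototype vectors per letter."""
        prototypes, labels = [], []
        for letter in np.unique(y):
            km = KMeans(n_clusters=clusters_per_letter, n_init=10).fit(X[y == letter])
            prototypes.append(km.cluster_centers_)
            labels.extend([letter] * clusters_per_letter)
        return np.vstack(prototypes), np.array(labels)

    def predict_nearest(X, prototypes, labels):
        """Classify each sample by its nearest prototype."""
        dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        return labels[np.argmin(dists, axis=1)]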
A better approach, which seemed like it should work, was classification. We tried almost every classification algorithm available, but still were not able to get any good results out of them. Most classification algorithms, for example the Fisher discriminant, nearest neighbour and the perceptron, simply did not give anywhere near good enough accuracy, and coding the exceptions would have taken too much space to be useful. Support vector machines, on the other hand, were very accurate when used with a Gaussian kernel, but the problem was severe overfitting: the overfitted model would have taken much more space to store than the original file. We ran into the same overfitting problem with tree classification. A hyperpipe classifier was also attempted; it constructed a lightweight model, but since many of the letters were alike, its accuracy was very poor.
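A sketch of how such a comparison can be run, reusing the X and y above (scikit-learn again; the hyperpipe classifier is omitted since it has no scikit-learn counterpart):

    from sklearn.model_selection import cross_val_score
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import Perceptron
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    classifiers = {
        "Fisher/LDA": LinearDiscriminantAnalysis(),
        "1-NN": KNeighborsClassifier(n_neighbors=1),
        "perceptron": Perceptron(),
        "SVM (Gaussian kernel)": SVC(kernel="rbf"),
        "decision tree": DecisionTreeClassifier(),
    }

    for name, clf in classifiers.items():
        # Cross-validated accuracy; training accuracy alone hides the
        # overfitting that sank the SVM and tree models here.
        acc = cross_val_score(clf, X, y, cv=5).mean()
        print(f"{name}: {acc:.3f}")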
As a last resort, we used a classification tree to obtain a rule that would get rid of at least some of the letters and hopefully make the distribution a bit less uniform, in order to achieve better bzip2 compression. We took the biggest classes the tree produced. The tree was constructed in an exhaustive manner: every leaf consisted of a single class, but the tree was also very big, so using the whole tree or even a significant part of it was not an option; some letters were just too close to each other. Taking the few distinctive big classes from the tree gave us a rule that shaved off about 15-20% of the letters, so it was mostly a symbolic gesture rather than good compression. Again, the exact rule can be found in the source code.
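A sketch of extracting the biggest pure leaves (scikit-learn again; the "top 5" cut-off is arbitrary, and the 15-20% figure above comes from our actual rule, not from this sketch):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Grow the tree exhaustively so that leaves end up pure (a single letter).
    tree = DecisionTreeClassifier().fit(X, y)
    leaf_of_sample = tree.apply(X)              # leaf index for each sample

    # Rank leaves by how many samples they capture and keep only the biggest;
    # each kept leaf corresponds to one simple rule on the side data.
    leaves, counts = np.unique(leaf_of_sample, return_counts=True)
    biggest = leaves[np.argsort(counts)[::-1][:5]]
    covered = np.isin(leaf_of_sample, biggest).mean()
    print(f"largest leaves cover {covered:.0%} of the letters")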
We just used lzma to compress these files, since it gave us very good compression. The compressed file size was so small that we directed our efforts to the bigger files, where more gains could be made.
There were some attempts to use the same techniques we had tried with the letter data, but none of them seemed to work very well. We also tried using the binary parameters of the side data to filter the possibilities before applying classifiers, but we got nowhere with that.
Luckily, this data had fewer distinct symbols and a less uniform distribution than the letter data, so it compressed reasonably well with bzip2.
We assumed these were tricky enough and didn't even bother.
Most of our team's time was spent trying to find a good compression method for the letter data. Because of this, we were not able to put much effort into the forest data (although solving the problems we had with the letters would probably have helped us compress the forest data better, too). We could also have tried arithmetic (or range) coding as our "final" compression method instead of lzma/bzip2 to squeeze out some extra bytes.