The compression scheme we ultimately decided on uses the xz program to apply LZMA compression, sometimes after preprocessing the data. XZ uses LZMA2, a modified version of Igor Pavlov's LZMA algorithm; in LZMA2, the cost of attempting to compress incompressible data is not especially high.
The decompression program is distributed as compressed C source code; it is decompressed and compiled by the included shell script. Compression is done with code written in C++.
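As a rough illustration of this flow (not the actual project code), a C++ preprocessor can hand its output straight to the xz program. The use of POSIX popen and the -9e preset here are assumptions.

    #include <cstdio>
    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Pipe a preprocessed byte buffer through the xz command-line tool.
    // With no file operands, xz reads the data from stdin and writes the
    // compressed stream to stdout, which the shell redirects to out_path.
    void compress_with_xz(const std::vector<uint8_t>& data, const char* out_path) {
        std::string cmd = std::string("xz -9e > ") + out_path;
        FILE* pipe = popen(cmd.c_str(), "w");
        if (!pipe) throw std::runtime_error("could not start xz");
        fwrite(data.data(), 1, data.size(), pipe);
        if (pclose(pipe) != 0) throw std::runtime_error("xz failed");
    }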
No real attempt was made to deal with penalty data, which seemed inscrutable.
The Lempel-Ziv-Markov chain algorithm (LZMA) is a lossless compression algorithm. It is a variation of LZ77 that also takes repeated match distances into account. A dictionary is used to encode matches, producing a stream of literals and references, which is then compressed using a range encoder. LZMA compresses data using contexts relative to the current bit: bit i may be expressed in terms of bit i - offset, so that only the value of offset has to be stored. LZMA is effectively implemented as a small language of packet codes for literals, matches, and repeated-match references.
A range encoder can be seen as an arithmetic encoder that, instead of using the [0,1) range, uses a range of integers.
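To make this concrete, below is a minimal sketch of an adaptive binary range coder in the LZMA style (11-bit probabilities, a 32-bit integer range, and carry handling through a cached byte). It illustrates the technique only; it is not the coder used inside xz.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Adaptive binary range coder: an 11-bit probability (0..2047) estimates
    // how likely the next bit is to be 0, and the current integer range is
    // split proportionally to that estimate.
    struct RangeEncoder {
        std::vector<uint8_t> out;
        uint64_t low = 0;
        uint32_t range = 0xFFFFFFFF;
        uint8_t cache = 0;
        uint64_t cacheSize = 1;

        void shiftLow() {
            if ((uint32_t)low < 0xFF000000u || (low >> 32) != 0) {
                uint8_t carry = (uint8_t)(low >> 32);
                uint8_t temp = cache;
                do {
                    out.push_back((uint8_t)(temp + carry));  // carry propagates through pending 0xFF bytes
                    temp = 0xFF;
                } while (--cacheSize != 0);
                cache = (uint8_t)(low >> 24);
            }
            ++cacheSize;
            low = (low & 0x00FFFFFFu) << 8;
        }

        void encodeBit(uint16_t& prob, int bit) {
            uint32_t bound = (range >> 11) * prob;        // split point for "bit = 0"
            if (bit == 0) { range = bound;                prob += (2048 - prob) >> 5; }
            else          { low += bound; range -= bound; prob -= prob >> 5; }
            while (range < (1u << 24)) { range <<= 8; shiftLow(); }  // renormalize
        }

        void flush() { for (int i = 0; i < 5; ++i) shiftLow(); }
    };

    struct RangeDecoder {
        const std::vector<uint8_t>& in;
        size_t pos = 1;                 // the first byte the encoder emits is always 0
        uint32_t range = 0xFFFFFFFF;
        uint32_t code = 0;

        explicit RangeDecoder(const std::vector<uint8_t>& data) : in(data) {
            for (int i = 0; i < 4; ++i) code = (code << 8) | next();
        }
        uint8_t next() { return pos < in.size() ? in[pos++] : 0; }

        int decodeBit(uint16_t& prob) {
            uint32_t bound = (range >> 11) * prob;
            int bit;
            if (code < bound) { range = bound;                prob += (2048 - prob) >> 5; bit = 0; }
            else              { code -= bound; range -= bound; prob -= prob >> 5;         bit = 1; }
            while (range < (1u << 24)) { range <<= 8; code = (code << 8) | next(); }
            return bit;
        }
    };

    int main() {
        const int bits[] = {1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0};
        uint16_t p = 1024;              // start with "0 and 1 equally likely"
        RangeEncoder enc;
        for (int b : bits) enc.encodeBit(p, b);
        enc.flush();

        uint16_t q = 1024;
        RangeDecoder dec(enc.out);
        bool ok = true;
        for (int b : bits) ok &= (dec.decodeBit(q) == b);
        std::printf("%zu bytes, round-trip %s\n", enc.out.size(), ok ? "ok" : "FAILED");
    }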
curve1.dat is clearly based on a sine wave whose frequency appears to change over the course of the data; however, the data was too noisy for the wave to be recalculated and the values corrected. Although the values in curve1.dat range from -1181.72 to 517.11, the finite differences of the data range only from -310.79 to 243.03 (both of which fit in a 16-bit integer after multiplying all values by 100). In this way, the data in curve1.dat can be stored in 4000 bytes.
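A sketch of this difference coding follows. The struct layout, the choice of a signed 16-bit type (the actual format may bias the differences into an unsigned range), and storing the first sample in full are assumptions.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Delta-encode a series as described for curve1.dat: scale by 100, round
    // to integers, and store consecutive differences as 16-bit values.
    struct DeltaEncoded {
        int32_t first;                 // first scaled sample, stored in full
        std::vector<int16_t> deltas;   // scaled differences, each fits in 16 bits
    };

    DeltaEncoded delta_encode(const std::vector<double>& samples) {
        DeltaEncoded out;
        out.first = (int32_t)std::lround(samples[0] * 100.0);
        int32_t prev = out.first;
        for (size_t i = 1; i < samples.size(); ++i) {
            int32_t cur = (int32_t)std::lround(samples[i] * 100.0);
            out.deltas.push_back((int16_t)(cur - prev));   // -31079..24303 fits
            prev = cur;
        }
        return out;
    }

    std::vector<double> delta_decode(const DeltaEncoded& enc) {
        std::vector<double> out;
        int32_t cur = enc.first;
        out.push_back(cur / 100.0);
        for (int16_t d : enc.deltas) {
            cur += d;                  // undo the differencing exactly
            out.push_back(cur / 100.0);
        }
        return out;
    }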
ty.txt was handled similarly to curve1.dat, by taking the discrete differences and multiplying by 100. However, some values did not fit in that integer range and had to be stored separately in a header at the beginning of the file. The decimal digits are also formatted somewhat differently.
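The escape-to-header mechanism might look like the following sketch; the escape value and header layout are assumptions, not the actual file format.

    #include <cstdint>
    #include <vector>

    // Hypothetical layout: differences that fit in int16 are stored inline;
    // the escape value INT16_MIN means "take the next entry from the
    // exceptions table stored in the file header".
    constexpr int16_t kEscape = INT16_MIN;

    struct TyEncoded {
        std::vector<int32_t> exceptions;  // header: out-of-range deltas, in order
        std::vector<int16_t> deltas;      // body: inline deltas or kEscape markers
    };

    TyEncoded encode_deltas(const std::vector<int32_t>& scaled_deltas) {
        TyEncoded out;
        for (int32_t d : scaled_deltas) {
            if (d > INT16_MIN && d <= INT16_MAX) {
                out.deltas.push_back((int16_t)d);
            } else {
                out.deltas.push_back(kEscape);
                out.exceptions.push_back(d);
            }
        }
        return out;
    }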
paleo.csv required extensive preprocessing. All Finnish names in the first column are gathered and stored consecutively at the beginning of the file. Non-integer numbers in the data are written not as floats but as strings, exactly as they appear in the text (these seem to compress better).
Each record (row) begins with a 16-bit set of flags; the flags store information about which values are stored in the record, and whether certain values can be reconstructed from certain other values. For example, the ww/dm column could often be reconstructed by dividing the ww value by the dm value, though this was not always the case. Unfortunately, no such method was found to derive the WER data.
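A sketch of how such a flag word can be used follows; the specific bit assignments are assumptions, not the actual layout.

    #include <cstdint>

    // Hypothetical bit layout for the 16-bit per-record flag word.
    // Low bits: "this column has a stored value"; high bits: "this column can
    // be derived from other columns and is therefore omitted".
    enum RecordFlags : uint16_t {
        HAS_DM        = 1 << 0,
        HAS_WW        = 1 << 1,
        HAS_WW_DM     = 1 << 2,   // ww/dm stored explicitly
        HAS_WER       = 1 << 3,
        DERIVE_WW_DM  = 1 << 8,   // ww/dm can be recomputed as ww / dm
    };

    // Reconstruct the ww/dm column when the flags say it is derivable.
    double ww_dm_value(uint16_t flags, double ww, double dm, double stored_ww_dm) {
        if (flags & DERIVE_WW_DM)
            return ww / dm;       // recompute instead of storing it
        return stored_ww_dm;      // otherwise fall back to the stored value
    }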
The supplemental data accompanying final.dat correlated with the main data with correlation coefficients of 0.4249514 and -0.396132, respectively. Using R's linear model fitting, a model was constructed that fit the data with a correlation of 0.5811984.
Using this model, the main data could be reduced to residuals, each of which fits in 16 bits.
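A sketch of this residual coding follows; the model coefficients are placeholders for the ones produced by R's lm(), and the scaling factor is an assumption.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Store only the residual of a linear model y ~ a + b*x1 + c*x2,
    // where x1 and x2 are the supplemental columns.
    struct LinearModel {
        double a, b, c;           // placeholder coefficients from the R fit
        double predict(double x1, double x2) const { return a + b * x1 + c * x2; }
    };

    // Residuals are scaled and rounded so that each one fits in 16 bits
    // (assuming the original data is exact at the chosen scale).
    std::vector<int16_t> encode_residuals(const LinearModel& m,
                                          const std::vector<double>& y,
                                          const std::vector<double>& x1,
                                          const std::vector<double>& x2,
                                          double scale) {
        std::vector<int16_t> out;
        for (size_t i = 0; i < y.size(); ++i) {
            double r = y[i] - m.predict(x1[i], x2[i]);
            out.push_back((int16_t)std::lround(r * scale));
        }
        return out;
    }

    double decode_entry(const LinearModel& m, int16_t r,
                        double x1, double x2, double scale) {
        return m.predict(x1, x2) + r / scale;
    }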
In testing, no gain was achieved from rearranging the data, using sparse storage formats, and so on, once XZ compression was applied (XZ does particularly well on this data set, reaching 5% compression). The supplemental data consisted of a single column and was not useful for compression.
Although using low-level languages (C, C++) to manipulate the data allowed for easy bit-twiddling and casting tricks, it would have been wiser to use a higher-level language in which debugging is less painful and which has more resources for statistics and mathematics. Most of the time spent on this project went to hunting down strange memory bugs or dealing with missing mathematical functionality in the C standard library.
Using Python or Ruby would have led to much more rapid development, at a probable cost in code size.
Llorenç helped with some compression ideas, writing the report, and the group name. Max did the coding.
Source is available here.