3 Concepts: Information | Projects
Fri 12.10.
This is a two-part project. In the first part task is to design a compressor (and a decompressor) program for lossless compression of strings that are generated from a known Bernoulli distribution (you can imagine them to be coin-flips with known bias towards heads or tails). The algorithm should have a parameter indicating the bias in the data to be compressed. The parameter need not be included in the compressed file. The empirical performance of this algorithm should be analyzed with data sets sampled from Bernoulli distributions with different but known biases.
The main task in the second part is to design, document and implement a compressor (and a decompressor) program whose performance is nearly as good as that of the Part I compressor in data sets sampled from Bernoulli distributions with unknown biases. In other words, the Part II compressor does not have a 'bias parameter'. Some test data sets are included in the template package. Both adaptive, i.e., ones that improve their behavior online, or batch algorithms are acceptable. All compression must be lossless meaning that your decompression programs must be able to restore the original file exactly as it was before compression.
Your documentation should describe
the ideas/observations/principles underlying the two compressor
algorithms. A complete project is a tar-file (all links should
be relative and all files in the directory returned) including
» high-level description (in English) of the compressors,
» analysis and the results of the tests made,
» program source for both compressor and
decompressor programs,
» instructions on how to run the programs in a simple test case
The programs must compile in Linux. No fancy interface is required. The form of the report is a WEB-page (the HTML template is included in the project template).
If you run into problems, email questions to Teemu & Jukka.
In order to open the template file say 'tar xzvf
project1-template.tgz'.
As you will learn on the course, in the first part — when you are
dealing with data sets with known distribution — there is a certain
level of performance (expected code-length) that can not be beaten
even in principle. There are certain solutions that achieve this
performance up to a neglible margin.
The second part — the generating distribution is known to be within
a fixed set of distributions — is a bit more interesting. The
solution of the first part cannot be directly applied.
3 Concepts: Information |