582651 Project in Information-Theoretic
Modeling (2 cr)
|
|
Compress data. We will give the files to compress, your task is to squeeze them into as small as possible. The catch is that we also count the size of the decompressor program into the total score. |
Your task is to design, document, and implement a compressor and a decompressor program. All imaginable lossless compression methods are allowed. The compression rate of the program will be evaluated with a given set of data files. However, scoring will not be based solely on compression rate. Instead the size of the decompressor program is added to the size of the compressed files. Using too complex a decompressor program (e.g., a la style ''print "data...";'') will not generally be a good idea.
As a special feature, in some cases both the compressor and the decompressor programs have access to side information related to the file to be compressed. Using the side information is not obligatory, but it makes compression significantly easier, thus reducing the size of the compressed file.
Your documentation, due December 11, should describe
the ideas/observations/principles underlying the compressor and the
decompressor algorithms. A complete project is a tar-file (all links
should be relative and all files in the directory returned)
including
» high-level description (in English) of the algorithms,
» program source for both compressor and
decompressor programs.
The documentation can be delivered as a separate tar-file that can be extracted in the same directory as the programs.
You can use any programming language you like, as long as we can run your program on the CS department Linux workstations. You can deliver source code or binaries or a combination of both — as you find best — but it must be possible to run the whole think using the script decompress.sh, as described in the project template.
The form of the report is a WEB-page (the HTML template is included in the project template).
» Download the project template
» Download extra Data for 2nd Round
» Download Updated Project Template with new Data for 3rd Round
» Download extra Data for 3rd Round
If you run into problems, email questions to both the instructor and the teaching assistant.
The performance of your algorithm will be evaluated with the given set of data files. The loss-function is given by:
where N is the number of data files, is the size of the decompressor program, and is the size of the compressed version of the ith file.
The competition is iterative, meaning that there are several deadlines. After each deadline your success is evaluated and scores are published. After each round you are also allowed to submit your own data which will be used in the competition on the remaining rounds. Each group is allowed a variable amount of data, the amount depending on your success on earlier rounds. Because you can probably compress your own data better than other groups it pays off to try to do well already on the first rounds. Project points will be based only on the final round scores though. Details of the data submission procedure are annouced later.
November 3 First meeting (C222); participation obligatory.
November 16 (12 o'clock noon) First deadline. November 23 (12 o'clock noon) Second deadline. November 30 (12 o'clock noon) Third deadline. December 7 (12 o'clock noon) Final code deadline. December 17 (12 o'clock noon) Project Report deadline. |
University of Helsinki | Department of Computer Science | Helsinki Institute for Information Technology |