582651 Project in Information-Theoretic Modeling (2 cr)
Compress data. We will give you the files to compress; your task is to squeeze them as small as possible. The catch is that the size of the decompressor program also counts toward the total score.
Your task is to design, document, and implement a compressor program and a decompressor program. All imaginable lossless compression methods are allowed. The compression rate will be evaluated on a given set of data files. However, scoring is not based solely on compression rate: the size of the decompressor program is added to the size of the compressed files. Hiding the data inside an overly large decompressor program (e.g., in the style of ''print "data...";'') will therefore generally not be a good idea.
As a special feature, in some cases both the compressor and the decompressor programs have access to side information related to the file to be compressed. Using the side information is not obligatory, but it makes compression significantly easier, thus reducing the size of the compressed file.
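As a toy illustration of how side information can help (the function names and the byte-wise residual scheme here are made up for this sketch, not part of the project template): a compressor might subtract the side information from the data and let a generic codec squeeze the resulting low-entropy residual.

```python
import zlib

def compress(data: bytes, side: bytes) -> bytes:
    # Subtract the side information byte-wise (assumes data and side
    # info have the same length). If the data correlates with the side
    # info, the residual is low-entropy and zlib shrinks it well.
    residual = bytes((d - s) % 256 for d, s in zip(data, side))
    return zlib.compress(residual, 9)

def decompress(blob: bytes, side: bytes) -> bytes:
    # Invert the two steps: inflate, then add the side info back.
    residual = zlib.decompress(blob)
    return bytes((r + s) % 256 for r, s in zip(residual, side))
```

With perfectly correlated side information the residual is all zeros and the compressed blob collapses to a few bytes, which is exactly the effect the side-information files are meant to enable.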
Your documentation, due December 15 (midnight), should describe
the ideas/observations/principles underlying the compressor and the
decompressor algorithms. A complete project is a tar file (all links
should be relative and all files contained in the directory you return),
including
• high-level description (in English) of the algorithms,
• program source for both compressor and
decompressor programs.
The documentation can be delivered as a separate tar-file that can be extracted in the same directory as the programs.
You can use any programming language you like, as long as we can run your program on the CS department Linux workstations. You can deliver source code, binaries, or a combination of both, as you see fit, but it must be possible to run the whole thing using the script decompress.sh, as described in the project template.
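As a hypothetical sketch of what a trivial gzip-based entry's decompress.sh might do (a real entry would substitute its own codec; for demonstration the script first creates a compressed sample file, then restores it):

```shell
#!/bin/sh
# Sketch only: assumes the compressor step was plain `gzip`.
set -e
printf 'sample payload\n' > sample.dat
gzip -f sample.dat                  # compressor step: produces sample.dat.gz
for f in *.dat.gz; do
    gunzip -f "$f"                  # decompressor step: restores sample.dat
done
```

Whatever the internals, the key requirement is that `decompress.sh` alone reproduces the original files on the department workstations.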
The form of the report is a WEB-page (the HTML template is included in the project template).
• Download the first data set (a sequence of 200 integers)
• Download the second data set (paleontological data)
This data is published here solely for the use of the participants of this course.
The first column has been "anonymized"
by injectively mapping the original (mostly Latin) words to randomly chosen Finnish words
(taken from kotus.fi;
apparently there are some… not-so-beautiful words in there as well, sorry about that).
The article by
Korn and Klug (2003)
might contain useful information at least about how some of the columns
are related to each other.
• The fourth data set is based on the
CoIL 2000
challenge. See the linked web page for plenty of useful information. Your data is based on the file
ticdata2000.txt; more specifically, the actual data file contains all columns except the last one
(class label), and the actual labels are given as side information. The rows of caravan.dat and
caravan.sdat correspond to the same people.
• Download the project template (includes the 1st data set)
• All data sets (2014-12-04).
Contains the new data set for the 5th round (note also the side information),
including the new penalty/reward data files.
To run the project, untar the package by 'tar xzf itmProject.tgz', go to the directory itmProject, and say 'make test'. See also the README file. If you run into problems, email questions to both the instructor and the teaching assistant.
The performance of your algorithm will be evaluated with the given
set of data files. The loss function is given by

    score = d + c_1 + c_2 + ... + c_N

where N is the number of data files, d is the size of the decompressor
program, and c_i is the size of the compressed version of the i-th file.
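The scoring rule can be sketched in a few lines (a hypothetical helper, not part of the project template; all sizes are in bytes):

```python
def total_score(decompressor_size: int, compressed_sizes: list[int]) -> int:
    # Loss = size of the decompressor program plus the sizes of the
    # compressed versions of all N data files; smaller is better.
    return decompressor_size + sum(compressed_sizes)

# e.g. a 1000-byte decompressor and two compressed files of 200 and 300 bytes
print(total_score(1000, [200, 300]))  # -> 1500
```

Note how the `decompressor_size` term penalizes the print-the-data trick: any data hidden in the program still counts toward the score.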
The competition is iterative, meaning that there are several deadlines. After each deadline your success is evaluated and scores are published. After each round you may also submit your own data, which will be used in the competition in the remaining rounds. Each group is allowed a varying amount of data, depending on its success in earlier rounds. Because you can probably compress your own data better than the other groups can, it pays off to do well already in the first rounds. Project points will, however, be based only on the final-round scores. Details of the data submission procedure are announced later.
October 29: First meeting (B119); participation recommended.
November 5: Second meeting (B119): Problems? No worries, let's work them out.
November 5: First data set updated and project template released.
November 10, 4pm: First deadline.
November 12: First round results announced. Q&A.
November 18, 4pm: Second deadline. Late submissions will be penalized.
November 20, 3:30pm: Deadline for submitting penalty data for the 3rd round.
November 25, 11:59pm: Deadline for solutions for the 3rd round.
November 27, 3:30pm: Deadline for submitting penalty/reward data for the 4th round.
December 2, 11:59pm: Deadline for solutions for the 4th round.
December 15 (midnight): Project report deadline. Being late will result in point reductions (10% per 24h).
March 5: The final project reports (along with the accompanying source code) are now public! Just click on the group names on the final leaderboard.
University of Helsinki | Department of Computer Science | Helsinki Institute for Information Technology