582651 Project in Information-Theoretic Modeling (2 cr)
Compress data. We will give you the files to compress; your task is to squeeze them as small as possible. The catch is that the size of the decompressor program also counts toward the total score.
Your task is to design, document, and implement a compressor program and a decompressor program. All imaginable lossless compression methods are allowed. The compression rate will be evaluated on a given set of data files. However, scoring is not based solely on compression rate: the size of the decompressor program is added to the size of the compressed files. Hiding the data inside an overly large decompressor program (e.g., in the style of ''print "data...";'') will therefore generally not be a good idea.
As a special feature, in some cases both the compressor and the decompressor programs have access to side information related to the file to be compressed. Using the side information is not obligatory, but it makes compression significantly easier, thus reducing the size of the compressed file.
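As a toy illustration of how side information can help (the function names and the byte-wise residual scheme here are made up for this sketch, not part of the project template): a compressor might subtract the side information from the data and let a generic codec squeeze the resulting low-entropy residual.

```python
import zlib

def compress(data: bytes, side: bytes) -> bytes:
    # Subtract the side information byte-wise (assumes data and side
    # info have the same length). If the data correlates with the side
    # info, the residual is low-entropy and zlib shrinks it well.
    residual = bytes((d - s) % 256 for d, s in zip(data, side))
    return zlib.compress(residual, 9)

def decompress(blob: bytes, side: bytes) -> bytes:
    # Invert the two steps: inflate, then add the side info back.
    residual = zlib.decompress(blob)
    return bytes((r + s) % 256 for r, s in zip(residual, side))
```

With perfectly correlated side information the residual is all zeros and the compressed blob collapses to a few bytes, which is exactly the effect the side-information files are meant to enable.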
Your documentation, due December 15 (midnight), should describe
the ideas/observations/principles underlying the compressor and the
decompressor algorithms. A complete project is a tar file (all links
should be relative and all files contained in the directory you return),
including
• high-level description (in English) of the algorithms,
• program source for both compressor and
decompressor programs.
The documentation can be delivered as a separate tar-file that can be extracted in the same directory as the programs.
You can use any programming language you like, as long as we can run your program on the CS department Linux workstations. You can deliver source code, binaries, or a combination of both, as you see fit, but it must be possible to run the whole thing using the script decompress.sh, as described in the project template.
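As a hypothetical sketch of what a trivial gzip-based entry's decompress.sh might do (a real entry would substitute its own codec; for demonstration the script first creates a compressed sample file, then restores it):

```shell
#!/bin/sh
# Sketch only: assumes the compressor step was plain `gzip`.
set -e
printf 'sample payload\n' > sample.dat
gzip -f sample.dat                  # compressor step: produces sample.dat.gz
for f in *.dat.gz; do
    gunzip -f "$f"                  # decompressor step: restores sample.dat
done
```

Whatever the internals, the key requirement is that `decompress.sh` alone reproduces the original files on the department workstations.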
The form of the report is a WEB-page (the HTML template is included in the project template).
• Download the first data set (a sequence of 200 integers)
• Download the second data set (paleontological data)
This data is published here solely for the use of the participants of this course.
The first column has been "anonymized"
by injectively mapping the original (mostly Latin) words to randomly chosen Finnish words
(taken from kotus.fi;
apparently there are some… not-so-beautiful words in there as well, sorry about that).
The article by
Korn and Klug (2003)
might contain useful information at least about how some of the columns
are related to each other.
• The fourth data set is based on the
CoIL 2000
challenge. See the linked web page for plenty of useful information. Your data is based on the file
ticdata2000.txt; more specifically, the actual data file contains all columns except the last one
(class label), and the actual labels are given as side information. The rows of caravan.dat and
caravan.sdat correspond to the same people.
• Download the project template (includes the 1st data set)
• All data sets (2014-12-04).
Contains the new data set for the 5th round (note also the side information),
including the new penalty/reward data files.
To run the project, untar the package by 'tar xzf itmProject.tgz', go to the directory itmProject, and say 'make test'. See also the README file. If you run into problems, email questions to both the instructor and the teaching assistant.
The performance of your algorithm will be evaluated with the given
set of data files. The loss function is given by

    score = d + c_1 + c_2 + ... + c_N

where N is the number of data files, d is the size of the decompressor
program, and c_i is the size of the compressed version of the i-th file.
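The scoring rule can be sketched in a few lines (a hypothetical helper, not part of the project template; all sizes are in bytes):

```python
def total_score(decompressor_size: int, compressed_sizes: list[int]) -> int:
    # Loss = size of the decompressor program plus the sizes of the
    # compressed versions of all N data files; smaller is better.
    return decompressor_size + sum(compressed_sizes)

# e.g. a 1000-byte decompressor and two compressed files of 200 and 300 bytes
print(total_score(1000, [200, 300]))  # -> 1500
```

Note how the `decompressor_size` term penalizes the print-the-data trick: any data hidden in the program still counts toward the score.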
The competition is iterative, meaning that there are several deadlines. After each deadline your success is evaluated and scores are published. After each round you may also submit your own data, which will be used in the competition in the remaining rounds. Each group is allowed a varying amount of data, depending on its success in earlier rounds. Because you can probably compress your own data better than the other groups can, it pays off to do well already in the first rounds. Project points will, however, be based only on the final-round scores. Details of the data submission procedure are announced later.
October 29: First meeting (B119); participation recommended.
November 5: Second meeting (B119): Problems? No worries, let's work them out.
November 5: First data set updated and project template released.
November 10, 4pm: First deadline.
November 12: First round results announced. Q&A.
November 18, 4pm: Second deadline. Late submissions will be penalized.
November 20, 3:30pm: Deadline for submitting penalty data for the 3rd round.
November 25, 11:59pm: Deadline for solutions for the 3rd round.
November 27, 3:30pm: Deadline for submitting penalty/reward data for the 4th round.
December 2, 11:59pm: Deadline for solutions for the 4th round.
December 15 (midnight): Project report deadline. Being late will result in point reductions (10% per 24h).
March 5: The final project reports (along with the accompanying source code) are now public! Just click on the group names on the final leaderboard.
University of Helsinki | Department of Computer Science | Helsinki Institute for Information Technology