3 Concepts: Information | Projects

Project II: De Compression Competition

Deadline(s)

1) April 15, 24:00
2) April 24, 24:00
3) April 29, 24:00
Final) May 6, 24:00
Documentation) May 14, 9:00

Results

R O U N D    1
 TOTAL BYTES
1. Team Heino 20'903
2. Team Kääriäinen 22'567
3. Team Haapasalo 28'088
4. Team Päiväniemi 34'272
5. Team Aunimo 35'552
6. Team Mäkelä 42'256

For Round 2, the curve1 and curve2 data sets have been enlarged and a new data set, yeast, has been introduced. In addition, the data set extra contains data submitted by Team Heino (1475 bytes) and Team Kääriäinen (643 bytes).

R O U N D    2
 TOTAL BYTES
1. Team Heino 139'440
2. Team Kääriäinen 145'897
3. Team Mäkelä 153'482
4. Team Haapasalo 156'586
5. Team Päiväniemi 185'395
6. Team Aunimo -

Team Heino and Team Kääriäinen have kept the lead. For Round 3 they will submit 3416 and 188 bytes, respectively. No new data sets for Round 3.

R O U N D    3
  DEC.  curve1   curve2   extra    info    letter  mush.  yeast  TOTAL
1. Team Heino 3'045 12'374 13'114 189 669 1'167 1 107'068 137'627
2. Team Kääriäinen 1'868 13'449 14'303 3'618 698 3'075 774 108'071 145'856
3. Team Mäkelä 2'914 13'286 13'947 3'605 483 6'405 1 108'006 148'647
4. Team Haapasalo 1'921 15'013 15'296 3'608 1'187 5'937 622 107'960 151'544
5. Team Päiväniemi 539 22'297 21'726 3'615 668 12'568 1'397 124'071 186'881
6. Team Aunimo 805 22'427 22'041 3'627 707 12'580 1'375 127'861 191'423

Team Heino is still going on strong (3208 bytes of data submitted). There seems to be a good race for the second place. For the Final Round, the (largish) data set forest has been introduced. We'll see if it changes the ranking. Also, detailed results may stir the pot.

F I N A L     R O U N D
  DEC.  curve1   curve2   extra   forest    info    letter  mush.  yeast  TOTAL
1. Team Heino 3'548 12'374 13'114 1 21'159 669 1'167 1 107'068 159'101
2. Team Kääriäinen 3'756 12'615 13'394 3'209 21'366 696 2'008 0 107'356 164'400
3. Team Mäkelä 3'077 13'205 13'867 3'209 22'088 502 6'137 1 107'049 169'135
4. Team Haapasalo 2'384 14'315 15'201 3'209 24'196 708 5'362 1 107'953 173'329
5. Team Aunimo 4'841 16'263 16'272 3'232 32'705 709 12'583 21 107'749 194'375
6. Team Päiväniemi 539 22'297 21'726 3'219 29'001 668 12'568 1'397 124'071 215'486

No changes in the top ranks. Team Heino didn't improve their performance but the lead was enough to keep to the first place, even though Team Kääriäinen did catch up a lot. The best improvement compared to Round 3 was made by Team Aunimo. Teams Heino and Kääriäinen will reveal the secret of their success at the Seminar.

The task

Your task is to design, document, and implement a compressor and a decompressor program. All imaginable lossless compression methods are allowed. The compression rate of the program will be evaluated with a given set of data files. However, scoring will not be based solely on compression rate. Instead the size of the decompressor program is added to the size of the compressed files. Using too complex a decompressor program (e.g., a la style ''print "data...";'') will not generally be a good idea.

As a special feature, in some cases both the compressor and the decompressor programs have access to side information related to the file to be compressed. Using the side information is not obligatory, but it makes compression significantly easier, thus reducing the size of the compressed file.

Deliverables

Your documentation, due May 14, should describe the ideas/observations/principles underlying the compressor and the decompressor algorithms. A complete project is a tar-file (all links should be relative and all files in the directory returned) including

» high-level description (in English) of the algorithms,
» program source for both compressor and decompressor programs.

The documentation can be delivered as a separate tar-file that can be extracted in the same directory as the programs.

In addition, two best groups will be giving a presentation about the basic ideas of your approach at the course seminar. Because the final results will be available only a short time before the seminar, all groups should be preliminarily prepared for giving a presentation.

In order to make the sizes of the programs comparable, programming language is Java for all groups (sorry about that). The programs must compile in Linux with the Makefile given in the project template. The usage of the programs must conform with the test script also found in the template. In particular, input is read from the standard input stream and a side information file specified on the command line (e.g., ''java mydecompress sideinfo.sdat <compressed.data >decompressed.data''). See the README file.

The form of the report is a WEB-page (the HTML template is included in the project template).

If you run into problems, email questions to Teemu.

Evaluation

The performance of your algorithm will be evaluated with the given set of data files. The loss-function is given by:

Loss := LD + LE_1 + ... + LE_N,




where N is the number of data files, LD is the size of the decompressor program, and LE_i is the size of the compressed version of the ith file.

The competition is iterative, meaning that there are several deadlines. After each deadline your success is evaluated and scores are published. After each round you are also allowed to submit your own data which will be used in the competition on the remaining rounds. Each group is allowed a variable amount of data, the amount depending on your success on earlier rounds. Because you can probably compress your own data better than other groups it pays off to try to do well already on the first rounds. Project points will be based only on the final round scores though. Details of the data submission procedure are annouced later.

Frequently Asked Questions

1. There is a bug in the mycompress and mydecompress scripts.

Yes, you're right. Instead of ''java -classpath cclass mycompress'' it should be ''java -classpath cclass mycompress $1''. And the same for mydecompress. I fixed the project template. (Thanks Jukka.)

2. Can the decompression program use other input than stdin and the side information file given on the command line?

No.

3. Can I use Java standard library java.___.___?

Yes.

4. Do I have use to Java standard library java.___.___? My class file gets too big.

No.

5. Making the class files small is a question of good/bad/ugly Java programming skills, not compression!

Unfortunately this is more or less the case. However, we expect the differences due to compression rate to be more significant than the differences due to programming style. If this turns out to be a problem, we will increase the sizes of the data files or think of something else. Besides, think of the bright side of things: We first thought of comparing the sizes of the source files! (See IOCCC.)

6. There is no point in using data from the other groups. They will be practically impossible to compress.

Yes, that's the idea. But don't worry, the submitted data files will be smaller than the ones given by us.

7. What are we supposed to do after the first round? Compressing other groups' data will be almost impossible, so there's no point in trying, and we will have done all the work by the first round.

That's the right attitude. Well, we didn't say that we won't add other data files ourselves.

8. Can the compressor and decompressor programs create and access temporary files in /tmp or elsewhere?

Yes.

9. Do the programs have to be general purpose applications, or does it suffice to be able to handle the given files?

Just the given files, with the given script (runtest.sh).

 

 3 Concepts: Information
2002