Project I: Compressing Coin Flips
Marija Furdek, Michael Duku Kaakyire, Anupam Arohi
Introduction
Coin flips are encoded as binary strings and stored in files. We treat this binary representation at the byte level, so the algorithm operates on one-byte character symbols rather than on individual bits. For this reason we were unable to address the biased version of the compression problem.
High-level description
Our solution is an implementation of adaptive arithmetic coding. The general idea of arithmetic coding is to divide the probability interval [0,1) into sub-intervals whose sizes are proportional to the probabilities of the individual symbols. After each step the interval is narrowed according to the current symbol, using the following algorithm:
Set low to 0.0
Set high to 1.0
While there are still input symbols do
    get an input symbol
    range = high - low
    high = low + range * high_range(symbol)
    low  = low + range * low_range(symbol)
End of While
output low
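The loop above translates directly into a few lines of C++. The following is a minimal sketch using doubles and a hypothetical two-symbol alphabet of our own choosing; it illustrates the interval update only and, as discussed later, runs out of precision quickly on real inputs.

#include <cstdio>
#include <string>

int main() {
    // Hypothetical alphabet: '0' occupies [0.0, 0.6), '1' occupies [0.6, 1.0).
    auto low_range  = [](char s) { return s == '0' ? 0.0 : 0.6; };
    auto high_range = [](char s) { return s == '0' ? 0.6 : 1.0; };

    std::string input = "0100";
    double low = 0.0, high = 1.0;
    for (char s : input) {
        double range = high - low;           // current interval width
        high = low + range * high_range(s);  // shrink the interval to this
        low  = low + range * low_range(s);   // symbol's sub-range
    }
    std::printf("any number in [%f, %f) encodes the input\n", low, high);
}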
The encoding algorithm has two
parts: the probability model and the arithmetic encoder. The model reads the
next symbol from the input stream and invokes the encoder, sending it the
symbol and the two required cumulative frequencies. The model then increments the
count of the symbol and updates the cumulative frequencies. The point is that
the symbol’s probability is determined by the model from its old count, and the
count is incremented only after the symbol has been encoded. This makes it
possible for the decoder to mirror the encoder’s operations. The encoder knows
what the symbol is even before it is encoded, but the decoder has to decode the
symbol in order to find out what it is. The decoder can therefore use only the
old counts when decoding a symbol.
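To make this encoder/decoder contract concrete, here is a minimal sketch of an adaptive byte model under names of our own choosing (the actual model in arithmetic_codec.cpp may differ). lookup() reports the old counts and update() is called only after the symbol has been coded, so the encoder and decoder stay in lockstep; the O(n) update is a simplification of what a production coder would use.

#include <array>
#include <cstdint>

struct AdaptiveModel {
    // cum[i] = sum of counts of all symbols < i; cum[256] is the total.
    std::array<uint32_t, 257> cum{};
    AdaptiveModel() {
        for (int i = 0; i <= 256; ++i) cum[i] = i;  // every byte starts at count 1
    }
    // Cumulative frequencies the coder needs: [low_cum, high_cum) out of total.
    void lookup(int sym, uint32_t& low_cum, uint32_t& high_cum,
                uint32_t& total) const {
        low_cum = cum[sym]; high_cum = cum[sym + 1]; total = cum[256];
    }
    // Called only *after* the symbol is coded, so the decoder can mirror it.
    void update(int sym) {
        for (int i = sym + 1; i <= 256; ++i) ++cum[i];
    }
};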
Once the symbol has been decoded, the decoder increments its count and updates the cumulative frequencies in exactly the same way as the encoder. In a nutshell, the basic operations used by the compressor are:
· Interval update and arithmetic operations
· Symbol encoding (interval search; a sketch follows this list)
· Interval renormalization
· Carry propagation and bit moves
· Probability estimation (source modeling)
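For the interval search named above, a minimal sketch of the decoder side, under assumed names: given the code value scaled into cumulative-count units, find the symbol whose range contains it. A linear scan suffices for illustration; a real coder might use a binary search for large alphabets.

#include <cstdint>

// cum[i] = cumulative count of symbols < i, as in the model sketch above.
int find_symbol(const uint32_t* cum, int num_symbols, uint32_t scaled_value) {
    int sym = 0;
    while (sym < num_symbols && cum[sym + 1] <= scaled_value) ++sym;
    return sym;  // invariant: cum[sym] <= scaled_value < cum[sym + 1]
}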
We faced challenges in scaling the algorithm to medium- and large-sized files: the basic method runs out of precision very quickly. It can be optimized in various ways, one of which is to shift out high-order digits regularly, effectively simulating unbounded precision with finite-width registers. The most significant digit is shifted out once it becomes the same for the lower and the upper bound, and the remaining digits are shifted to the left. For each symbol we count its occurrences and, according to its probability, assign it an index, a high and a low bound, and a high and a low cumulative frequency.
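A minimal sketch of this renormalization, assuming 32-bit integer bounds and binary rather than decimal digits (names are ours, not the application's):

#include <cstdint>
#include <vector>

void renormalize(uint32_t& low, uint32_t& high, std::vector<int>& out_bits) {
    const uint32_t kTop = 0x80000000u;
    // While low and high share their top bit, that bit can never change
    // again, so it is emitted and both bounds are shifted left.
    while ((low & kTop) == (high & kTop)) {
        out_bits.push_back((low & kTop) ? 1 : 0);  // settled leading bit
        low  = low << 1;         // shift in a 0
        high = (high << 1) | 1;  // shift in a 1 to keep the interval maximal
    }
    // A full coder also handles the underflow case where the bounds straddle
    // the midpoint (carry propagation); omitted here for brevity.
}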
Analysis and test results
The table below gives the results of running the application on the given test files:
File       | Original Size (bytes) | Compressed Size (bytes) | Ratio
binom1.dat | 10000                 | 5656                    | 1.77:1
binom2.dat | 10000                 | 8108                    | 1.23:1
binom3.dat | 10000                 | 9355                    | 1.07:1
binom4.dat | 10000                 | 9984                    | 1.00:1
binom5.dat | 10000                 | 10176                   | 0.98:1
binom6.dat | 10000                 | 9986                    | 1.00:1
binom7.dat | 10000                 | 9345                    | 1.07:1
binom8.dat | 10000                 | 8070                    | 1.24:1
binom9.dat | 10000                 | 5764                    | 1.73:1
We observed that in certain cases the coder expanded the file instead of compressing it. One obvious solution to this (not implemented) is to compare the compressed size with the original and skip compression when it is not fruitful. Multiple further strategies could take advantage of known binary patterns, e.g. binary pattern repetition, the p function, etc. We learned a lot about the various approaches that can be taken and adapted; however, with our limited technical expertise and time, these ideas did not materialize.
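The unimplemented fallback mentioned above would be simple to add as a wrapper around the coder. A sketch under our own naming: prepend a one-byte flag and store the raw bytes whenever coding expands the data.

#include <cstdint>
#include <vector>

std::vector<uint8_t> pack(const std::vector<uint8_t>& raw,
                          const std::vector<uint8_t>& compressed) {
    std::vector<uint8_t> out;
    if (compressed.size() < raw.size()) {
        out.push_back(1);  // flag: payload is arithmetic-coded
        out.insert(out.end(), compressed.begin(), compressed.end());
    } else {
        out.push_back(0);  // flag: payload is stored raw (coding expanded it)
        out.insert(out.end(), raw.begin(), raw.end());
    }
    return out;
}

The decompressor would read the flag byte first and either decode or copy the payload accordingly.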
Application files:
Source:
1. arithmetic_codec.h
2. arithmetic_codec.cpp
3. acfile.cpp
Binary:
aczip
Usage:
compress:   aczip -c <FileToBeCompressed> <CompressedFileName>
decompress: aczip -d <CompressedFileName> <ExpandedFileName>
References
David Salomon, Data Compression: The Complete Reference, 3rd ed., pages 106-124.
Mark Nelson, "Arithmetic Coding + Statistical Modeling = Data Compression", Dr. Dobb's Journal, February 1991.