Project I: Compressing Coin Flips
Marija Furdek, Michael Duku Kaakyire, Anupam Arohi
Introduction
Coin flips are encoded as binary strings and stored in files. We treat this binary representation at the byte level, so the algorithm operates on one-byte character symbols rather than on individual bits. For this reason we were unable to address the biased version of the compression problem.
High-level description
Our solution is an implementation of adaptive arithmetic coding. The general idea of arithmetic coding is to divide the probability interval [0,1) into sub-intervals whose sizes are proportional to the probabilities of the individual symbols. After each step the interval is narrowed according to the current symbol, using the following algorithm:
Set low to 0.0
Set high to 1.0
While there are still input symbols do
    get an input symbol
    range = high - low
    high = low + range * high_range(symbol)
    low  = low + range * low_range(symbol)
End of While
output low
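The loop above translates directly into a few lines of C++. The following is a minimal sketch using doubles and a hypothetical two-symbol alphabet of our own choosing; it illustrates the interval update only and, as discussed later, runs out of precision quickly on real inputs.

#include <cstdio>
#include <string>

int main() {
    // Hypothetical alphabet: '0' occupies [0.0, 0.6), '1' occupies [0.6, 1.0).
    auto low_range  = [](char s) { return s == '0' ? 0.0 : 0.6; };
    auto high_range = [](char s) { return s == '0' ? 0.6 : 1.0; };

    std::string input = "0100";
    double low = 0.0, high = 1.0;
    for (char s : input) {
        double range = high - low;           // current interval width
        high = low + range * high_range(s);  // shrink the interval to this
        low  = low + range * low_range(s);   // symbol's sub-range
    }
    std::printf("any number in [%f, %f) encodes the input\n", low, high);
}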
The encoding algorithm has two
parts: the probability model and the arithmetic encoder. The model reads the
next symbol from the input stream and invokes the encoder, sending it the
symbol and the two required cumulative frequencies. The model then increments the
count of the symbol and updates the cumulative frequencies. The point is that
the symbol’s probability is determined by the model from its old count, and the
count is incremented only after the symbol has been encoded. This makes it
possible for the decoder to mirror the encoder’s operations. The encoder knows
what the symbol is even before it is encoded, but the decoder has to decode the
symbol in order to find out what it is. The decoder can therefore use only the
old counts when decoding a symbol.
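To make this encoder/decoder contract concrete, here is a minimal sketch of an adaptive byte model under names of our own choosing (the actual model in arithmetic_codec.cpp may differ). lookup() reports the old counts and update() is called only after the symbol has been coded, so the encoder and decoder stay in lockstep; the O(n) update is a simplification of what a production coder would use.

#include <array>
#include <cstdint>

struct AdaptiveModel {
    // cum[i] = sum of counts of all symbols < i; cum[256] is the total.
    std::array<uint32_t, 257> cum{};
    AdaptiveModel() {
        for (int i = 0; i <= 256; ++i) cum[i] = i;  // every byte starts at count 1
    }
    // Cumulative frequencies the coder needs: [low_cum, high_cum) out of total.
    void lookup(int sym, uint32_t& low_cum, uint32_t& high_cum,
                uint32_t& total) const {
        low_cum = cum[sym]; high_cum = cum[sym + 1]; total = cum[256];
    }
    // Called only *after* the symbol is coded, so the decoder can mirror it.
    void update(int sym) {
        for (int i = sym + 1; i <= 256; ++i) ++cum[i];
    }
};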
Once the symbol has been decoded, the decoder increments its count and updates the cumulative frequencies in exactly the same way as the encoder. In a nutshell, the basic operations used by the compressor are:
· Interval update and arithmetic operations
· Symbol encoding (interval search; a sketch follows this list)
· Interval renormalization
· Carry propagation and bit moves
· Probability estimation (source modeling)
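For the interval search named above, a minimal sketch of the decoder side, under assumed names: given the code value scaled into cumulative-count units, find the symbol whose range contains it. A linear scan suffices for illustration; a real coder might use a binary search for large alphabets.

#include <cstdint>

// cum[i] = cumulative count of symbols < i, as in the model sketch above.
int find_symbol(const uint32_t* cum, int num_symbols, uint32_t scaled_value) {
    int sym = 0;
    while (sym < num_symbols && cum[sym + 1] <= scaled_value) ++sym;
    return sym;  // invariant: cum[sym] <= scaled_value < cum[sym + 1]
}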
We faced challenges in scaling the algorithm to medium- and large-sized files: the basic method runs out of precision very quickly. It can be optimized in various ways, one of which is to shift out high-order digits regularly, effectively simulating unbounded precision with finite-width registers. The most significant digit is shifted out once it becomes the same for the lower and the upper bound, and the remaining digits are shifted to the left. For each symbol we count its occurrences and, according to its probability, assign it an index, a high and a low bound, and a high and a low cumulative frequency.
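A minimal sketch of this renormalization, assuming 32-bit integer bounds and binary rather than decimal digits (names are ours, not the application's):

#include <cstdint>
#include <vector>

void renormalize(uint32_t& low, uint32_t& high, std::vector<int>& out_bits) {
    const uint32_t kTop = 0x80000000u;
    // While low and high share their top bit, that bit can never change
    // again, so it is emitted and both bounds are shifted left.
    while ((low & kTop) == (high & kTop)) {
        out_bits.push_back((low & kTop) ? 1 : 0);  // settled leading bit
        low  = low << 1;         // shift in a 0
        high = (high << 1) | 1;  // shift in a 1 to keep the interval maximal
    }
    // A full coder also handles the underflow case where the bounds straddle
    // the midpoint (carry propagation); omitted here for brevity.
}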
Analysis and test results
The table below gives the results of running the application on the given test files:
File       | Original Size (bytes) | Compressed Size (bytes) | Ratio
binom1.dat | 10000                 | 5656                    | 1.77:1
binom2.dat | 10000                 | 8108                    | 1.23:1
binom3.dat | 10000                 | 9355                    | 1.07:1
binom4.dat | 10000                 | 9984                    | 1.00:1
binom5.dat | 10000                 | 10176                   | 0.98:1
binom6.dat | 10000                 | 9986                    | 1.00:1
binom7.dat | 10000                 | 9345                    | 1.07:1
binom8.dat | 10000                 | 8070                    | 1.24:1
binom9.dat | 10000                 | 5764                    | 1.73:1
We observed that in certain cases the coder expanded the file instead of compressing it. One obvious solution to this (not implemented) is to compare the compressed size with the original and skip compression when it is not fruitful. Multiple further strategies could take advantage of known binary patterns, e.g. binary pattern repetition, the p function, etc. We learned a lot about the various approaches that can be taken and adapted; however, with our limited technical expertise and time, these ideas did not materialize.
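The unimplemented fallback mentioned above would be simple to add as a wrapper around the coder. A sketch under our own naming: prepend a one-byte flag and store the raw bytes whenever coding expands the data.

#include <cstdint>
#include <vector>

std::vector<uint8_t> pack(const std::vector<uint8_t>& raw,
                          const std::vector<uint8_t>& compressed) {
    std::vector<uint8_t> out;
    if (compressed.size() < raw.size()) {
        out.push_back(1);  // flag: payload is arithmetic-coded
        out.insert(out.end(), compressed.begin(), compressed.end());
    } else {
        out.push_back(0);  // flag: payload is stored raw (coding expanded it)
        out.insert(out.end(), raw.begin(), raw.end());
    }
    return out;
}

The decompressor would read the flag byte first and either decode or copy the payload accordingly.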
Application files:
Source:
1. arithmetic_codec.h
2. arithmetic_codec.cpp
3. acfile.cpp
Binary:
aczip
Usage:
compress:   aczip -c <FileToBeCompressed> <CompressedFileName>
decompress: aczip -d <CompressedFileName> <ExpandedFileName>
References
David Salomon, Data Compression: The Complete Reference, 3rd ed., pages 106-124.
Mark Nelson, "Arithmetic Coding + Statistical Modeling = Data Compression", Dr. Dobb's Journal, February 1991.