Lempel-Ziv (LZ) parsing is a central algorithm in data compression, at the core of
tools like LZ4, Zstandard, snappy, gzip, p7zip, among many others.
The LZ parsing achieves compression by finding previous occurrences of a text substrings (``source''),
and replacing the current occurrence (``target'') by a backward pointer to the source.
When the parsing is done greedily, the number of phrases is guaranteed to be minimized.
We have developed ReLZ, a memory-efficient variant for very large files that approximates the greedy parsing doing a two-steps parsing:
The first iteration only replaces substrings with a source in a short prefix of the text, generating a intermediate sequence much smaller than the original input. The second step consists on a greedy Lempel-Ziv parsing that further compresses the input.
The aim of this project is to turn our parser into an efficient compressor:
that includes the study and development of different encoding schemes for the backward pointers, and also the study of different variants of
the parser itself that can lead to better compression ratios: for instance, one can decide to encode only longer phrases as pointer, and short
phrases as literal strings.