Succinct Suffix Arrays based on Run-Length Encoding
Veli Mäkinen and Gonzalo Navarro
A succinct full-text self-index is a data structure built on a text
T of length n, which takes little space (ideally close to that of the
compressed text), permits efficient search for the occurrences of a pattern
P in T, and is able to reproduce any text substring, so
the self-index replaces the text.
Several remarkable self-indexes have been developed in recent years. Many
of those take space proportional to nH_0 or nH_k bits, where H_k is the
k-th order empirical entropy of T. The time to count how many times does
P occur in T ranges from O(m) to O(m log n).
In this paper we present a new self-index, called RLFM index for "run-length
FM-index", that counts the occurrences of P in T in O(m) time when the
alphabet size is c=O(polylog(n)). The RLFM index requires
nH_k log c + O(n) bits of space, for any k <= a log_c n and
constant 0 < a < 1 . Previous indexes that achieve O(m) counting time either
require more than nH_0 bits of space or require that c=O(1). We also
show that the RLFM index can be enhanced to locate occurrences in the text and
display text substrings in time independent of c.
In addition, we prove a close relationship between the k-th order entropy of
the text and some regularities that show up in their suffix arrays and in the
Burrows-Wheeler transform of T. This relationship is of independent interest
and permits bounding the space occupancy of the RLFM index, as well as that of
other existing compressed indexes.
Finally, we present some practical considerations in order to implement the
RLFM index. We empirically compare our index against the best existing
implementations and show that it is practical and competitive against those.
In passing, we obtain a competitive implementation of an existing theoretical
proposal that can be seen as a simplified RLFM index, and explore other
practical ideas such as Huffman-shaped wavelet trees.