Succinct Suffix Arrays based on Run-Length Encoding.
Veli Mäkinen and Gonzalo Navarro
A succinct full-text self-index is a data structure built on a text
T=t_1 t_2 ... t_n, which takes little space (ideally close to that of the
compressed text), permits efficient search for the occurrences of a pattern
P=p_1 p_2 ... p_m in T, and is able to reproduce any text substring, so
the self-index replaces the text.
Several remarkable self-indexes have been developed in recent years. They
usually take space proportional to n H_0 or n H_k bits, where H_k is the
kth order empirical entropy of T. The time to count how many times does
P occur in T ranges from O(m) to O(m log n).
In this paper we present a new self-index, called RLFM index for "run-length
FM-index", that counts the occurrences of P in T in O(m) time when the
alphabet size is sigma=O(polylog(n)). The RLFM index requires
n H_k log_2(sigma)+O(n) bits of space for small k.
We then show how to implement the RLFM index in practice, and obtain in passing
another implementation with different space-time tradeoffs. We empirically
compare ours against the best existing implementations of other indexes and
show that ours are fastest among indexes taking less space than the text.