First Huffman, then Burrows-Wheeler: A Simple Alphabet-Independent FM-Index
Szymon Grabowski, Veli Mäkinen, and Gonzalo Navarro
Abstract.
We design a succinct full-text index based on the idea of Huffman-compressing
the text and then applying the Burrows-Wheeler transform over it. The resulting
structure can be searched as an FM-index, with the benefit of removing the sharp
dependence on the alphabet size, c, present in that structure. On a text
of length n with zero-order entropy H_0, our index needs
O(n(H_0+1)) bits
of space, without any dependence on c. The average search time for a
pattern of length m is O(m(H_0+1)), under reasonable assumptions. Each
position of a text occurrence can be reported in worst case time
O((H_0+1) log n), while any text substring of length L can be retrieved in
O((H_0+1)L) average time in addition to the previous worst case time. By
paying 2n additional bits of space, it is possible to maintain the average
complexities and ensure that H_0+1 becomes log c in the worst case
for all the time complexities. Our index provides a relevant space/time
tradeoff between existing succinct data structures, with the additional
interest of being easy to implement.