An Alphabet-Friendly FM-Index
Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro
Abstract.
We show that, by combining an existing compression boosting technique
with the wavelet tree data structure, we are able to design a
variant of the FM-Index which scales well with the size of the input
alphabet A. The size of the new index built on a string T[1,n] is
bounded by n
H_k(T) + O((n log log n)/ log_{|A|} n) bits, where
H_k(T) is the k-th order empirical entropy of T.
The above bound holds simultaneously for
all k<=a log_{|A|} n and 0< a < 1. Moreover,
the index does not depend on the parameter k, which plays a role
only in analysis of the space occupancy.
Using our index, counting the occurrences of an arbitrary pattern P[1,p] as
a substring of T takes O(p log |A|) time. Locating each pattern occurrence
takes O(log |A| log^{1+e} n) time.
Reporting a text substring of length l takes
O((l + log^{1+e} n) log |A|) time.