An Alphabet-Friendly FM-Index

Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro

Abstract. We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-Index which scales well with the size of the input alphabet A. The size of the new index built on a string T[1,n] is bounded by n H_k(T) + O((n log log n)/ log_{|A|} n) bits, where H_k(T) is the k-th order empirical entropy of T. The above bound holds simultaneously for all k<=a log_{|A|} n and 0< a < 1. Moreover, the index does not depend on the parameter k, which plays a role only in analysis of the space occupancy. Using our index, counting the occurrences of an arbitrary pattern P[1,p] as a substring of T takes O(p log |A|) time. Locating each pattern occurrence takes O(log |A| log^{1+e} n) time. Reporting a text substring of length l takes O((l + log^{1+e} n) log |A|) time.