Parallel external memory suffix array construction
[back to the main page]We present a C++ implementation of the first parallel external memory suffix array construction algorithm. The basic idea of the algorithm is to first construct the suffix arrays for blocks of text and then merge them into the full suffix array.
- The algorithm is able to handle inputs much larger than considered in the literature so far. In our experiments, computing the suffix array of a 1TiB file with the new algorithm on a machine equipped with 120GiB of RAM took a little over a week and required only 7.2TiB of disk space (including input text and output suffix array), whereas on the same machine the previously best algorithm would require 3.5 times as much disk space and take about four times longer.
- The algorithm is able to fully utilize the available RAM, for example, computing the suffix array of a 200GiB file using 3.5GiB of RAM takes around 170h. With 120GiB of RAM, the computation time is reduced to less than 12h.
Details of the algorithm and experimental evaluation can be found in
[1].
For more information, see README provided with the package.
Downloads
- pSAscan 0.1.0 (current version)
Implemented by
- Juha Kärkkäinen
- Dominik Kempa
See also
- Lightweight external memory suffix array construction algorithm
- Better external memory (P)LCP array construction
- LCP array construction in external memory
- String sorting algorithms
References
-
Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi.
Parallel External Memory Suffix Sorting.
In Proc. 26th Symposium on Combinatorial Pattern Matching (CPM 2015),
Springer, 2015, pp. 329-342.
[Springer]