Run-Length Compressed Suffix Array

This web page will no longer be updated. See the author's web pages for further releases.

RLCSA [4, 3, 7] is a compressed suffix array implementation that has been optimized for highly repetitive text collections. Examples of such collections include version control data and individual genomes. This implementation also serves as a testbed for many techniques used with compressed suffix arrays. The most up-to-date description of RLCSA can be found in [7].

The current version includes experimental support for:

LCP information [5]
Distribution-aware sampling [6, 9]
Adaptive distribution-aware sampling
Space-efficient document listing [8]

See README in the package for further information. The implementation is available for download under the MIT / X11 License.

There is also a separate package containing implementations of additional document retrieval algorithms:

Brute-force document listing and top-k document retrieval.
New variants of PDL for document listing and top-k document retrieval [10].
Several variants of Sadakane's algorithm for document listing.

Full experimental results from [10] are also included in the document listing package.

News

2014-07-23 This web page will no longer be updated. See the author's web pages for further releases.
2014-04-18 An updated version of the document listing package, with a minor bug fix to RLCSA to support it.
2014-01-11 A new version of the document listing package, and some new datasets from [10].
2014-01-10 A new version with a number of technical changes. Required by January 2014 version of the doclist package.
2013-05-21 A separate package with some additional document listing algorithms.
2013-05-20 A new version with improved document listing support.
2013-04-10 A new version with document listing support.
2013-01-21 A new version with parallel sampling.
2012-12-07 A new version with various changes accumulated over time.
2012-11-05 More test data is now available online.
2012-10-12 More test data is now available online.
2012-02-14 More space-efficient construction. Option to use a succinct bit vector to mark sampled positions.
2011-08-23 A minor update to support the August 2011 version of GCSA.
2011-05-17 Support for distribution-aware sampling and an improved low-level interface for external modules.
2011-01-17 A new version that supports the interface used in recent GCSA experiments.
2010-11-25 Added a list of the data sets used in the experiments.
2010-10-13 A new version with libstdc++ parallel mode support and performance improvements.
2010-03-29 A new version with some compatibility and interface updates.
2010-01-11 A new version with experimental support for LCP information.
2009-11-25 A new version of RLCSA is available. This version includes bug fixes, more functionality and a bit cleaner interface.
2009-06-15 The implementation of RLCSA is now available.

Downloads

Current versions

RLCSA (April 2014)
Document listing (April 2014)

RLCSA

Document listing

April 2014. A new variant of top-k PDL. Some technical updates and bug fixes. This package was used for the experiments in [10]. The package also includes full experimental results from [10]. Requires April 2014 version of RLCSA.
January 2014. A major update for the experiments in [10]. Requires January 2014 version of RLCSA.
May 2013. This is the initial release of the document listing package. The ILCP implementation is a rewrite of the one used in [8]. The package requires May 2013 version of RLCSA.

Data

The following data sets were used in experiments with RLCSA. Some of them are available from the Pizza&Chili Corpus (Chile, Italy).

DBLP archives and queries (available upon request) [6 (description), 9].
DNA sequences (with synthetic mutations in some cases) [1 (description), 2, 4, 5, 7, 10].
English language texts [5, 7, 10].
Finnish language words [10].
Human reference genome (NCBI build 34) [3].
Influenza genomes from NIH [8, 10].
Source code for OpenSSH version 4.7p1 and versions up to that [1, 2, 4].
Protein sequences [3].
Another set of protein sequences [10].
Saccharomyces cerevisiae and Saccharomyces paradoxus genomes from the Wellcome Trust Sanger Institute [2 (description), 4, 7].
Web queries (available upon request) and pages [6 (description), 9, 10].
Wikipedia archives: fiwiki, enwiki [3 (description), 5, 7, 8 (description), 10].

The full test environment from [10] is also available upon request. Note that setting it up requires much manual work, and some of the implementations use 32-bit libraries.

References

Jouni Sirén, Niko Välimäki, Veli Mäkinen, and Gonzalo Navarro: Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections.
Proc. SPIRE 2008, Springer LNCS 5280, pp. 164-175, Melbourne, Australia, November 10-12, 2008.

[Article] [Preprint]
Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki: Storage and Retrieval of Individual Genomes.
Proc. RECOMB 2009, Springer LNCS 5541, pp. 121-137, Tucson, Arizona, USA, May 18-21, 2009.

[Article] [Preprint]
Jouni Sirén: Compressed Suffix Arrays for Massive Data.
Proc. SPIRE 2009, Springer LNCS 5721, pp. 63-74, Saariselkä, Finland, August 25-27, 2009.

[Article] [Preprint]
Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki: Storage and Retrieval of Highly Repetitive Sequence Collections.
Journal of Computational Biology 17(3):281-308, 2010.

[Article] [Preprint] Conference versions: [1, 2]
Jouni Sirén: Sampled Longest Common Prefix Array.
Proc. CPM 2010, Springer LNCS 6129, pp. 227-237, New York, USA, June 21-23, 2010.

[Article] [Full version] [Preprint]
Paolo Ferragina, Jouni Sirén, and Rossano Venturini: Distribution-Aware Compressed Full-Text Indexes.
Proc. ESA 2011, Springer LNCS 6942, pp. 760-771, Saarbrücken, Germany, September 5-7, 2011.

[Article] [Preprint]
Jouni Sirén: Compressed Full-Text Indexes for Highly Repetitive Collections (PhD Thesis).
Department of Computer Science, Series of Publications A, Report A-2012-5, University of Helsinki, June 2012.
[Thesis]
Travis Gagie, Kalle Karhu, Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén: Document Listing on Repetitive Collections.
Proc. CPM 2013, Springer LNCS 7922, pp. 107-119, Bad Herrenalb, Germany, June 17-19, 2013.
[Article]
Paolo Ferragina, Jouni Sirén, and Rossano Venturini: Distribution-aware compressed full-text indexes.
Algorithmica 67(4):529-546, 2013.
[Article] Conference version: [6]
Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén: Document Retrieval on Repetitive Collections.
Accepted to ESA 2014.
[Preprint]

Jouni.Siren@cs.helsinki.fi