GCSA

This web page will no longer be updated. See the author's web pages for further releases.

GCSA [2, 3, 4] is a generalised compressed suffix array for finite languages a.k.a. labeled directed acyclic graphs (labeled DAGs). The implementation now supports general alphabet and multiple automata. The most up-to-date description of GCSA can be found in [3,4].

See README in the package for further information.

The implementation is available for download under the MIT / X11 License. Our implementation of RLCSA [1, 3] is required for compiling GCSA. (The current version of GCSA should always work with the current version of RLCSA.)

News

2014-07-23 This web page will no longer be updated. See the author's web pages for further releases.
2014-01-02 Wrapper scripts to pre-process VCF data and align paired-end reads are now available for download. See [4] for more details.
2013-12-19 Extended version of [2] accepted to IEEE/ACM Transactions in Computational Biology and Bioinformatics [4].
2013-10-26 New data and a prebuilt index available for download.
2013-05-20 A new bug fix release.
2012-12-07 A new version with support for general alphabet and multiple automata.
2012-06-11 Information about the datasets used in the experiments.
2012-02-14 Simpler and faster index. More space-efficient construction.
2011-08-23 Vastly improved construction algorithm.
2011-01-17 A new version that is significantly faster, and supports approximate searching.
2010-10-13 The implementation of GCSA is now available.

Downloads

Current version (May 2013).
May 2013. This bug fix release allows temporary automata with more than 2³² nodes. Backbone, multiple automata, and determinization now work correctly together. Better stabilization prediction and slightly smaller memory usage for index construction. Requires May 2013 version of RLCSA.
December 2012. GCSA now supports general alphabet and multiple automata. There is also a program for merging multiple automata into a single file, making them reverse deterministic when necessary. Requires December 2012 version of RLCSA.
February 2012. A simpler version of GCSA that is theoretically 2x (instead of 3x) slower than a regular CSA. Construction uses in-place sorting that is more space-efficient but slightly slower. Requires February 2012 version of RLCSA. This version was used in the experiments in [3].
August 2011. This implementation was used in the full version of [2]. Contains a vastly improved construction algorithm. Requires August 2011 version of RLCSA.
January 2011. A new improved version with faster basic operations and support for approximate searching. This version was used in [2]. Requires January 2011 version of RLCSA.
October 2010. The original version. Works with RLCSA (October 2010).

Additional downloads

2014-01-02 Wrapper scripts to pre-process VCF data and align paired-end reads are now available for download. See the README file and [4] for more details.
The alignment of the four assemblies of human chromosome 18 is available for download. See [2] for a description of the alignment. The read set used in the experiments in [2, 3] is not publicly available.
Automata and multiple alignments built from the human reference genome (GRCh37.p5) and the Finnish subset of frequent mutations from dbSNP database. Experiments on this data are reported in [4].

Prebuilt indexes

Building GCSA for the entire human genome requires more memory than is available on most systems. To alleviate the problem, we provide some prebuilt indexes for download. These indexes should work on any 64-bit Intel/AMD system running May 2013 version of GCSA compiled with g++.

Human reference genome (GRCh37.p5) and the Finnish subset of frequent mutations from dbSNP database: GCSA (2.8 GB), backbone (886 MB)

References

Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki: Storage and Retrieval of Highly Repetitive Sequence Collections.
Journal of Computational Biology 17(3):281-308, 2010.

[Article] [Preprint]
Jouni Sirén, Niko Välimäki, and Veli Mäkinen: Indexing Finite Language Representation of Population Genotypes.
Proc. WABI 2011, Springer LNCS 6833, pp. 270-281, Saarbrücken, Germany, September 5-7, 2011.

[Article] [Preprint] [Full version]
Jouni Sirén: Compressed Full-Text Indexes for Highly Repetitive Collections (PhD Thesis).
Department of Computer Science, Series of Publications A, Report A-2012-5, University of Helsinki, June 2012.
[Thesis]
Jouni Sirén, Niko Välimäki, and Veli Mäkinen: Indexing Graphs for Path Queries with Applications in Genome Research.
IEEE/ACM Transactions on Computational Biology and Bioinformatics 11(2):375-388, 2014.

[Article]

Jouni.Siren@cs.helsinki.fi