Succinct Data Structures (SuDS) -research group
Group description
The research group studies a new subfield of data compression - data structure compression. The new aspect compared to traditional compression is that the compressed data (structure) needs to be represented so that access to its internal parts is provided without uncompressing the whole structure. As an example, consider a binary tree of n nodes. It is possible to represent the tree succinctly using about 2n bits so that the children and parent of any node can be accessed in constant time. A standard link structure representation of a binary tree takes of order n log n bits.
In addition to providing new algorithms and data structures in the field of study, the group contributes by engineering open source implementations targeted to applications such as sequence analysis in Bioinformatics, full-text search in Database Systems, and retrieval of structured documents in Information Retrieval.
The group is a member of Helsinki Institute for Information Technology and Finnish Centre of Excellence for Algorithmic Data Analysis Research.
News
- The group participates in the new Finnish Centre of Excellence in Cancer Genetics Research led by Academy Professor Lauri Aaltonen, starting 2012. The name of the group has changed into genome-scale algorithmics, better describing our current focus.
People
- Veli Mäkinen, Professor
- Niko Välimäki, Doctoral Student
- Jouni Siren, Doctoral Student
- Serikzhan Kazi, Research Assistant
Visitors and alumni
- Santeri Pietilä, Research Assistant (Spring/Summer 2011)
- Juha Karjalainen, Research Assistant (Autumn 2010), now PhD student at the University of Groningen.
- Riku Katainen, Research Assistant (Summer 2008-Summer 2010), now at Aaltonen lab
- Susana Ladra, PhD student, University of A Coruna, Spain (August - November, 2009)
- Antti Laaksonen, Research Assistant (summer 2009)
- Kashyap Dixit, MSc student, IIT Kanpur, India (May 2006 - July 2006)
- Wolfgang Gerlach, Diploma student, Bielefeld University, Germany (August 2006 - September 2006)
- Johannes Fischer, PhD student, LMU München, Germany (June 2007)
Collaboration
The project collaborates with several researchers abroad. A close and long-term collaboration is with Professor Gonzalo Navarro from University of Chile.
Software
- geneneralized compressed suffix array for indexing multiple alignment of several reference genomes or reference genome plus known variants. Note: Current construction algorithm requires a lot of resources for genome-scale inputs. We are working on a distributed construction algorithm to alleviate this issue.
- all-against-all suffix/prefix alignment for creating overlap graphs for de novo fragment assembly from short reads.
- readaligner for mapping (short) DNA reads into reference sequences.
- Run-Length Compressed Suffix Array and its incremental construction.
- SUDS Genome Browser.
- Frequent string mining.
- Compressed suffix tree.
- Our implementations of compressed suffix arrays and FM-indexes (like Succinct Suffix Array and Run-Length FM-index) can be downloaded from the Pizza & Chili Corpus, that contains library implementations of several compressed text indexes and testbed data collections.
Recent Publications
-
Paolo Ferragina, Jouni Sirén, and Rossano Venturini:
Distribution-Aware Compressed Full-Text Indexes.
Proc. ESA 2011, Springer LNCS 6942, pp. 760-771, Saarbrücken, Germany, September 5-7, 2011.
-
Jouni Sirén, Niko Välimäki, and Veli Mäkinen:
Indexing Finite Language Representation of Population Genotypes.
Proc. WABI 2011, Springer LNCS 6833, pp. 270-281, Saarbrücken, Germany, September 5-7, 2011.
-
Niko Välimäki, Susana Ladra, and Veli Mäkinen:
Approximate All-Pairs Suffix/Prefix Overlaps.
In Proc. 21st Annual Symposium on Combinatorial Pattern Matching (CPM 2010), Springer-Verlag, LNCS 6129, pp. 76-87, New York, USA, 21-23 June 2010.
[Article online]
[Implementation]
-
Jouni Sirén:
Sampled Longest Common Prefix Array.
In Proc. CPM 2010, Springer LNCS 6129, pp. 227-237, New York, USA, June 21-23, 2010.
[Article]
[Preprint]
[Slides]
- Veli Mäkinen, Niko Välimäki, Antti Laaksonen, and Riku Katainen:
Unified View of Backward Backtracking in Short Read Mapping.
In Ukkonen Festschrift 2010 (Eds. Tapio Elomaa, Pekka Orponen, Heikki Mannila), Springer-Verlag, LNCS 6060, pp. 182-195, 2010.
[Article online]
[Implementation] -
Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki:
Storage and Retrieval of Highly Repetitive Sequence Collections.
Journal of Computational Biology 17(3):281-308, 2010. -
Diego Arroyuelo, Francisco Claude, Sebastian Maneth, Veli Mäkinen, Gonzalo Navarro, Kim Nguyên, Jouni Sirén, and Niko Välimäki:
Fast In-Memory XPath Search over Compressed Text and Tree Indexes.
In Proc. ICDE 2010, IEEE, pp. 417-428, Long Beach, California, USA, March 1-6, 2010.
[Preprint] - Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro:
Faster Entropy-Bounded Compressed Suffix Trees.
Theoretical Computer Science, 410(51):5354-5364, 2009.
- Niko Välimäki, Veli Mäkinen, Wolfgang Gerlach, and Kashyap Dixit:
Engineering a Compressed Suffix Tree Implementation.
ACM Journal of Experimental Algorithmics, 14:4.2-4.23, 2009
[Preprint]
[Implementation] -
Jouni Sirén:
Compressed Suffix Arrays for Massive Data.
In SPIRE 2009, Springer LNCS 5721, pp. 63-74, Saariselkä, Finland, August 25-27, 2009.
- Veli Mäkinen, Gonzalo Navarro, Jouni Sirén,
and Niko Välimäki. Storage and Retrieval of Individual
Genomes.
In Proc. 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2009), Springer-Verlag LNCS 5541, pp. 121-137, Tucson, Arizona, May 18-21, 2009.
[Article online]
[Preprint]
[Implementation] - Johannes Fischer, Veli Mäkinen, and Niko Välimäki:
Space-Efficient String Mining under Frequency Constraints.
In Proc. Eighth IEEE International Conference on Data Mining (ICDM 2008), IEEE Computer Society, pages 193-202, Pisa, Italy, December 15-19, 2008.
[Article online]
[Preprint]
[Implementation] -
Veli Mäkinen and Gonzalo Navarro: Dynamic Entropy-Compressed Sequences and Full-Text Indexes.
ACM Transactions on Algorithms, 4(3):article 32, 2008.
[Preprint]
Funding
- Academy of Finland:
- "Theory and practice of succinct data structures", 1.8.2005-31.7.2007.
- Self-indexes in permanent storage, 1.8.2007-31.7.2012.
- "Algorithms and data structures for personal genomics", 1.8.2010-31.12.2012.
- Algorithmic Data Analysis (Algodan) research unit
- Funding for research assistants.
- Helsinki Doctoral Programme in Computer Science (Hecse)
- Funded PhD student position.
- Finnish Doctoral Programme in Computational Sciences (FICS)
- Funded PhD student position.
- Department of Computer Science
- "SUDS Genome Browser", 1.5.2009-31.12.2009.
- HIIT
- "Algorithms and tools for short read sequencing", 1.6.2010-31.8.2010.