- Genome-scale algorithmics (see GSA group):
We develop algorithms and data structures for the analysis of genome-scale data. Such data is abundant due to modern molecular biology measurement techniques like high-throughput sequencing. We are especially interested in applications of compressed data structures, that make it possible to analyse the often highly redundant data within the space of their information content. We also study other scalability aspects like distributed computation/storage around genome-scale data.
An example of our recent developments is an extension of Burrows-Wheeler transform to finite automaton representing reference genome together with its common variations among the population (see WABI 2011). This enables a space-efficient index structure to be constructed to support efficient read alignment to a rich model of the population. Finite automaton representation enables good control over the richness of the model as one can e.g. create paths representing different haplotype blocks, a property not easy to handle using e.g. HMM-based aligners.
The group is a member of Helsinki Institute for Information Technology and Finnish Centre of Excellence in Cancer Genetics Research. For the latter, we aim to tailor our finite automaton Burrows-Wheeler index to enable accurate variation calling on cancer genomics data.
- Self-indexes in permanent storage (see SuDS group):
The project focuses on the theory of index structures for textual data. Such structures can be used to speed up the
context retrieval tasks, such as to answer queries like "Does my pattern X appear inside the text?". Indexes let us
reach query times proportional only to the length of the query word, while without indexing we would need to scan
through the whole text on each query.
Self-indexes are extremely space-efficient versions of classical text indexes. The name comes from the fact that they
contain enough information to reproduce the whole text. Thus, a self-index is, in itself, an index for the text and its
representation. The most successful self-indexes up-to-date only need memory close to the compressed
text size. In other words, a compression mechanism has been found that offers added value to the pure reduced
space demand.
The aforementioned properties of self-indexes have also been validated in practice, but only in the context of main
memory indexes. As main memory solutions, self-indexes are useful tools for enabling genomic-scale analyses of
biological sequences, as without indexing those analyses would be too time-consuming. For this purpose self-indexes
suite perfectly, as they occupy no more space than the sequences themselves.
The purpose of this research is to study the opportunities of self-indexes in permanent storage. This is an important
future challenge; only when working in permanent storage, can self-indexes be considered as a universal
representation of textual data. Such solutions would have direct applications in textual databases, Internet search
engines, or even as parts of file systems in the operating system level.
- Theory and Practice of Succinct Data
Structures (SuDS):
The project studies a new subfield of data compression - data structure
compression. The new aspect compared to traditional compression is that
the compressed data (structure) needs to be represented so that access
to its internal parts is provided without uncompressing the whole
structure. As an example, consider a binary tree of n nodes. It is
possible to represent the tree succinctly using about 2n bits so that
the children and parent of any node can be accessed in constant time. A
standard link structure representation of a binary tree takes of order n
log n bits. The project concentrates on compressed full-text index structures.
Using these structures one can replace any file with a compressed one so that
subword searches are possible without uncompressing the whole file.
Moreover, the search time is only proportional to the length of the
query word. More concretely, the technique provides an opportunity to
replace .zip files with some other file format, which at the same time
occupies less memory and provides efficient searches on the file. The
first breakthroughs to make this scenario possible have already been
achieved. The aim of the project, in addition to advancing the theory,
is also to pay attention to the practical aspects in order to transfer
the theory into technology.
- Content-Based Retrieval and Analysis of Harmony and other Music
Structures (C-BRAHMS):
String matching techniques can be used for efficient retrieval of
query-melodies (produced e.g. by humming) from large music databases.
I develop algorithms
for so-called transposition invariant string matching,
that can be used for searching melodies allowing an arbitrary shift
in the pitch levels; i.e. person humming a piece does not have
to start from the correct pitch level as long as (most of) the
relative pitch levels are correct with respect to the first one.
- Integrated Computational Methods for Genomic,
Proteomic and Metabolic Modeling (ICOMIC):
My contribution was to develop methods for
automatic matching of 2D protein gel electrophoresis images.