Projects by Veli Mäkinen

Genome-scale algorithmics (see GSA group): We develop algorithms and data structures for the analysis of genome-scale data. Such data is abundant due to modern molecular biology measurement techniques like high-throughput sequencing. We are especially interested in applications of compressed data structures, that make it possible to analyse the often highly redundant data within the space of their information content. We also study other scalability aspects like distributed computation/storage around genome-scale data. An example of our recent developments is an extension of Burrows-Wheeler transform to finite automaton representing reference genome together with its common variations among the population (see WABI 2011). This enables a space-efficient index structure to be constructed to support efficient read alignment to a rich model of the population. Finite automaton representation enables good control over the richness of the model as one can e.g. create paths representing different haplotype blocks, a property not easy to handle using e.g. HMM-based aligners. The group is a member of Helsinki Institute for Information Technology and Finnish Centre of Excellence in Cancer Genetics Research. For the latter, we aim to tailor our finite automaton Burrows-Wheeler index to enable accurate variation calling on cancer genomics data.
Self-indexes in permanent storage (see SuDS group): The project focuses on the theory of index structures for textual data. Such structures can be used to speed up the context retrieval tasks, such as to answer queries like "Does my pattern X appear inside the text?". Indexes let us reach query times proportional only to the length of the query word, while without indexing we would need to scan through the whole text on each query. Self-indexes are extremely space-efficient versions of classical text indexes. The name comes from the fact that they contain enough information to reproduce the whole text. Thus, a self-index is, in itself, an index for the text and its representation. The most successful self-indexes up-to-date only need memory close to the compressed text size. In other words, a compression mechanism has been found that offers added value to the pure reduced space demand. The aforementioned properties of self-indexes have also been validated in practice, but only in the context of main memory indexes. As main memory solutions, self-indexes are useful tools for enabling genomic-scale analyses of biological sequences, as without indexing those analyses would be too time-consuming. For this purpose self-indexes suite perfectly, as they occupy no more space than the sequences themselves. The purpose of this research is to study the opportunities of self-indexes in permanent storage. This is an important future challenge; only when working in permanent storage, can self-indexes be considered as a universal representation of textual data. Such solutions would have direct applications in textual databases, Internet search engines, or even as parts of file systems in the operating system level.
Theory and Practice of Succinct Data Structures (SuDS): The project studies a new subfield of data compression - data structure compression. The new aspect compared to traditional compression is that the compressed data (structure) needs to be represented so that access to its internal parts is provided without uncompressing the whole structure. As an example, consider a binary tree of n nodes. It is possible to represent the tree succinctly using about 2n bits so that the children and parent of any node can be accessed in constant time. A standard link structure representation of a binary tree takes of order n log n bits. The project concentrates on compressed full-text index structures. Using these structures one can replace any file with a compressed one so that subword searches are possible without uncompressing the whole file. Moreover, the search time is only proportional to the length of the query word. More concretely, the technique provides an opportunity to replace .zip files with some other file format, which at the same time occupies less memory and provides efficient searches on the file. The first breakthroughs to make this scenario possible have already been achieved. The aim of the project, in addition to advancing the theory, is also to pay attention to the practical aspects in order to transfer the theory into technology.
Content-Based Retrieval and Analysis of Harmony and other Music Structures (C-BRAHMS):
String matching techniques can be used for efficient retrieval of query-melodies (produced e.g. by humming) from large music databases. I develop algorithms for so-called transposition invariant string matching, that can be used for searching melodies allowing an arbitrary shift in the pitch levels; i.e. person humming a piece does not have to start from the correct pitch level as long as (most of) the relative pitch levels are correct with respect to the first one.
Integrated Computational Methods for Genomic, Proteomic and Metabolic Modeling (ICOMIC):
My contribution was to develop methods for automatic matching of 2D protein gel electrophoresis images.

Projects by Veli Mäkinen

Home