2004
In this thesis, we consider minimality and size reduction issues of one-tape and multitape automata. Although the minimization of one-tape automata has been widely studied for many years, it seems that some issues have received little attention. One of these concerns finding specific conditions on automata that imply their minimality in the class of nondeterministic finite automata (NFA) accepting the same language.
Using the theory of NFA minimization developed by Kameda and Weiner in 1970, we show that any bideterministic automaton (that is, a deterministic automaton whose reversal is also deterministic) is the unique minimal automaton among all NFA accepting its language. In addition to minimality with respect to the number of states, we also show minimality in the number of transitions. Using the same theory of Kameda and Weiner, we obtain a more general minimality result: we specify a set of sufficient conditions under which the minimal deterministic automaton (DFA) accepting a language, or the reversal of the minimal DFA of the reversal language, is a minimal NFA of the language.
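To make the bideterminism condition concrete, here is a small illustrative check of my own (not part of the thesis): an automaton given as a set of transitions is deterministic if it has a single initial state and no two transitions leave the same state on the same symbol, and it is bideterministic if its reversal (swap initial and final states, reverse every transition) passes the same test.

    def is_deterministic(initial_states, transitions):
        """transitions: a set of (source, symbol, target) triples."""
        if len(initial_states) != 1:
            return False
        seen = set()
        for (src, sym, _tgt) in transitions:
            if (src, sym) in seen:   # two transitions from src on the same symbol
                return False
            seen.add((src, sym))
        return True

    def is_bideterministic(initial_states, final_states, transitions):
        # the reversal swaps initial and final states and reverses every transition
        reversed_transitions = {(tgt, sym, src) for (src, sym, tgt) in transitions}
        return (is_deterministic(initial_states, transitions) and
                is_deterministic(final_states, reversed_transitions))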
We also consider multitape bideterministic automata and show by a counterexample that such automata are not necessarily minimal. However, given the set of accepting computations of a bideterministic multitape automaton, we show that this automaton is the unique minimal automaton with this set of accepting computations.
We have also developed a polynomial-time algorithm to reduce the size of (one-way) multitape automata. This algorithm is based on simple language-preserving automata transformations that change the order in which transitions involving different tapes occur in the automaton graph and merge suitable states together. We present an example of a family of automata on which the reduction algorithm works well.
Finally, we apply the multitape-automata size-reduction algorithm along with the DFA minimization procedure to the two-way multitape automata appearing in a string-manipulating database system. We have implemented software to empirically evaluate our size-reduction algorithm on these automata. We have done experiments with automata corresponding to a set of string predicates defining several different string properties. The good results of these experiments suggest that this approach is useful for reducing the size of the automata appearing in this system.
Welcome!

In this talk, I will present the concepts and implementation details underlying the FIRE Station. The FIRE Station is a new environment for manipulating finite state automata, regular languages, regular expressions, and related objects. The environment integrates all of these, building on relatively few concepts at its core.
Welcome!

Cognitive behaviour requires complex context-dependent processing of information that emerges from the links between attentional perceptual processes, working memory, and reward-based evaluation of the performed actions. We describe a computational neuroscience theoretical framework which shows how an attentional state held in short-term memory in the prefrontal cortex can, by top-down processing, influence ventral and dorsal stream cortical areas using biased competition, accounting for many aspects of visual attention. We also show how, within the prefrontal cortex, an attentional bias can influence the mapping of sensory inputs to motor outputs, and thus play an important role in decision making. We also show how the absence of expected rewards can switch the attentional bias signal, and thus rapidly and flexibly alter cognitive performance. This theoretical framework incorporates spiking and synaptic dynamics which enable single neuron responses, fMRI activations, psychophysical results, the effects of pharmacological agents, and the effects of damage to parts of the system to be explicitly simulated and predicted. This computational neuroscience framework provides an approach for integrating different levels of investigation of brain function, and for understanding the relations between them. The models also directly address how bottom-up and top-down processes interact in visual cognition, and show how some apparently serial processes reflect the operation of interacting parallel distributed systems.
Welcome!

The evaluation of information retrieval systems most often examines how effective the systems are at retrieving relevant documents. Effectiveness is understood here in the sense of the English term 'effectiveness', not as performance (efficiency). Relevance is a central concept in information retrieval evaluation, and it has often been interpreted as binary (a document either is relevant or it is not). It is, however, intuitively clear that some documents are more relevant than others. Instead of evaluating how many relevant documents the systems retrieve relative to non-relevant ones, systems could be compared on the basis of how relevant the documents they retrieve are. The lecture discusses the concept of graded relevance and a few information retrieval evaluation measures that are able to exploit graded relevance.
Welcome!

In this thesis we provide computational tools for the planning of VTT-TRAC experiments. VTT-TRAC is a novel method for measuring the expression levels of genes. Monitoring gene expression by measuring the amounts of transcribed mRNAs (transcriptional profiling) has become an important experimental method in molecular biology, owing to rapid advances in high-throughput measurement technology. Methods like microarrays are capable of measuring thousands of expression levels in one experiment. However, new methods that are fast and reliable are still needed. In addition to scientific research, such methods have important applications, for example in medical diagnostics and bioprocess control.
High-throughput technologies require automatic tools for the planning of experiments. In VTT-TRAC, a special fragment of DNA called a probe has to be selected for each profiled gene. In addition, the method allows multiplexing, i.e., profiling a large number of genes together as a group called a pool, as long as the probes in each pool can be separated from each other. Thus, the major computational challenge is to divide the probes into as few pools as possible so that the whole experiment can be done as cost-efficiently as possible.
In this thesis we analyze and solve the key computational problems in the automatic planning of VTT-TRAC experiments. In particular, we show that dividing the probes into a minimal number of pools is an NP-hard problem and give an efficient approximation algorithm that is also practical. In addition, we develop a flexible method for the selection of probes. We have implemented our algorithms as fully automatic planning software. By planning different kinds of real experiments, we show that our tools allow convenient planning of cost-efficient experiments. We demonstrate computationally that almost the whole yeast genome could be profiled with fewer than 50 pools.
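To illustrate the flavor of the pooling problem only (a first-fit greedy sketch of my own under an assumed pairwise conflict predicate; it is not the approximation algorithm of the thesis): each probe is placed in the first pool where it does not conflict with any already-assigned probe, and a new pool is opened only when necessary.

    def greedy_pooling(probes, conflict):
        """conflict(p, q) is an assumed predicate that is True when probes
        p and q cannot be separated within the same pool."""
        pools = []
        for p in probes:
            for pool in pools:
                if not any(conflict(p, q) for q in pool):
                    pool.append(p)
                    break
            else:
                pools.append([p])   # no compatible pool found; open a new one
        return pools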
The planning of a VTT-TRAC experiment requires sequence information that is not available for all organisms. cDNA-AFLP is a method that can be used to study gene expression without complete sequence information. We propose a computational method that can be used to optimize cDNA-AFLP and VTT-TRAC experiments based on partial sequence information or sequence information on a related organism.
Welcome!

In this Thesis we study issues related to learning small tree- and graph-formed classifiers. First, we study reduced error pruning of decision trees and branching programs. We analyze the behavior of a reduced error pruning algorithm for decision trees under various probabilistic assumptions on the pruning data. As a result we get, e.g., new upper bounds for the probability of replacing a tree that fits random noise by a leaf. In the case of branching programs we show that the existence of an efficient approximation algorithm for reduced error pruning would imply P = NP. This indicates that reduced error pruning of branching programs is most likely infeasible in practice, even though the corresponding problem for decision trees is easily solvable in linear time.
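For readers unfamiliar with the procedure, the following is a standard textbook sketch of reduced error pruning for decision trees (the Node class and the data format are my own assumptions, and the Thesis may analyze a slightly different variant): subtrees are pruned bottom-up, and an internal node is replaced by a majority-label leaf whenever this does not increase the error on the pruning data.

    class Node:
        def __init__(self, feature=None, children=None, label=None):
            self.feature = feature           # None for a leaf
            self.children = children or {}   # feature value -> Node
            self.label = label               # prediction when used as a leaf

    def classify(node, x):
        while node.feature is not None:
            node = node.children[x[node.feature]]   # assumes all values are covered
        return node.label

    def reduced_error_prune(node, pruning_data):
        """pruning_data: list of (example_dict, label) pairs reaching this node."""
        if node.feature is None or not pruning_data:
            return node
        # prune the children first, each with the pruning data routed to it
        for value, child in node.children.items():
            subset = [(x, y) for (x, y) in pruning_data if x[node.feature] == value]
            node.children[value] = reduced_error_prune(child, subset)
        subtree_errors = sum(1 for (x, y) in pruning_data if classify(node, x) != y)
        labels = [y for (_, y) in pruning_data]
        majority = max(set(labels), key=labels.count)
        leaf_errors = sum(1 for y in labels if y != majority)
        # replace the subtree by a leaf if that is at least as accurate on the pruning data
        return Node(label=majority) if leaf_errors <= subtree_errors else node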
The latter part of the Thesis is concerned with generalization error analysis, more particularly with Rademacher penalization applied to small or otherwise restricted decision trees. We develop a progressive sampling method based on Rademacher penalization that yields reasonable data-dependent sample complexity estimates for learning two-level decision trees. Next, we propose a new scheme for deriving generalization error bounds for prunings of induced decision trees. The method for computing these bounds efficiently relies on the reduced error pruning algorithm studied in the first part of this Thesis. Our empirical experiments indicate that the obtained training set bounds may be almost tight enough to be useful in practice.
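For orientation, the data-dependent penalty behind Rademacher penalization is the empirical Rademacher complexity (the standard definition, not a result specific to this Thesis):

    \hat{R}_n(\mathcal{F}) = \mathrm{E}_\sigma \Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \Big],

where the \sigma_i are independent uniform random signs; a generalization error bound is obtained by adding a penalty proportional to this quantity to the observed training error.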
Welcome!

A priority queue Q supports the following operations (a minimal interface sketch is given after the list):
find-min(): Return a minimum element stored in Q.
insert(e): Insert element e into Q and return its position for later use.
delete-min(): Remove a minimum element from Q.
delete(p): Remove the element stored at position p from Q.
decrease(p, e): Replace the element stored at position p with a smaller element e.
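The sketch below (my own naive reference implementation, meant only to fix the interface; it performs linear scans and achieves none of the bounds discussed next) shows how the five operations fit together; a structure such as a pruned binomial queue replaces the internals while keeping the same interface.

    class PriorityQueue:
        class Position:
            def __init__(self, element):
                self.element = element

        def __init__(self):
            self._items = []                 # live Position objects

        def find_min(self):
            return min(self._items, key=lambda p: p.element).element

        def insert(self, e):
            p = PriorityQueue.Position(e)
            self._items.append(p)
            return p                         # handle for later delete/decrease

        def delete_min(self):
            self._items.remove(min(self._items, key=lambda p: p.element))

        def delete(self, p):
            self._items.remove(p)

        def decrease(self, p, e):
            assert e <= p.element            # the new element must not be larger
            p.element = e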
In many graph applications a priority queue is needed which achieves an amortized cost of O(1) for decrease() and O(log n) for delete-min(). In such situations a Fibonacci heap could be used.

In a recent joint paper with Amr Elmasry and Claus Jensen, we studied priority queues having the same efficiency in the worst case. In my presentation I will describe the internals of one such data structure, called a pruned binomial queue, which is a forest of pruned binomial trees. Most of the ideas used in the data structure are borrowed from the paper on relaxed heaps by Driscoll, Gabow, Shrairman, and Tarjan from 1988. Using pruned binomial trees as the basic building blocks, we have been able to develop a priority queue for which the constant factors in the asymptotic estimates are almost optimal. However, pruned binomial queues are already complicated, so my hope is that the presentation will generate some ideas on how the data structure could be simplified.
PE-lab's home page: http://www.diku.dk/forskning/performance-engineering/
Welcome!

Structural biology studies how living organisms are built at the molecular level and how different sub-units interact with each other. The reasoning is that an understanding of an organism's structure(s) leads to an understanding of its function(s). One of the major tools for studying biological structures is cryo-electron microscopy, which provides randomly oriented two-dimensional projections of biological samples ranging from single molecules to molecular complexes and even up to whole cells. From the projections, it is possible to computationally build (reconstruct) three-dimensional models of the original specimen in question. The major drawback of cryo-electron microscopy is the low signal-to-noise ratio, which severely hampers the model building.
We have developed several new computational methods for various sub-tasks within the reconstruction process. First, a new local-intensity-based method has been developed to locate roundish two-dimensional sub-images in larger cryo-electron microscopy images. Next, a method based on the Radon transform and the common line theorem has been developed to suppress noise in the cryo-electron microscopy images and thus improve the search for the true projection directions. Next, a novel real-space iterative reconstruction algorithm has been developed to incorporate density constraints into the density map calculations. Finally, a data mining approach has been used in a method developed to find and visualize different sub-structures in the density maps. All of the new methods have been implemented and tested using simulated and real data. The results indicate that our methods are at least comparable to other existing methods and in some cases enhance the reconstruction process.
Welcome!

Despite being the sum of decentralized and uncoordinated efforts by heterogeneous groups and individuals, the World Wide Web exhibits a well-defined structure, characterized by several interesting properties. This structure was clearly revealed by Broder et al., who presented the evocative bow-tie picture of the Web, with a core comprising a large strongly connected component and four sets of vertices distinguishable by their relations to the core. Our work focuses on further measurements of the Web in order to mine the inner structure of the bow-tie. We found that scale-free properties permeate all the quantities related to the strongly connected components in each of the bow-tie sets. Indeed, the indegree distributions in each set and the indegree in the SCC graph exhibit a power law with exponent 2.1. While a power law with exponent 2.1 is a universal feature of the Web, there are no self-similarity properties: IN and OUT do not show any bow-tie phenomena.
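In standard notation (my own addition for readability, not a quote from the abstract), a power-law indegree distribution with exponent 2.1 means that the fraction of vertices with indegree k scales as

    P(k) \propto k^{-2.1}.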
Welcome!

For decades, reuse of software components has been acknowledged to be one of the most effective means to reduce software development costs and improve product quality. Gradually, we have also learnt that reuse of software architecture and high-level design is equally important.
Object-oriented application frameworks provide an established way of reusing the design and implementation of applications in a specific domain. Using a framework for creating applications is not a trivial task, however. The complexity, variability, and abstract nature of frameworks make them difficult to specialize. It is therefore essential that a framework is delivered with proper documentation describing its purpose, usage, overall architecture, and detailed design.
For an application developer it is particularly crucial that the framework's reuse interface is well documented. Design patterns provide a means to express that interface in a systematic way. Special tools are also valuable in supporting the framework specialization process. In our JavaFrames framework engineering environment, a pattern formalism is introduced to enable task-driven assistance for framework specialization. Based on the patterns that are used to specify framework reuse interfaces, JavaFrames offers code generation, dynamically adjusted user documentation, and the validation of architectural constraints.
Unfortunately, reuse interface specifications typically become quite extensive and complex for non-trivial frameworks. In this thesis, we discuss the possibility of reverse engineering a reuse interface specification from the source code of a framework and its example applications. A formal concept analysis algorithm is adapted to produce JavaFrames specialization patterns from Java source code, and the algorithm is implemented within the JavaFrames tool set. In two case studies, automatic pattern extraction is conducted for the Struts and JUnit frameworks. The effectiveness of the method is assessed through comparisons to manually prepared annotations of the same frameworks.
The case studies indicate that it is possible to specify the intended rules governing the framework's specializations with a precise pattern-based formalism to allow effective tool support for framework specialization. It is also possible to automatically extract a considerable portion of those specifications from source code, which makes reuse interface modeling faster and can also increase the quality of the produced models.
Welcome!

In complexity theory, reductions are typically used to prove "Problem X is hard". In contrast, reductions in machine learning are a positive source of practical algorithms for solving a diverse set of learning tasks.
I will discuss a theoretical framework for analyzing these tasks which is "assumption free" in the same sense as the "online learning" model. Several of these theoretically analyzed (and optimized) algorithms have been implemented and tested, yielding performance superior to that of other methods.
Welcome!

The objective of gene mapping is to localize genes responsible for a particular disease or trait based on samples of patients and healthy individuals. Gene mapping is an important research area: identification of disease susceptibility genes allows for a better understanding of the mechanisms underlying the diseases, evaluation of an individual's risk of future onset, and possibly individually tailored medications.
In this thesis we consider case-control study settings where family trios with diseased offspring are ascertained. The data consists of haplotypes, that is, strings of alleles read at a set of markers in a chromosome. Alternatively, we consider samples of independent diseased and healthy subjects. In this case haplotypes are not readily available; the data consists of phase-unknown genotypes (pairs of alleles without information on their parental origin) over a set of markers.
We give two generic methods for non-parametric, association-based gene mapping: Haplotype Pattern Mining (HPM) and the Tree Disequilibrium Test (TreeDT). The methods are unique in that the emphasis is on structure inherent in the data (haplotype trees or patterns) rather than on involved statistical models. We present three instances of HPM for different kinds of data, and one of TreeDT. We also present and analyze algorithms for all the instances. Both HPM and TreeDT use permutation tests to evaluate the statistical significance of the associations. Among the algorithmic contributions we present an efficient algorithm for nested permutation tests.
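As a generic illustration only (the names test_statistic, haplotypes, and labels are placeholders of mine, and this is not the nested algorithm of the thesis), a permutation test estimates the significance of an observed association by re-evaluating the statistic under random reassignments of the disease status:

    import random

    def permutation_p_value(haplotypes, labels, test_statistic,
                            n_permutations=10000, seed=0):
        rng = random.Random(seed)
        observed = test_statistic(haplotypes, labels)
        shuffled = list(labels)
        hits = 0
        for _ in range(n_permutations):
            rng.shuffle(shuffled)            # break the haplotype-status association
            if test_statistic(haplotypes, shuffled) >= observed:
                hits += 1
        # add-one correction keeps the estimated p-value strictly positive
        return (hits + 1) / (n_permutations + 1)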
We conduct extensive experiments with simulated data in various settings. In our experiments HPM and TreeDT prove to be very competitive compared to alternative approaches. Advantages of HPM and TreeDT include computational efficiency and scalability with respect to the size of the data. We also show that studies with phase-unknown genotypes are more cost-effective in most cases than those with haplotypes obtained from family trios.
Welcome!

I will present the principles of a new approach aimed at automatically discovering motivic patterns in monodies. It will be shown that, if the results are to agree with the listener's understanding, the process needs to follow as closely as possible the strategies employed during listening. Motivic patterns, which may progressively follow different musical dimensions, are discovered through an adaptive incremental identification in a multi-dimensional parametric space. The combinatorial redundancy that would logically arise from such a structure is carefully limited with the help of particular heuristics. In particular, each class of pattern occurrences is uniquely associated with the most specific pattern description. Periodic repetitions of patterns also induce combinatorial proliferations of redundant patterns that need to be avoided. The computational implementation gives promising results: analysis of simple musical pieces may produce relevant and interesting motives. However, the stability of the system is not yet ensured, showing the necessity of further improvements.
Welcome!

In many pattern recognition problems, images have to be classified independently of their current position and orientation, which are just nuisance parameters. Instead of comparing a measured pattern in all possible locations against the prototypes, it is much more attractive to extract position-invariant, intrinsic features and to classify the objects in the feature space. Mathematically speaking, patterns form an equivalence class with respect to a geometric coordinate transform describing motion. Invariant transforms are able to map such equivalence classes into one point of an appropriate feature space.
The talk will describe new results for this classical problem and outline general principles for the extraction of invariant features from images (Haar integrals, Lie theory, normalization techniques). The nonlinear transforms are able to map the object space of image representations into a canonical frame with invariants and geometrical parameters. Besides the mathematical definitions, the talk will concentrate on characterizing the properties of the nonlinear mappings with respect to completeness and possible ambiguities, disturbance behaviour, and computational complexity. We especially investigated Haar integrals for the extraction of invariants based on monomial and relational kernel functions.
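As a brief reminder of the general construction (standard in the invariant-features literature; the exact formulation used in the talk may differ): given a compact transformation group G acting on images M, an invariant feature is obtained by averaging a kernel function f over the group with respect to its Haar measure,

    T[f](M) = \int_G f(gM) \, dg,

which satisfies T[f](g_0 M) = T[f](M) for every g_0 \in G; monomial or relational choices of f yield the kernel families mentioned above.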
Examples and applications will be given for the space invariant automatic classification of 3D objects (pollen classification) and for searching in image databases (image retrieval).
Welcome!

The importance of data analysis has increased in various application areas in recent years. For example, in ecology, biology, and biotechnology, as well as in industrial applications such as mobile computing and automatic calibration, the role of computational methods has become increasingly crucial. In this work I will focus on data-analytic and computational issues that arise in paleoecology. The aim of paleoecological studies is to estimate features of the past environment.
I will address two paleoecological data analysis tasks: (i) the paleoenvironmental reconstruction problem, and (ii) zonation analysis. The paleoenvironmental reconstruction problem concerns predicting values of environmental variables in the past. The objective of zonation analysis, in turn, is to identify zone boundaries in a biostratigraphic time series. The first problem is an instance of supervised learning and the latter of unsupervised learning. I outline a probabilistic framework for both tasks and introduce methods and models that can be used as analysis tools in assessing how the climate has evolved in the past.

I apply full probability modeling to the paleoenvironmental reconstruction task and introduce full probability models with different characteristics for the task. I address computational issues that arise as the models are applied. Further, I discuss automation of the reconstruction process using the Bassist software and consider opportunities for extending the Bassist language. Finally, I carry out various predictive performance tests with different models and present reconstructions of the mean July air temperature and total dissolved nitrogen in northern Finnish Lapland and the Baltic Sea, respectively.

I also outline a probabilistic framework for zonation analysis that enables simultaneous utilization of multi-proxy data sets. I address, in particular, the problem of estimating the number of zones and discuss probabilistic model selection principles and methods. I present non-parametric frequentist methods to determine the number of zone boundaries. Further, I introduce a simple zonation model with a Bayesian extension and give a straightforward linear-time approximation for the marginal likelihood of the Bayesian variant of the multinomial zonation model. I carry out experiments with both real-world multi-proxy data sets and simulated data.
Welcome!

Recent years have seen a wealth of research on approximate representations for time series. The vast majority of the proposed approaches represent each value with approximately equal fidelity, which may not always be desirable. For example, mobile devices and real-time sensors have brought home the need for representations that can approximate the data with fidelity proportional to its age. We call such time-decaying representations amnesic.
In this work, we introduce a novel representation of time series that can support arbitrary, user-specified time-decaying functions. We propose online algorithms for our representation, and discuss their properties. The algorithms we describe are designed to work either on the entire stream of data or on a sliding window of the data stream. Finally, we perform an extensive empirical evaluation on 40 datasets, and show that our approach can efficiently maintain a high-quality amnesic approximation.
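As a toy illustration of the amnesic idea only (my own piecewise-constant sketch with an assumed tolerance_of_age function, not the algorithms of the paper): adjacent segments of the summary are merged whenever their means differ by less than an error tolerance that grows with the age of the data, so older history is kept at coarser fidelity.

    def amnesic_summary(stream, tolerance_of_age):
        """Each segment is (start_time, count, mean); tolerance_of_age(age)
        is an assumed non-decreasing function of the segment's age."""
        segments = []
        for t, value in enumerate(stream):
            segments.append((t, 1, float(value)))
            merged = []
            for seg in segments:
                if merged:
                    s0, n0, m0 = merged[-1]
                    s1, n1, m1 = seg
                    if abs(m0 - m1) <= tolerance_of_age(t - s0):
                        # pool the two segments into one coarser segment
                        merged[-1] = (s0, n0 + n1, (n0 * m0 + n1 * m1) / (n0 + n1))
                        continue
                merged.append(seg)
            segments = merged
        return segments

For instance, amnesic_summary(data, lambda age: 0.01 * (1 + age)) tolerates roughly a hundred times more error on values a hundred steps old than on fresh ones.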
At the end of the presentation, I will also briefly talk about some of my other work, on streaming data and database views.
Welcome!

Wireless data access for nomadic users is predicted to be a major growth direction for the future Internet. However, no single existing wireless technology fits all user requirements at all times. Instead, several overlaying wireless networks can together provide the best possible data delivery service. Nomadic users can run interactive, conversational, streaming, and background applications that rely on end-to-end transport protocols to communicate over unreliable wireless links. Achieving efficient data transport in wireless overlay networks implies meeting the Quality of Service requirements of applications while preserving radio resources, battery power, and friendliness to other flows in the Internet.
Events such as delay spikes, bandwidth oscillation, and connectivity outages are difficult to prevent in the heterogeneous and dynamic wireless environment. For instance, delay spikes can be caused by handovers, higher-priority voice calls, and link-layer retransmissions. Furthermore, link characteristics can change by an order of magnitude when a handover is executed between overlay networks. Such disruptive events can cause delivery of duplicate, stale, or aborted data, and low utilization of the wireless link. Achieving efficient data transport in this environment demands coordinated efforts from the link layer and from end-to-end transport protocols.
In this dissertation, existing and emerging wireless networks are examined through measurements and simulations, paying special attention to the models used in simulations. The problem of inefficient data transport is resolved for reliable transfers by applying enhancements to the Transmission Control Protocol. Performance of real-time transfers is improved using the Lifetime Packet Discard that relies on an IP option for cross-layer communication. Finally, the effect of vertical handovers on transport protocols is evaluated and methods to prevent congestion or underutilization of the wireless link are presented.
Welcome!

Most mainstream model-checking methods are, in one way or another, based on the computation of the set of reachable states of the system under analysis. This is a computationally expensive task, and in practice it limits the applicability of the approach to systems with at most a few hundred significant state elements, after all available reductions. An appealing alternative, which avoids the computation of the reachable state space, is provided by the classic approach of inductive invariants. When combined with the technique of symbolic simulation, this yields a practical and effective verification method that allows us to address verification problems several orders of magnitude larger than before. I'll talk about this approach and some of the work I have done with it on the Pentium 4 designs over the past few years. I will also briefly discuss my views on the role of formal verification in industry at a more general level.
Welcome!

This work is motivated by a genetic data analysis task: recurrence risk analysis. Recurrence risks in various types of relatives, e.g., offspring and siblings, characterize the inheritance pattern of a genetic disorder. Given observations of the population prevalence and different recurrence risks, the task is to infer the number of underlying genes and the frequencies and effects of different variants of the genes. Due to the complex relationship between relatively simple data and a large number of model parameters, this problem is challenging both statistically and computationally. Straightforward application of existing techniques is not sufficient.
In the first part of this thesis, we study three general methodological issues. First, we review the Bayesian paradigm for statistical inference, with special emphasis on certain fundamental difficulties of practical Bayesian methods. Second, we study the sum-product problem, that is, marginalization of a multidimensional function that factorizes into a product of low-dimensional functions. We introduce a novel algorithm that improves the well-known variable elimination method by employing fast matrix multiplication techniques. A special type of sum-product problem, called a transformation problem, is also studied: we generalize a technique known as the Yates algorithm and show how the Möbius transformation on a subset lattice can be computed efficiently. Third, we consider the problem of integrating a multidimensional function. We describe a sophisticated method based on a tempering technique and Metropolis-coupled Markov chain Monte Carlo simulation. We argue that this method is particularly suitable for the computations required in Bayesian data analysis. Connections to related methods are also explored.
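For orientation, the fast zeta and Möbius transforms on the subset lattice of an n-element ground set can be computed in O(2^n n) time with the standard Yates-style recursion sketched below (a textbook version under my own naming, not the thesis implementation):

    def fast_zeta_transform(f, n):
        """g[S] = sum of f[T] over all subsets T of S; f is a list of
        length 2**n indexed by bitmask."""
        g = list(f)
        for i in range(n):
            for mask in range(1 << n):
                if mask & (1 << i):
                    g[mask] += g[mask ^ (1 << i)]
        return g

    def fast_moebius_transform(g, n):
        """Inverse of the zeta transform on the subset lattice."""
        f = list(g)
        for i in range(n):
            for mask in range(1 << n):
                if mask & (1 << i):
                    f[mask] -= f[mask ^ (1 << i)]
        return f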
The second part is devoted to the genetic application. We introduce a genetic model called an epistatic Mendelian model. For computing recurrence risks under a fully specified model, we present a method that employs the Yates algorithm. Based on a different problem representation, we also give another algorithm that uses the fast Möbius transform. To integrate over the model parameters we present a version of the Metropolis-coupled Markov chain Monte Carlo method. Finally, we report experimental results that support two main conclusions. First, Bayesian analysis of recurrence risks is computationally feasible. Second, recurrence risk data provides interesting information concerning competing genetic hypotheses, but the amount of information varies depending on the data set.
Welcome!

Genetic linkage analysis is a useful statistical tool for mapping disease genes and for associating the functionality of genes with their location on the chromosome. I will describe a program for linkage analysis, called SUPERLINK, and demonstrate its performance. I will focus on the relevant combinatorial problems that need to be solved in order to optimize the running time and space requirements of these types of programs, and on some new capabilities of this software. The talk is intended for an audience with no background in genetics. Joint work with Ma'ayan Fishelson, Dmitry Rusakov, and Nickolay Dovgolevsky.
Welcome!

Given a set of strings, the subsequence automaton accepts all subsequences of these strings. We will derive a lower bound for the maximum number of states of this automaton: we will prove that the size of the subsequence automaton for a set of k strings of length n is \Omega(n^k) for any k>=1. This solves an open problem posed by Crochemore and Tronicek in 1999, in which only the case k<=2 was settled. This is joint work with Zdenek Tronicek.
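For concreteness, the single-string case can be built directly with the standard construction below (a sketch of my own, not the lower-bound construction of the talk); state i records the shortest prefix s[:i] of s into which the input read so far embeds, and every state is accepting.

    def subsequence_automaton(s, alphabet):
        """next_state[i][c] = 1 + index of the first occurrence of c in s
        at position >= i, or None when no such occurrence exists (reject)."""
        n = len(s)
        next_state = [None] * (n + 1)
        first = {c: None for c in alphabet}
        for i in range(n, -1, -1):
            next_state[i] = dict(first)
            if i > 0:
                first[s[i - 1]] = i   # first occurrence of s[i-1] at index >= i-1 leads to state i
        return next_state

    def accepts(next_state, word):
        state = 0
        for c in word:
            state = next_state[state].get(c)
            if state is None:
                return False
        return True                   # every state is accepting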
Welcome!

tiedottaja@cs.helsinki.fi