Helena Ahonen:
Generating Grammars for Structured Documents Using
Grammatical Inference Methods.
A-1996-4.
Dictionaries, user manuals, encyclopedias, and annual reports
are typical examples of structured documents.
Structured documents have an internal, usually hierarchical,
organization
that can be used, for instance, to help in retrieving information
from the documents and in transforming documents into another form.
The document structure is typically represented
by a context-free or regular grammar.
Many structured documents, however, lack such a grammar:
the structure of individual documents is known but
the general structure of the document class is not available.
Examples of such documents include those that
have Standard Generalized Markup Language (SGML) tags but
no Document Type Definition (DTD).
In this thesis we present a technique for generating a grammar that describes the structure of given structured document instances. The technique is based on ideas from machine learning. It first forms finite-state automata that describe the given instances completely. These automata are then modified by considering certain context conditions; the modifications correspond to generalizing the underlying language. Finally, the automata are converted into regular expressions, which are then used to construct the grammar. We also present refining operations that are necessary for generating a grammar for a large and complicated document. The technique has been implemented and tested on several document types.
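The generalization step can be illustrated with a simple form of context-based automaton inference: states that share the same bounded left context are merged, so repeated elements collapse into loops. The sketch below uses k-testable inference with invented element names; it illustrates the idea, not the thesis's actual algorithm.

```python
from collections import defaultdict

def infer_automaton(samples, k=1):
    # Build an automaton whose state is the last k symbols seen;
    # occurrences that share the same context share a state, which
    # generalizes the language learned from the instances.
    transitions = defaultdict(set)
    finals = set()
    for seq in samples:
        state = ()
        for sym in seq:
            nxt = (state + (sym,))[-k:]
            transitions[(state, sym)].add(nxt)
            state = nxt
        finals.add(state)
    return dict(transitions), finals

def accepts(automaton, seq):
    transitions, finals = automaton
    states = {()}
    for sym in seq:
        states = {n for s in states for n in transitions.get((s, sym), ())}
        if not states:
            return False
    return bool(states & finals)

# Two observed contents of a hypothetical element type:
aut = infer_automaton([["title", "para"], ["title", "para", "para"]])
```

After seeing one and two paragraphs, the merged automaton accepts any positive number of paragraphs after a title, the kind of generalization a hand-written content model such as `title, para+` would express.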
Hannu Toivonen:
Discovery of Frequent Patterns in Large Data
Collections.
A-1996-5.
Data mining, or knowledge discovery in databases,
aims at finding useful regularities in large data sets.
Interest in the field is motivated by the growth
of computerized data collections and by the
high potential value of patterns discovered in those collections.
For instance, bar code readers at supermarkets
produce extensive amounts of data about purchases.
An analysis of this data can reveal useful information
about the shopping behavior of the customers.
Association rules, for instance, are a class
of patterns that tell which products tend to be
purchased together.
The general data mining task we consider is the following: given a class of patterns that may occur in a given data collection, determine which patterns occur frequently and are thus probably the most useful ones. It is characteristic of data mining applications to deal with high volumes of both data and patterns.
We address the algorithmic problem of determining efficiently which patterns are frequent in the given data. Our contributions are new algorithms, analyses of problems, and pattern classes for data mining. We also present extensive experimental results. We start by giving an efficient method for the discovery of all frequent association rules, a well-known data mining problem. We then introduce the problem of discovering frequent patterns in general, and show how the association rule algorithm can be extended to cover this problem. We analyze the complexity of the problem and derive a lower bound for the number of queries in a simple but realistic model. We then show how sampling can be used in the discovery of exact association rules, and we give algorithms that are especially efficient in terms of the amount of database processing. We also show that association rules with negation and disjunction can be approximated efficiently. Finally, we define episodes, a class of patterns in event sequences such as alarm logs. An episode is a combination of event types that often occur close to each other. We give methods for the discovery of all frequent episodes in a given event sequence.
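The levelwise discovery of frequent itemsets is commonly realized as the Apriori algorithm: candidates of size k+1 are generated from frequent sets of size k and pruned using the monotonicity of support. A textbook sketch follows (not the thesis's implementation; the basket data and threshold are invented):

```python
from itertools import combinations

def apriori(transactions, min_support):
    def support(s):
        return sum(s <= t for t in transactions)
    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in items if support(s) >= min_support}
    frequent, k = {}, 1
    while level:
        for s in level:
            frequent[s] = support(s)
        # Join: unions of two frequent k-sets that form a (k+1)-set.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        # Prune: support is monotone, so every k-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k))}
        level = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent

baskets = [frozenset(t) for t in
           [{"beer", "chips"}, {"beer", "chips", "salsa"}, {"beer", "salsa"}]]
freq = apriori(baskets, min_support=2)
```

The prune step is what keeps candidate generation tractable on high volumes of data: a set can only be frequent if all of its subsets are.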
The algorithm for the discovery of association rules has been used in commercial data mining products, the episode algorithms are used by telecommunication operators, and discovered episodes are used in alarm handling systems.
Henry Tirri:
Plausible Prediction by
Bayesian Inference.
A-1997-1.
The capability to perform inference with uncertain and incomplete
information is characteristic of intelligent systems. Many of the
research issues in artificial intelligence and computational
intelligence can actually be viewed as topics in the "science of
uncertainty," which addresses the problem of plausible inference,
i.e., optimal processing of incomplete information. The various
approaches to modeling and implementing intelligent behavior, such as
neural networks, fuzzy logic, non-monotonic (default) logics, and
Bayesian networks, all address the same problem of finding an
appropriate language and inference mechanism to perform plausible
inference, needed to implement such activities as prediction, decision
making, and planning.
In this work we study the problem of plausible prediction, i.e., the problem of building predictive models from data in the presence of uncertainty. Our approach to this problem is based on the language of Bayesian probability theory in both its traditional and information-theoretic forms. We study Bayesian prediction theoretically and empirically with finite mixture models. Such models are interesting due to their ability to accurately model complex distributions with few parameters. In addition, finite mixture models can be viewed as a probabilistic formulation of many model families commonly used in machine learning and computational intelligence. We first address the question of how an intelligent system should predict given the available information. We present three alternatives for probabilistic prediction: single-model-based prediction, evidence-based prediction, and minimum-encoding-based prediction. We then compare the empirical performance of these alternatives using a class of finite mixture models. The empirical results demonstrate that, especially for small data sets, both the evidence-based and the minimum-encoding approaches outperform the traditionally used single-model approach.
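The gap between single-model and evidence-based prediction is easiest to see in a toy case. The sketch below uses a Bernoulli model rather than the finite mixtures studied in the thesis: the plug-in prediction commits to the single maximum-likelihood model, while the evidence-based prediction integrates over all models under a uniform prior, yielding Laplace's rule of succession.

```python
def single_model_prediction(heads, flips):
    # Plug-in: predict with the single maximum-likelihood model.
    return heads / flips

def evidence_prediction(heads, flips):
    # Integrate the Bernoulli likelihood over a uniform prior on the
    # bias; the posterior predictive is Laplace's rule of succession.
    return (heads + 1) / (flips + 2)

# After 3 heads in 3 flips, the plug-in model is certain of heads,
# while the evidence-based prediction still hedges.
p_single = single_model_prediction(3, 3)
p_evidence = evidence_prediction(3, 3)
```

As the data set grows, the two predictions converge; with small samples the evidence-based prediction avoids the overconfidence of the single model, mirroring the empirical finding above.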
We then focus on the problem of constructing finite mixture models from the given data and a priori information. We give the Bayesian solution for inferring both the most probable finite mixture model structure, i.e., the proper number of mixture components, and the most probable model within the class. For general mixture models the exact solution to both problems is computationally infeasible. Thus we also evaluate the quality of approximate approaches.
The Bayesian predictive approach presented can be applied to a wide class of prediction problems appearing in various application domains, e.g., medical and fault diagnostic problems, design problems and sales support systems. Using publicly available data sets, we demonstrate empirically that Bayesian prediction with finite mixtures is highly competitive when compared to the results achieved with other popular non-Bayesian approaches using, for example, neural network and decision tree models. The Bayesian prediction method presented constitutes the kernel of the D-SIDE/C-SIDE software currently used in industrial applications.
Greger Lindén:
Structured Document Transformations.
A-1997-2.
We present two techniques for transforming structured
documents. The first technique, called TT-grammars, is based on
earlier work by Keller et al., and has been extended to fit structured
documents. TT-grammars guarantee that the constructed transformation
produces only syntactically correct output, even when the source and
target representations are specified with two unrelated
context-free grammars. We present a transformation generator called
ALCHEMIST which is based on TT-grammars. ALCHEMIST has been
extended with semantic actions in order to make it possible to build
full scale transformations. ALCHEMIST has been extensively used in
a large software project for building a bridge between two development
environments. The second technique is a tree transformation method
especially targeted at SGML documents. The technique employs a
transformation language called TranSID, which is a declarative,
high-level tree transformation language. TranSID does not require
the user to specify a grammar for the target representation but
instead gives full programming power for arbitrary tree modifications.
Both ALCHEMIST and TranSID are fully operational on UNIX
platforms.
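Declarative tree transformation in the spirit of TranSID can be sketched as rule-based rewriting: each rule pairs a match predicate with a rewrite function, and rules are applied bottom-up over the tree. The rule format below is invented for illustration and is not TranSID syntax.

```python
def transform(node, rules):
    # A node is (tag, children). Rewrite children first (bottom-up),
    # then apply the first rule whose predicate matches the tag.
    tag, children = node
    children = [transform(c, rules) for c in children]
    for matches, rewrite in rules:
        if matches(tag):
            return rewrite(tag, children)
    return (tag, children)

# One declarative rule: rename <i> elements to <em>.
rules = [(lambda t: t == "i", lambda t, cs: ("em", cs))]
doc = ("p", [("i", [("#text", [])]), ("b", [])])
out = transform(doc, rules)
```

Because the rewrite functions may build arbitrary subtrees, no target grammar is needed; the price, as the abstract notes, is that output correctness is the programmer's responsibility.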
Matti Nykänen:
Querying String Databases with Modal Logic.
A-1997-3.
New application areas for databases demand the possibility of managing
not only the traditional atomic data types but also structured ones.
One such type is the finite sequence of characters drawn from a finite
alphabet. These string databases are important for example in molecular
biology data management, because they can for instance represent DNA
sequences directly as strings from alphabet
{A, C, G, T}. Then
the query language must be able to manipulate strings both as
indivisible entities and as ordered sequences of distinct characters,
perform pattern matching in strings, compare several strings with each
other, and generate new strings not yet in the database. This work
presents Alignment Calculus, a new modal logic extension of the
relational calculus, which satisfies all these requirements with a new
unified string manipulation formalism. This formalism is based on the
concept of multiple alignment of several strings; computational
molecular biology employs an analogous concept. In the language,
alignments are described logically with string formulae, a new
extension of linear-time temporal logic into multiple strings. The
abstract expressive power of Alignment Calculus is shown to equal the
Arithmetical Hierarchy. The work also develops a syntactically safe
sublanguage, whose queries require only finite computational
resources. This sublanguage is constructed by first identifying a
decidable subcase of string formula safety analysis, even though the
general problem is shown to be undecidable. This safe sublanguage is
shown to equal the Polynomial-time Hierarchy in its expressive power,
and therefore to capture all the string queries occurring in practical
situations. The operational counterpart of Alignment Calculus,
Alignment Algebra, is developed by replacing the selection operator of
relational algebra with one employing multitape nondeterministic
finite state automata corresponding to the aforementioned string
formulae, and adding an explicit domain symbol. The aforementioned safe
sublanguage also has a counterpart in this algebra: expressions in
which all occurrences of the domain symbol are restricted by immediately
enclosing automata. A finite evaluation strategy is developed for
these expressions.
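String formulae extend linear-time temporal logic to strings by letting positions play the role of time points. A minimal single-string evaluator shows the idea; the operator encoding here is an invented simplification, not Alignment Calculus itself.

```python
def holds(formula, s, i=0):
    # Evaluate a temporal formula on string s at position i; string
    # positions play the role of time points.
    op = formula[0]
    if op == "sym":                      # character at i equals formula[1]
        return i < len(s) and s[i] == formula[1]
    if op == "next":
        return holds(formula[1], s, i + 1)
    if op == "until":                    # formula[1] until formula[2]
        return any(holds(formula[2], s, j) and
                   all(holds(formula[1], s, m) for m in range(i, j))
                   for j in range(i, len(s) + 1))
    raise ValueError(op)

# "A until G": a run of A's followed by a G, as in a DNA string.
f = ("until", ("sym", "A"), ("sym", "G"))
```

The formula holds on "AAG" but not on "AAT". Over multiple aligned strings, the atomic formulae would inspect one character per string at each position of the alignment.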