Publications of the Data Mining Group at the University of Helsinki

Here is a slightly commented list of some of the publications of the group in areas related to data mining.

NOTE! Unfortunately this list is NOT up to date. For more recent publications, see the home pages of the members of the research group. This list and these pages will be updated ASAP. (mk 6/99)

Association Rules	Sequence Data	Theory
Surveys	Sampling	Machine Learning
Database Design and Data Mining	Discovering Document Structures	Data Mining in Text

Association rules are a simple formalism for expressing positive connections between columns in a 0/1 matrix. A fast algorithm for finding such rules is given in

R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo: Fast discovery of association rules. Chapter 12 in Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, 307 - 328, 1996. AAAI Press.
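
As a minimal sketch of what such a rule expresses, the snippet below computes two commonly used measures, support and confidence, for a rule between columns of a small 0/1 matrix. The function name `rule_stats`, the toy matrix, and the choice of measures are assumptions made for illustration; this is not the fast algorithm of the chapter above.

```python
def rule_stats(rows, lhs, rhs):
    """Return (support, confidence) of the rule lhs => rhs over 0/1 rows.

    lhs and rhs are lists of column indices (hypothetical interface).
    """
    n = len(rows)
    # Rows in which every left-hand-side column is 1.
    lhs_count = sum(1 for r in rows if all(r[c] == 1 for c in lhs))
    # Rows in which every column of the whole rule is 1.
    both_count = sum(
        1 for r in rows if all(r[c] == 1 for c in list(lhs) + list(rhs))
    )
    support = both_count / n if n else 0.0
    confidence = both_count / lhs_count if lhs_count else 0.0
    return support, confidence


# Toy 0/1 matrix: 4 rows, 3 columns.
rows = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
]
support, confidence = rule_stats(rows, lhs=[0], rhs=[1])
print(support, confidence)
```

Here the rule "column 0 implies column 1" holds in two of the three rows where column 0 is 1.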

The following paper shows how association rules can almost always be found in a single database pass, by using a random sample to bootstrap the discovery.

H. Toivonen. Sampling large databases for association rules. In 22nd International Conference on Very Large Databases (VLDB'96), 134-145, Mumbai, India, September 1996. Morgan Kaufmann.
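
One standard way to reason about how large a random sample should be is a Hoeffding-style bound on the error of an estimated frequency. The sketch below applies that generic bound; the function name `sample_size` and the use of this particular bound are assumptions, not a statement of the analysis in the paper above.

```python
import math


def sample_size(epsilon, delta):
    """Rows needed so that the sampled frequency of one fixed set deviates
    from its true frequency by more than epsilon with probability at most
    delta (two-sided Hoeffding bound, assumed here for illustration)."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


n = sample_size(epsilon=0.01, delta=0.01)
print(n)
```

The bound does not depend on the size of the database, which is what makes sampling attractive for very large collections.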

An experiment using a general purpose database management system to support the search for association rules is reported in

M. Holsheimer, M. Kersten, H. Mannila, and H. Toivonen. A perspective on databases and data mining. In First International Conference on Knowledge Discovery and Data Mining (KDD'95), 150-155, Montreal, Canada, August 1995. AAAI Press.

The methods above discover large collections of rules, and tools are needed to help in locating the interesting ones. The following two papers consider this problem.

M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and I. Verkamo: Finding interesting rules from large sets of discovered association rules. In Proceedings of the Third International Conference on Information and Knowledge Management (CIKM'94), 401-407, Gaithersburg, Maryland, November 1994. ACM Press.
H. Toivonen, M. Klemettinen, P. Ronkainen, K. Hätönen, and H. Mannila. Pruning and grouping discovered association rules. In MLnet Workshop on Statistics, Machine Learning, and Discovery in Databases, 47-52, Heraklion, Crete, Greece, April 1995.

The algorithms for finding association rules work by finding frequent sets of attributes. This approach also has surprising uses in finding other types of rules.

H. Mannila and H. Toivonen. Multiple uses of frequent sets and condensed representations. In Second International Conference on Knowledge Discovery and Data Mining (KDD'96), 189-194, Portland, Oregon, August 1996. AAAI Press.
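
The idea of finding frequent sets can be sketched as a levelwise search: count sets of size 1, keep the frequent ones, and build candidates of the next size only from sets whose subsets are all frequent. The code below is a simplified illustration under that assumption; the name `frequent_sets` and the toy data are hypothetical, and no attempt is made at efficient counting.

```python
from itertools import combinations


def frequent_sets(transactions, min_support):
    """Return {frozenset: count} for every set of items that occurs in at
    least min_support transactions, found level by level."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    result = {}
    k = 1
    while candidates:
        # One pass over the data per level: count each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        result.update(level)
        # Candidates of size k + 1: unions of frequent k-sets, kept only
        # if every k-subset of the union is itself frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return result


baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
found = frequent_sets(baskets, min_support=3)
for itemset, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this toy data all single items and all pairs are frequent, while the full triple occurs only twice and is therefore dropped.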

What do you do with large sequences of events? The following paper studies how to find sets of interconnected events from such sequences.

H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3): 259 - 289, November 1997. (Preliminary Report C-1997-15, University of Helsinki, Department of Computer Science, February 1997.)

The two different approaches of the above paper were first presented in

H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering Frequent Episodes in Sequences. In First International Conference on Knowledge Discovery and Data Mining (KDD'95), 210 - 215, Montreal, Canada, August 1995. AAAI Press.

and in

H. Mannila and H. Toivonen. Discovering generalized episodes using minimal occurrences. In Second International Conference on Knowledge Discovery and Data Mining (KDD'96), 146-151, Portland, Oregon, August 1996. AAAI Press.
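
To give a feel for what "frequent" can mean for a set of events in a sequence, the sketch below slides a fixed-width time window over the sequence and reports the fraction of windows that contain every event type of a candidate set. This is a simplified reading of the window-based idea; the function name `window_frequency`, the integer time stamps, and the exact window boundaries are assumptions.

```python
def window_frequency(events, episode, width):
    """Fraction of sliding windows [w, w + width) that contain every event
    type in `episode`. `events` is a list of (time, type) pairs with
    integer times; windows may partly overhang both ends of the sequence."""
    times = [t for t, _ in events]
    start, end = min(times) - width + 1, max(times)
    episode = set(episode)
    hits = total = 0
    for w in range(start, end + 1):
        seen = {e for t, e in events if w <= t < w + width}
        total += 1
        if episode <= seen:
            hits += 1
    return hits / total


events = [(1, "A"), (2, "B"), (4, "A"), (5, "C"), (6, "B")]
freq = window_frequency(events, {"A", "B"}, width=3)
print(freq)
```

The order of events inside a window is ignored here, so the sketch only covers unordered sets of event types.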
The next paper describes a system built for the analysis of telecommunications alarm databases:

K. Hätönen, M. Klemettinen, H. Mannila, P. Ronkainen, and H. Toivonen: Knowledge Discovery from Telecommunication Network Alarm Databases. In 12th International Conference on Data Engineering (ICDE'96), 115-122, New Orleans, Louisiana, February 1996. IEEE Computer Society Press. (article without figures)
A telecommunications view of this system is given in the following paper:

K. Hätönen, M. Klemettinen, H. Mannila, P. Ronkainen, and H. Toivonen: TASA: Telecommunications Alarm Sequence Analyzer, or "How to enjoy faults in your network". In IEEE/IFIP 1996 Network Operations and Management Symposium (NOMS'96), 520-529, Kyoto, Japan, April 1996. IEEE.
A Bayesian tool for the problem of modelling dependencies between events is described in the following papers:

E. Arjas, H. Mannila, M. Salmenkivi, R. Suramo, and H. Toivonen: BASS: Bayesian analyzer of event sequences. In Proceedings in Computational Statistics (COMPSTAT'96) 199-204, Barcelona, Spain, August 1996. Physica-Verlag.
H. Toivonen, H. Mannila, M. Salmenkivi & K.-P. Laakso: Bassist - a tool for MCMC simulation of statistical models. In 3rd Int'l Congress of the Federation of European Simulation Societies (EUROSIM'98), 1998.
M. Eerola, H. Mannila and M. Salmenkivi: Frailty factors and time-dependent hazards in modelling ear infections in children using BASSIST. To appear in XIII Symposium on Computational Statistics (COMPSTAT'98), 1998.
One of the main goals of knowledge discovery is to produce useful and valuable information for the users. The following papers consider not only the utilization aspect but also the whole discovery process:

Mika Klemettinen, Heikki Mannila, and Hannu Toivonen. A Data Mining Methodology and Its Application to Semi-automatic Knowledge Acquisition. In DEXA'97 Workshop, September 1-5, 1997. IEEE Computer Society Press.
Mika Klemettinen, Heikki Mannila, and Hannu Toivonen. Rule Discovery in Telecommunication Alarm Data. Journal of Network and Systems Management, Plenum Press, 1998. To appear.
Mika Klemettinen, Heikki Mannila, and Hannu Toivonen. Interactive Exploration of Interesting Findings in TASA. Information and Software Technology, Elsevier Science, June/July 1998. To appear.

There are lots of ad hoc studies in data mining. Could one obtain some general results? A possible framework is given in

H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery 1(3): 241 - 258, November 1997. (Preliminary Report C-1997-8, University of Helsinki, Department of Computer Science, January 1997.)

An early version of the paper appeared as

H. Mannila and H. Toivonen. On an algorithm for finding all interesting sentences. In Cybernetics and Systems (EMCSR'96), 973-978, Vienna, Austria, April 1996. Austrian Society for Cybernetic Studies.

Heikki Mannila: A tutorial on data mining. Proceedings of International Conference on Database Theory (ICDT'97), Delphi, Greece, January 1997, F. Afrati and P. Kolaitis (eds.), p. 41-55. Slides of the talk.

Heikki Mannila: Data mining: machine learning, statistics, and databases. Eighth International Conference on Scientific and Statistical Database Management, Stockholm, June 18-20, 1996, p. 1-8. Slides of the talk.

The PhD thesis of Hannu Toivonen is not actually a survey, but it covers the important area of the discovery of frequent patterns. Well-known examples of frequent patterns include association rules and episodes. Aspects handled in the work include a generic algorithm for the task of discovering frequent patterns, analyses of such tasks, the use of sampling, and rules with negation and disjunction.

Hannu Toivonen: Discovery of frequent patterns in large data collections. PhD Thesis, Report A-1996-5, University of Helsinki, Department of Computer Science, November 1996.

When there is a lot of data to analyze, sampling can ease the task. The following paper considers the relationship between the logical form of sentences and the sample size needed for reliable identification of the sentences.

J. Kivinen and H. Mannila: The power of sampling in knowledge discovery. In Proceedings of the 1994 ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'94), 77-85, Minneapolis, MN, May 1994. ACM Press.

A similar study in the context of functional dependencies is presented in

J. Kivinen and H. Mannila: Approximate dependency inference from relations. Theoretical Computer Science 149 (1), 129-149, September 1995.

The following paper shows how association rules can almost always be found in a single database pass, by using a random sample to bootstrap the discovery.

H. Toivonen. Sampling large databases for association rules. In 22nd International Conference on Very Large Databases (VLDB'96), 134-145, Mumbai, India, September 1996. Morgan Kaufmann.

The following papers are also more or less related to data mining / knowledge discovery, but they have a more classical machine learning orientation.
The paper

P. Kilpeläinen, H. Mannila, and E. Ukkonen: MDL Learning of Unions of Simple Pattern Languages from Positive Examples. In Computational Learning Theory, Second European Conference (EuroCOLT'95), 252-260, Barcelona, March 1995. Springer-Verlag.

tries to find a simple nontrivial class of concepts for which one could say something definite about the approximations to the MDL principle.
Humans seem (at least sometimes) to use rules that have exceptions: if so-and-so, then thus, unless so-and-so2, in which case thus2, etc. Properties of such rule formalisms, and how to learn such rules, are studied in

J. Kivinen, H. Mannila, and E. Ukkonen: Learning rules with local exceptions. In Computational Learning Theory (EuroCOLT'93), 35-36, Clarendon Press, Oxford 1994.
J. Kivinen, H. Mannila, and E. Ukkonen: Learning hierarchical rule sets. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, 37-44, July 27-29, 1992.
J. Kivinen, H. Mannila, E. Ukkonen, and J. Vilo: An algorithm for learning hierarchical classifiers. ECML'94.
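
A rule with exceptions of the "if so-and-so, then thus, unless so-and-so2, in which case thus2" kind can be pictured as a nested structure in which an exception overrides the rule it is attached to. The sketch below shows one possible way to represent and evaluate such rules; the class `Rule`, its fields, and the example are hypothetical and do not reproduce the formalisms or learning algorithms of the papers above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rule:
    condition: Callable[[dict], bool]   # "if so-and-so"
    conclusion: str                     # "then thus"
    exceptions: List["Rule"] = field(default_factory=list)  # "unless ..."

    def classify(self, example: dict) -> Optional[str]:
        """Return the conclusion for `example`, or None if the rule does
        not apply. The first applicable exception overrides the rule."""
        if not self.condition(example):
            return None
        for exception in self.exceptions:
            verdict = exception.classify(example)
            if verdict is not None:
                return verdict
        return self.conclusion


flying = Rule(
    condition=lambda x: x.get("bird", False),
    conclusion="flies",
    exceptions=[
        Rule(condition=lambda x: x.get("penguin", False),
             conclusion="does not fly"),
    ],
)

print(flying.classify({"bird": True}))
print(flying.classify({"bird": True, "penguin": True}))
print(flying.classify({"bird": False}))
```

Because exceptions are themselves rules, they can carry their own exceptions, which gives a hierarchy of increasingly specific cases.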

In database design, one can use data mining methods to look for integrity constraints in a database instance. See the following book for this, and some other issues.

H. Mannila and K.-J. Räihä: Design of Relational Databases. Addison-Wesley Publishing Company 1992; ISBN 0-201-56523-4. Reprinted 1994.

A new, efficient method for discovering functional dependencies and approximate functional dependencies is described in the following paper. The algorithm scales up better than previous algorithms.

Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen: Efficient discovery of functional and approximate dependencies using partitions. In 14th International Conference on Data Engineering (ICDE'98), Orlando, Florida, February 1998. IEEE Computer Society Press.
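
To illustrate what an approximate functional dependency is, the sketch below groups rows by their left-hand-side values and measures the fraction of rows that would have to be removed for the dependency to hold exactly. Grouping rows in this way is one simple reading of "using partitions"; the function name `dependency_error`, the error measure, and the toy relation are assumptions, and the code makes no claim to the efficiency of the method in the paper above.

```python
from collections import Counter, defaultdict


def dependency_error(rows, lhs, rhs):
    """Fraction of rows that must be removed so that the attributes in
    `lhs` functionally determine the attribute `rhs`. 0.0 means the
    dependency holds exactly."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key][row[rhs]] += 1
    # In each group, keep the rows that share the most common rhs value.
    kept = sum(max(counter.values()) for counter in groups.values())
    return 1 - kept / len(rows)


# Hypothetical toy relation.
rows = [
    {"city": "Helsinki", "zip": "00100", "country": "FI"},
    {"city": "Helsinki", "zip": "00140", "country": "FI"},
    {"city": "Espoo", "zip": "02100", "country": "FI"},
    {"city": "Espoo", "zip": "02100", "country": "FI"},
]
print(dependency_error(rows, ["city"], "country"))
print(dependency_error(rows, ["city"], "zip"))
```

A dependency would then count as approximate if its error stays below some chosen threshold.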

One interesting new area is that of so-called inductive databases, where the database consists of a data part and a pattern part.

Heikki Mannila: Inductive databases and condensed representations for data mining. In Int'l Logic Programming Symposium, 1997, p. 21-30.

Some of the data mining issues in database design are considered in other papers by Heikki Mannila.

H. Ahonen, H. Mannila, and E. Nikunen: Generating grammars for SGML tagged texts lacking DTD. Workshop on Principles of Document Processing (PODP'94), Darmstadt, 1994. Also to appear in Mathematical and Computer Modelling.
H. Ahonen, H. Mannila, and E. Nikunen: Forming grammars for structured documents. Proceedings of the 1993 Workshop on Knowledge Discovery in Databases (KDD'93), 314-325, Washington, D.C., July 1993. AAAI Press.
Helena Ahonen. Automatic generation of SGML content models. In Electronic Publishing '96, Palo Alto, California, USA, September 1996.
Helena Ahonen. Disambiguation of SGML content models. In Proceedings of the PODP'96 Workshop on the Principles of Document Processing, Palo Alto, California, USA, September 1996.
Helena Ahonen. Generating Grammars for Structured Documents Using Grammatical Inference Methods. Ph.D. thesis, University of Helsinki, Department of Computer Science, Series of Publications A, Report A-1996-4, November 1996.
Helena Ahonen, Barbara Heikkinen, Oskari Heinonen, and Mika Klemettinen. Improving the accessibility of SGML documents - A content-analytical approach. In SGML Europe'97, Barcelona, Spain, May 1997. To appear.
Helena Ahonen, Barbara Heikkinen, Oskari Heinonen, Jani Jaakkola, and Mika Klemettinen. Analysis of Document Structures for Element Type Classification. In Fourth International Workshop on Principles of Digital Document Processing PODDP'98, Saint Malo, France, March 29-30, 1998.

Recently, we have analysed document collections using data mining methods. This new field of application, text mining, is closely related to information retrieval and, in our case, text analysis in general.

Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and Inkeri Verkamo. Mining in the Phrasal Frontier. In 1st European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD'97), Trondheim, Norway, June 25-27, 1997. To appear. The pre-version is available as Report C-1997-14, University of Helsinki, Department of Computer Science, February 1997.
Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and Inkeri Verkamo. Applying Data Mining Techniques in Text Analysis. Report C-1997-23, University of Helsinki, Department of Computer Science, March 1997.
Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and Inkeri Verkamo. Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections. In Advances in Digital Libraries'98, Santa Barbara, California, USA, April 1998.

Last update on May 6, 1998.

This page is maintained by

Hannu.Toivonen@Helsinki.FI Mika.Klemettinen@Helsinki.FI