Data Mining and Machine Learning

Team

Collaborators outside the Unit

Prof. Padhraic Smyth (UC Irvine)
Dr. Gautam Das (University of Texas at Arlington)
Prof. Jean-Francois Boulicaut (INSA Lyon)
Prof. Raymond Ng (University of British Columbia)
Prof. Kimmo Koskenniemi (University of Helsinki)

The project develops methods and tools for analyzing large data sets and for searching for unexpected relationships in the data. The project combines development of combinatorial pattern matching algorithms with statistical techniques and database methods. The resulting techniques typically search through a large collection of potential local models that describe some aspect of the data in an easily understandable way. The project has also studied the construction of efficient predictors from large masses of data.

The group has produced several important results on methods for finding association rules, episode rules, and similarities in relational databases, event sequence data, and text. The methods have so far been applied in telecommunications, paleoecology, medical genetics, and text databases. Data mining research has many industrial applications, and part of the research group currently works in industry.

One of our aims is to develop efficient, analytically well-motivated, general-purpose learning algorithms for machine learning and data mining. A major goal for the coming years is further integration of combinatorial and statistical techniques. The project has had good success in, for example, approximating joint distributions by using association rules and maximum entropy principles. Similar combinations can profitably be used elsewhere, too: for example, ensemble methods combined with association and episode rules can produce simple but powerful predictors. Another goal of the project is to develop novel methods for analyzing spatial and spatiotemporal data arising in telecommunications and biological applications.
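
As an illustration of combining itemset frequencies with the maximum entropy principle, the following minimal sketch fits a distribution over binary attributes so that the probability of each given itemset matches a target frequency. It uses plain iterative proportional fitting and enumerates all states, so it is only suitable for a handful of attributes. The function name, the argument layout, and the example frequencies are assumptions made for illustration, not the project's own method.

```python
from itertools import product

def maxent_from_itemsets(n_attrs, constraints, iterations=200):
    """Fit a distribution over {0,1}^n_attrs by iterative proportional fitting.

    constraints maps an itemset (tuple of attribute indices) to the target
    probability that all of those attributes equal 1. Targets are assumed
    to be mutually consistent and strictly between 0 and 1.
    """
    states = list(product((0, 1), repeat=n_attrs))
    p = {s: 1.0 / len(states) for s in states}  # uniform start = maximum entropy
    for _ in range(iterations):
        for itemset, target in constraints.items():
            covered = {s for s in states if all(s[i] for i in itemset)}
            current = sum(p[s] for s in covered)
            for s in states:
                if s in covered:
                    p[s] *= target / current                   # match the itemset frequency
                else:
                    p[s] *= (1.0 - target) / (1.0 - current)   # keep total mass at 1
    return p

# Hypothetical frequencies: P(a0)=0.5, P(a1)=0.4, P(a0 and a1)=0.3; a2 is unconstrained.
dist = maxent_from_itemsets(3, {(0,): 0.5, (1,): 0.4, (0, 1): 0.3})
```

The unconstrained attribute stays uniform and independent of the others, which is the behavior the maximum entropy principle calls for when no information about it is available.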

An example problem is the analysis of multiple sequences of continuous and discrete values together with background knowledge. For example, we can have sensor data from various parts of an industrial or biological process, together with topological information about the interconnections between different components. The task is to find structure in the set of sequences: the structure can be local, e.g., rules describing interactions between different sequences, or global, e.g., a process or regulatory network. The computational methods used in this domain include fast algorithms for finding patterns, approximation techniques for joint densities, and search methods for finding the global structure. Some subproblems can be formulated as classification learning tasks where the target concept changes over time. Novel learning techniques (boosting, support vector machines) are applied and further developed for solving these problems.
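
A local rule of the kind mentioned above can be illustrated with a window-based episode frequency: the fraction of sliding time windows in which a given ordered list of event types occurs. The sketch below is a deliberately naive version under stated assumptions: timestamps are integers, the sequence is sorted by time, and the function names and the toy sequence are hypothetical.

```python
def contains_serial_episode(events, episode):
    """True if the event types in `episode` occur in `events` in that order."""
    it = iter(events)
    return all(any(e == target for e in it) for target in episode)

def episode_frequency(sequence, episode, width):
    """Fraction of windows [t, t + width) that contain the serial episode.

    `sequence` is a list of (time, event_type) pairs sorted by time.
    Every window that overlaps the sequence is counted.
    """
    first, last = sequence[0][0], sequence[-1][0]
    hits = 0
    total = 0
    for t in range(first - width + 1, last + 1):
        window = [e for (time, e) in sequence if t <= time < t + width]
        total += 1
        hits += contains_serial_episode(window, episode)
    return hits / total

def rule_confidence(sequence, lhs, rhs, width):
    """Confidence of the rule 'lhs is followed by rhs within one window'."""
    base = episode_frequency(sequence, lhs, width)
    return episode_frequency(sequence, lhs + rhs, width) / base if base else 0.0

# Toy alarm-like sequence (hypothetical data).
seq = [(1, "A"), (2, "B"), (4, "A"), (5, "C"), (6, "B")]
freq_ab = episode_frequency(seq, ("A", "B"), 3)
conf_ab = rule_confidence(seq, ("A",), ("B",), 3)
```

Rescanning the whole sequence for every window keeps the code short; a practical implementation would update the window incrementally.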

The same data mining, pattern matching, and machine learning approach can also be used in the area of text analysis. For example, the group has developed methods for finding all maximal subsequences that occur in a document collection with at least a given frequency; the resulting set of subsequences can be used as a condensed content descriptor for the documents. Another current example in the area of language technology is finding interrelated stories from different news feeds. The problem will be addressed with a combination of techniques, including pattern discovery and support vector machines. Once the news articles are classified into topic classes, the interconnected chains of news articles corresponding to event reports can be recognized.
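
To make the notion of maximal frequent subsequences concrete, here is a brute-force sketch that treats each document as a list of words, grows frequent word sequences one word at a time (gaps between words are allowed), and keeps only those not contained in a longer frequent sequence. It assumes whitespace tokenization and a threshold counted in documents; the function names and the toy documents are hypothetical, and the group's own algorithms are not described in this text, so this is not a reproduction of them.

```python
def is_subsequence(pattern, words):
    """True if `pattern` occurs in `words` in order, possibly with gaps."""
    it = iter(words)
    return all(any(w == p for w in it) for p in pattern)

def maximal_frequent_sequences(documents, min_docs):
    """Return word sequences found in at least `min_docs` documents that are
    not subsequences of any longer frequent sequence."""
    if min_docs < 1:
        raise ValueError("min_docs must be at least 1")
    docs = [d.lower().split() for d in documents]

    def support(pattern):
        return sum(is_subsequence(pattern, d) for d in docs)

    vocabulary = sorted({w for d in docs for w in d})
    frequent_words = [w for w in vocabulary if support((w,)) >= min_docs]

    level = [(w,) for w in frequent_words]
    all_frequent = []
    while level:
        all_frequent.extend(level)
        # Every frequent sequence has a frequent prefix, so appending one
        # frequent word to each pattern reaches all longer frequent sequences.
        level = [p + (w,) for p in level for w in frequent_words
                 if support(p + (w,)) >= min_docs]

    return [p for p in all_frequent
            if not any(p != q and is_subsequence(p, q) for q in all_frequent)]

docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "a dog sat on a log",
]
descriptors = maximal_frequent_sequences(docs, min_docs=2)
```

The exhaustive level-wise search is exponential in the worst case and is meant only to show what the output looks like on a tiny collection.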

Pasi.Rastas@cs.Helsinki.FI
