Attribute similarity and event sequence similarity in data mining

Pirjo Ronkainen: Attribute similarity and event sequence similarity in data mining. Ph.Lic. Thesis, Report C-1998-42, Department of Computer Science, University of Helsinki, October 1998. 98 pages. <http://www.cs.helsinki.fi/TR/C-1998/42>

Full paper: gzip'ed Postscript file

Abstract

In data mining and knowledge discovery, similarity between objects is one of the central concepts. A measure of similarity can be user-defined, but an important problem is defining similarity on the basis of data. In this thesis we consider two kinds of similarity notions: similarity between binary-valued attributes and similarity between event sequences.

Traditional approaches for defining similarity between two attributes typically consider only the values of those two attributes, not the values of any other attributes in the relation. Such similarity measures are often useful, but they cannot reflect certain kinds of similarity. Therefore, we introduce a new attribute similarity measure that takes into account the values of the other attributes. The behavior of the different measures of attribute similarity is demonstrated by giving empirical results on two real-life data sets.

We also present a simple model for defining similarity between event sequences. The model is based on the idea that a similarity notion should reflect how much work is needed to transform one event sequence into another. We formalize this notion as edit distance between sequences. We show how the resulting measure of distance can be efficiently computed using a form of dynamic programming, and we also give some experimental results on two real-life data sets.

As one possibility of using the similarity notions discussed, we present how attributes and event sequences can be clustered into hierarchies.
We describe three standard agglomerative hierarchical clustering methods and give a set of clustering measures needed to find the best clustering in the hierarchy of clusterings. The results of our experiments show that with these methods we can produce natural clusterings of attributes and event sequences.

Index Terms
Categories and Subject Descriptors:
General Terms:
Additional Key Words and Phrases: Similarity, Distance, Clustering, Data mining, Knowledge Discovery