20 August 2002 9:00-13:00
Data Mining is mainly concerned with methodologies for extracting patterns from large data repositories. Many data mining methods exist; each accomplishes a limited set of tasks and produces a particular enumeration of patterns over a dataset. The main data mining tasks are: i) Clustering, ii) Classification, iii) Association Rule Extraction.
Since a data mining system can generate a large number of patterns, questions arise about the quality of the data mining results: which of the extracted patterns are interesting, and which of them represent valid knowledge?
In general, a pattern is interesting if it is easily understood, valid, potentially useful and novel. A pattern is also considered interesting if it validates a hypothesis that a user sought to confirm. An interesting pattern represents useful knowledge.
The interestingness of patterns depends on the quality of both the analysed data and the data mining results. Thus several techniques have been developed for evaluating and preparing the data used as input to the data mining process. A number of techniques and measures have also been developed for evaluating and interpreting the extracted patterns.
In this tutorial we address the important issue of assessing the quality of data mining results. We introduce the fundamental concepts of this area and present a review of clustering validity indices, as well as approaches and measures for evaluating the classification process and the interestingness of association rules.
The target audience consists of researchers, practitioners and advanced students with some knowledge of data mining who desire an introduction to data mining quality assessment techniques.
The tutorial is aimed at scientists with a basic understanding of data mining but no knowledge of quality assessment in data mining. The relevant data mining concepts will be reviewed, and the quality criteria and techniques for evaluating data mining results will be introduced and explained through examples.
1. Introduction and Motivation
This part discusses the issues that recent techniques leave under-addressed regarding the validity of data mining results. It motivates approaches that give an indication of the quality of data mining results, and then introduces the fundamental concepts of this area.
2. Cluster Validity Fundamental Concepts
This part addresses an important issue in the clustering process: assessing the quality of the clustering results. This is also related to the inherent features of the data set under consideration. A review of the clustering validity indices and approaches available in the literature is presented. More specifically, this part of the tutorial discusses the following sub-topics:
2.1 What is cluster validity?
2.2 Cluster Validity Criteria
2.3 Cluster Validity Indices
A review of cluster validity indices based on:
External Criteria
Internal Criteria
Relative Criteria
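As a small illustration of an external criterion, the sketch below computes the Rand index, which compares a clustering against a pre-specified partition by counting the pairs of points on which the two agree. The choice of this index and the function name `rand_index` are assumptions made for illustration; the indices actually covered in the tutorial are not listed here.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which two partitions agree.

    A pair counts as an agreement when both partitions put the two
    points in the same group, or both put them in different groups.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two points")
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)
```

The index only looks at co-membership, so relabelling the clusters does not change its value. An internal criterion, by contrast, would use only the data and the clustering itself, without a reference partition.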
2.4 Experimental Study
3. Evaluation of Classification Methods
3.1 Classification Model Accuracy
The most common techniques for assessing classifier accuracy will be discussed:
Hold-out method
k-fold cross-validation
Bootstrapping
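A minimal sketch of k-fold cross-validation, assuming the classifier is supplied as two callables. The names `k_fold_indices`, `cross_val_accuracy`, `fit`, `predict` and the majority-class baseline are hypothetical and serve only to make the example self-contained.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit, predict, k=5, seed=0):
    """Average test accuracy over k folds.

    Each fold is held out once for testing while the model is
    trained on the remaining k-1 folds.
    """
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


# Hypothetical baseline: always predict the most frequent training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model
```

The hold-out method corresponds to a single train/test split of this kind, while bootstrapping builds the training set by sampling with replacement instead of partitioning.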
3.2 Interestingness Measures of Classification Rules
It discusses some representative measures for ranking the usefulness and utility of discovered classification patterns (classification rules).
Rule-Interest Function
Smyth and Goodman's J-Measure
General Impressions
Gago and Bento's Distance Metric
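The first two measures in the list above have compact closed forms, sketched here for a rule "if antecedent then consequent". The formulas follow the commonly cited definitions of Piatetsky-Shapiro's rule-interest function and of Smyth and Goodman's J-measure; the function and parameter names are assumptions, and notation varies between presentations.

```python
from math import log2


def rule_interest(n_both, n_antecedent, n_consequent, n_total):
    """Rule-interest function: observed co-occurrence count minus the
    count expected if antecedent and consequent were independent."""
    return n_both - n_antecedent * n_consequent / n_total


def j_measure(p_antecedent, p_consequent, p_consequent_given_antecedent):
    """J-measure: probability that the rule fires, times the relative
    entropy between the posterior and prior of the consequent."""
    if not 0 < p_consequent < 1:
        raise ValueError("p_consequent must lie strictly between 0 and 1")
    for p in (p_antecedent, p_consequent_given_antecedent):
        if not 0 <= p <= 1:
            raise ValueError("probabilities must lie in [0, 1]")

    def term(p, q):
        # By convention 0 * log(0 / q) is taken to be 0.
        return 0.0 if p == 0 else p * log2(p / q)

    post = p_consequent_given_antecedent
    return p_antecedent * (
        term(post, p_consequent) + term(1 - post, 1 - p_consequent)
    )
```

Both measures are zero when the antecedent and consequent are statistically independent, which is why they can be used to rank rules by how far they depart from chance.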
4. Association Rules Interestingness Measures
Measures that indicate the importance and confidence of association rules will be reviewed. These measures can represent the predictive advantage of a rule, helping to identify interesting patterns of knowledge in data and to support decisions.
Strength
Coverage
Support
Leverage
Lift
Other Interestingness Measures
Klemettinen et al. Rule Templates
Gray and Orlowska's Interestingness
Dong and Li's Interestingness
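The five basic measures named at the start of this list can be computed directly from transaction counts. The sketch below uses the widely used definitions and assumes that "strength" denotes what is often called confidence; that reading, the function name `rule_measures` and the toy data are assumptions rather than the tutorial's own material.

```python
def rule_measures(transactions, antecedent, consequent):
    """Basic interestingness measures for the rule antecedent -> consequent.

    transactions: iterable of item collections
    antecedent, consequent: item collections (assumed disjoint)
    """
    a, c = frozenset(antecedent), frozenset(consequent)
    sets = [frozenset(t) for t in transactions]
    n = len(sets)
    if n == 0:
        raise ValueError("no transactions")
    p_a = sum(a <= t for t in sets) / n         # fraction containing the antecedent
    p_c = sum(c <= t for t in sets) / n         # fraction containing the consequent
    p_ac = sum((a | c) <= t for t in sets) / n  # fraction containing both
    if p_a == 0 or p_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    strength = p_ac / p_a
    return {
        "support": p_ac,
        "coverage": p_a,
        "strength": strength,            # assumed here to mean confidence
        "leverage": p_ac - p_a * p_c,    # difference from independence
        "lift": strength / p_c,          # ratio to independence
    }


# Toy data, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
measures = rule_measures(baskets, {"bread"}, {"milk"})
```

A lift below 1 or a negative leverage indicates that the two sides of the rule co-occur less often than independence would predict, even when the strength looks high.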
5. Summary and Trends
This part summarizes the main points of the tutorial regarding the quality assessment of data mining results. It also outlines trends in the field and directions for further work.
Maria Halkidi and Michalis Vazirgiannis
Dept of Informatics, Athens University of Economics & Business
Patision 76 Street, Athens
10434, Greece
Voice: +30-10-8203513(519)
Fax: +30-10-8203517
Dr. Michalis Vazirgiannis
Dr. Vazirgiannis is an Assistant Professor in the Dept of Informatics of
Athens Univ. of Economics & Business. He holds a degree in Physics
(1986), an MSc. in Robotics (1988), and an MSc. in Knowledge Based
Systems. In 1994 he obtained a Ph.D. degree in Informatics. Since
then, he has conducted research in the Knowledge & DB Lab (of N.T.U.
Athens, Greece), in GMD-IPSI (Darmstadt, Germany), in Fern-Universitaet
(Hagen, Germany) and in project VERSO in INRIA/Paris. His research
interests and work range from Data Mining to Spatiotemporal databases.
He has twice received the ERCIM fellowship. He has published two books
and over 50 papers in international conferences and journals. Currently
he is leading three international basic research projects funded by the
EU. He has served as a conference committee member and as a reviewer for
international conferences and journals.
Mrs. M. Halkidi, (MSc) PhD candidate
Maria Halkidi received a B.Sc. degree in Informatics in 1997. In 1999
she received an M.Sc. degree in Information Systems from Athens
University of Economics and Business (AUEB). She is now a PhD student
in the Dept. of Informatics (AUEB); her research area is Quality and
Uncertainty Handling in Data Mining. She is also a member of the DB-NET
research group in AUEB, participating in National and
European-funded projects. Her research interests include Knowledge &
Data Mining, Web Mining, Novel Data Management Systems (pattern-based
systems, data management in a mobile environment), and Representation &
Manipulation of uncertainty in database systems. She received an award
from IKY (Greek Fellowships Foundation) for the academic year 1997-98.
She has published nine papers in international conferences and two
papers in journals. She is a student member of ACM and IEEE.