AbstractGiven a pair of nonidentical complex objects, defining (and determining) how similar they are to each other is a nontrivial problem. In data mining applications, one frequently needs to determine the similarity between two time series. We analyze a model of time-series similarity that allows outliers, different scaling functions, and variable sampling rates. We present several deterministic and randomized algorithms for computing this notion of similarity. The algorithms are based on nontrivial tools and methods from computational geometry. In particular, we use properties of families of well-separated geometric sets. The randomized algorithm has provably good performance and also works extremely efficiently in practice.
Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications; I.3.5 [Computer Graphics]: Computational Geometry and Object Modeling
Additional Key Words and Phrases: data mining, time-series, similarity measures, geometric sets
Selected references
- Bernard Chazelle, Leo J. Guibas, and D. T. Lee. The power of geometric duality. In 24th Annual Symposium on Foundations of Computer Science, pages 217-225, Tucson, Arizona, 7-9 November 1983. IEEE.
- Christos Faloutsos, M. Ranganathan, and Yannis Manolopoulos. Fast subsequence matching in time-series databases. In Richard T. Snodgrass and Marianne Winslett, editors, Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, pages 419-429, Minneapolis, Minnesota, 24-27 May 1994. SIGMOD Record 23(2), June 1994.
- H. V. Jagadish, Alberto O. Mendelzon, and Tova Milo. Similarity-based queries. In Proceedings of the Fourteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 36-45, San Jose, California, 22-25 May 1995.