Tree Matching Problems with Applications to Structured Text DatabasesPekka Kilpeläinen: Tree Matching Problems with Applications to Structured Text Databases. PhD Thesis, Report A-1992-6, Department of Computer Science, University of Helsinki, November 1992. 110 pages. <http://www.cs.helsinki.fi/TR/A-1992/6> Full paper: gzip'ed Postscript file AbstractTree matching is concerned with finding the instances, or matches, of a given pattern tree in a given target tree. We introduce ten interrelated matching problems called tree inclusion problems. A specific tree inclusion problem is defined by specifying the trees that are instances of the patterns. The problems differ from each other in the amount of similarity required between the patterns and their instances. We present and analyze algorithms for solving these problems, and show that the computational complexities of the problems range from linear to NP-complete. The problems are motivated by the study of query languages for structured text databases. The structure of a text document can be described by a context-free grammar, and text collections can be represented as collections of parse trees. Matching-based operations are an intuitive basis for accessing the contents of structured text databases. In ``G-grammatical'' tree inclusion problems the target tree is a parse tree over a context-free grammar G. We show that a certain natural class of grammars allows solving some of the grammatical inclusion problems in linear time. Tree inclusion problems are extended by introducing logical variables in the patterns. These variables allow extracting substructures of the pattern matches and posing equality constraints on them. We show that most of the tree inclusion problems with logical variables are NP-hard, and also consider solving their polynomial versions. As an application of these problems we finally show how tree inclusion with logical variables can be used for querying structured text databases, and discuss how the inclusion queries should be evaluated in practice. Index Terms
Categories and Subject Descriptors:
General Terms: Algorithms Additional Key Words and Phrases: structured text databases, tree pattern matching, dynamic programming, NP-completeness |
Online Publications of Department of Computer Science, Anna Pienimäki