Discovering frequent patterns from strings

Jaak Vilo: Discovering frequent patterns from strings. Report C-1998-9, Department of Computer Science, University of Helsinki, 1998. 19 pages. <http://www.cs.helsinki.fi/TR/C-1998/9>

Full paper: gzip'ed Postscript file
Metadata: XML file

Abstract

We have developed a new algorithm for discovering frequently occurring patterns from strings. Our algorithm generates systematically a pattern trie while maintaining information about the occurrences of each pattern. Patterns are constructed incrementally by expanding the prefixes of the frequent-enough patterns. Only the patterns that really occur in input strings and are frequent enough, are generated and analyzed. Hence, the higher the minimum frequency threshold, the faster the algorithm works. Pattern classes that can be generated in this way include the substring patterns, substring patterns containing group characters, and patterns containing wildcard positions. By a small modification the algorithm is able to take as input a set of strings that are classified as positive and negative examples, and to construct the patterns while tracking simultaneously the occurrences in each input set. The algorithm has been successfully used for analyzing the full genome of the yeast for predicting certain regulatory elements.

Index Terms

Categories and Subject Descriptors:

General Terms:

Additional Key Words and Phrases: pattern discovery algorithms, suffix tree, analysis of DNA and protein sequences

Online Publications of Department of Computer Science, Anna Pienimäki

Department of Computer Science

Department information

Discovering frequent patterns from strings

Abstract

Index Terms