Guest lecture: Frequency-related mining of string databases
Johannes Fischer, Ludwig-Maximilians-Universität München, Institut für Informatik, Algorithmische Bioinformatik
on Monday 11th June at 13:15 Exactum, room B222
Abstract:
Frequent String Mining and Emerging Substring Mining are adaptions of classical itemset mining tasks to the domain of strings. In the first part of the talk we will show how to solve such frequency-related data mining tasks in optimal time, i.e., in time linear in the size of the input and the output. The algorithm uses classic algorithmic tools and techniques such as suffix- and LCP-arrays. We show by empirical tests that our method outperformes existing implementations for frequent string mining, by several orders of magnitude. It is also the first approach applicable to realistically sized data sets, as previous approaches either use too much space, or are too slow to be used in practice.
Another building block of our algorithm is a fast and space-conscious preprocessing scheme for Range Minimum Queries (RMQs), which we will touch on in the second part of the talk. An RMQ asks for the minimum element between two given indices in an array. It is well known that RMQs are linearly equivalent to computing lowest common ancestors in trees, and can therefore, in principle, be answered in constant time, after a linear preprocessing step. We, however, will give a direct linear-time preprocessing scheme such that subsequent RMQs can be answered in constant time. It uses only 2n+o(n) bits of space (n being the length of the array), which is shown to be optimal under a suitable model of computation. The only other succinct representation of RMQ-information due to Sadakane (2007) uses 4n+o(n) bits of space, but needs O(n log(n)) bits at construction time. Our algorithm overcomes this memory bottleneck, as it never uses more space than the final data structure occupies.
Welcome!