Veli Mäkinen
vmakinen@cs.Helsinki.FI
Department of Computer Science, P.O Box 26 (Teollisuuskatu 23)
FIN-00014 University of Helsinki, Finland.
Abstract. Approximate string matching problem is to find all
occurrences of a pattern of size m in a text of size n allowing at most
k errors (insertions, deletions, and substitions). Dynamic programming
can be applied to the problem, providing a O(mn) search method in the worst
case, and O(kn) in the average case. Using bit-parallelism, the dynamic
programming algorithms can be improved by a factor w, where w is the length
of the computer word. Another way to speed up approximate string matching
is indexing; a static text can be preprocessed by constructing an index
structure for the text such that queries of its context can be answered
quickly. Exact string matching queries can be answered in O(m) time using
a suffix tree as an index structure, but a similar straightforward
method for approximate string matching requires a depth-first search on
suffix tree (limited to the level m+k+1), and has time complexity O(c^(m+k)),
where c is the size of the alphabet.
In this thesis, two new bit-parallel algorithms
are examined and compared. Modifications for both algorithms are presented
such that they can be used in depth-first search based queries on suffix
indices, like suffix tree, suffix cactus, and suffix array.
The speed up against the basic strategy (depth-first search combined with
O(mn) dynamic programming algorithm) is experimentally verified.
The use of bit-parallel algorithms do not ease the
problem with depth-first search; for long patterns and high error levels,
searching may be faster with online-algorithms. Pattern partitioning can
be used to overcome this problem. Pattern can be partitioned into smaller
pieces so that each piece is searched with fewer errors. All the occurrences
of these pieces are checked against the pattern, and the matching ones
are reported. This has the advantage that the depth-first search is limited
earlier, but then again there are more occurrences to be checked. This
is a tradeoff-problem, where the number of pattern partitions should be
chosen to minimize the overall cost of the search.
In this thesis, a new method to estimate the optimal
pattern partitioning is developped, based on the statistics of the text
obtained from the suffix tree combined with open form formulas for the
probability of an approximate match. Comparison against an estimate, that
is based on the asymtotic behaviour of the probability of an approximate
match, shows that the new method is noticeably more accurate. Still, the
choice of the right partitioning is shown to be crucial to gain a significant
speed up against online-algorithms, because often only one partitioning
stands out from the rest. The estimates used in the thesis are still too
inaccurate to predict the right partitioning every time. However, the thesis
shows that the use of text statistics obtained from the suffix tree is
a promising strategy, and should be the subject of a more extensive study.