AbstractGiven two sequences A = a1a2...am and B = b1b2...bn, m <= n, over some alphabet sigma of size s, the longest common subsequence problem is to find a sequence of greatest possible length, p, that can be obtained from both A and B by deleting zero or more (not necessarily adjacent) symbols. A new algorithm that is efficient for both short and long longest common subsequences is presented. It also improves on previous methods for longest common subsequences of intermediate length. Thus, it is more flexible and can be used for a wider range of applications than others. The algorithm is based on the well-known paradigm of computing dominant matches and was obtained by observing additional structural properties leading to a kind of dualization. The worst-case running time of the algorithm is O(ns+min{pm,p(n-p)}). Some experimental results which prove the practicability of the new method are given, too.
Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems; G.2.1 [Discrete Mathematics]: Combinatorics
Additional Key Words and Phrases: algorithms, longest common subsequence, string processing, combinatorial pattern matching, computational biology
Selected references
- David Eppstein, Zvi Galil, Faffaele Giancarlo, and Giuseppe F. Italiano. Sparse dynamic programming I: Linear cost functions. Journal of the ACM, 39(3):519-545, July 1992.
- Daniel S. Hirschberg. Algorithms for the longest common subsequence problem. Journal of the ACM, 24(4):664-675, October 1977.