Longest Common Subsequence

In biological applications, we often want to compare the DNA of two (or more) different organisms. A strand of DNA consists of a string of molecules called bases, where the possible bases are adenine, guanine, cytosine, and thymine. Representing each of these bases by its initial letter, a strand of DNA can be expressed as a string over the finite set {A, C, G, T}. For example, the DNA of one organism may be S₁ = ACCGGTCGAGTGCGCGGAAGCCGGCCGAA, while the DNA of another organism may be S₂ = GTCGTTCGGAATGCCGTTGCTCTGTAAA.

One goal of comparing two strands of DNA is to determine how "similar" the two strands are, as some measure of how closely related the two organisms are. Similarity can be defined in many different ways. For example, we can say that two DNA strands are similar if one is a substring of the other. In our example, neither S₁ nor S₂ is a substring of the other. Alternatively, we could say that two strands are similar if the number of changes needed to turn one into the other is small. Yet another way to measure the similarity of strands S₁ and S₂ is by finding a third strand S₃ in which the bases in S₃ appear in each of S₁ and S₂; these bases must appear in the same order, but not necessarily consecutively. The longer the strand S₃ we can find, the more similar S₁ and S₂ are. In our example, the longest strand S₃ is GTCGTCGGAAGCCGGCCGAA.

We formalize this last notion of similarity as the longest-common-subsequence problem. A subsequence of a given sequence is just the given sequence with zero or more elements left out. Formally, given a sequence X = 〈x₁, x₂, ..., x_m〉, another sequence Z = 〈z₁, z₂, ..., z_k〉 is a subsequence of X if there exists a strictly increasing sequence 〈i₁, i₂, ..., i_k〉 of indices of X such that for all j = 1, 2, ..., k, we have x_{i_j} = z_j. For example, Z = 〈B, C, D, B〉 is a subsequence of X = 〈A, B, C, B, D, A, B〉 with corresponding index sequence 〈2, 3, 5, 7〉.
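The definition above can be checked mechanically. The following is a minimal sketch, not part of the original text: the function name `subsequence_indices` is an illustrative assumption, and it uses a greedy leftmost scan to produce one valid index sequence (in general there may be several).

```python
def subsequence_indices(z, x):
    """Return a strictly increasing list of 1-based indices i_1 < ... < i_k
    with x[i_j] == z[j], or None if z is not a subsequence of x.

    Hypothetical helper for illustration; it scans x left to right and
    matches each element of z at the earliest possible position.
    """
    indices = []
    pos = 0  # 0-based position in x where the next search starts
    for elem in z:
        # Skip elements of x that do not match the current element of z.
        while pos < len(x) and x[pos] != elem:
            pos += 1
        if pos == len(x):
            return None  # ran out of x before matching all of z
        indices.append(pos + 1)  # report 1-based indices, as in the text
        pos += 1  # the next match must come strictly later
    return indices


print(subsequence_indices("BCDB", "ABCBDAB"))  # [2, 3, 5, 7]
print(subsequence_indices("BDC", "ABCBDAB"))   # None
```

The scan visits each element of X at most once, so the check takes time linear in the length of X.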

Given two sequences X and Y, we say that a sequence Z is a common subsequence of X and Y if Z is a subsequence of both X and Y. For example, if X = 〈A, B, C, B, D, A, B〉 and Y = 〈B, D, C, A, B, A〉, the sequence 〈B, C, A〉 is a common subsequence of both X and Y. The sequence 〈B, C, A〉 is not a longest common subsequence (LCS) of X and Y, however, since it has length 3 and the sequence 〈B, C, B, A〉, which is also common to both X and Y, has length 4. The sequence 〈B, C, B, A〉 is an LCS of X and Y, as is the sequence 〈B, D, A, B〉, since there is no common subsequence of length 5 or greater.
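As a quick check of the examples in this paragraph, here is a small sketch that tests whether a candidate is a common subsequence. The helper names `is_subsequence` and `is_common_subsequence` are assumptions introduced for illustration only.

```python
def is_subsequence(z, x):
    """Return True if z can be obtained from x by leaving out elements."""
    it = iter(x)
    # `elem in it` consumes the iterator up to and including the first match,
    # so successive matches necessarily occur at increasing positions of x.
    return all(elem in it for elem in z)


def is_common_subsequence(z, x, y):
    """Return True if z is a subsequence of both x and y."""
    return is_subsequence(z, x) and is_subsequence(z, y)


X = "ABCBDAB"
Y = "BDCABA"
for candidate in ("BCA", "BCBA", "BDAB", "ABCD"):
    print(candidate, is_common_subsequence(candidate, X, Y))
```

The first three candidates are the common subsequences named in the text; the last one, 〈A, B, C, D〉, is a subsequence of X but not of Y.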

In the longest-common-subsequence problem, we are given two sequences X = 〈x₁, x₂, ..., x_m〉 and Y = 〈y₁, y₂, ..., y_n〉 and wish to find a maximum-length common subsequence of X and Y. This section shows that the LCS problem can be solved efficiently using dynamic programming.

Step 1: Characterizing a longest common subsequence

A brute-force approach to solving the LCS problem is to enumerate all subsequences of X and check each subsequence to see if it is also a subsequence of Y, keeping track of the longest subsequence found. Each subsequence of X corresponds to a subset of the indices {1, 2, ..., m} of X. There are 2^m subsequences of X, so this approach requires exponential time, making it impractical for long sequences.
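The brute-force approach just described can be sketched as follows. This is an illustrative implementation under stated assumptions, not code from the original text: the names `lcs_brute_force` and `is_subsequence` are hypothetical, and index subsets are generated with `itertools.combinations`.

```python
from itertools import combinations


def is_subsequence(z, y):
    """Return True if z can be obtained from y by leaving out elements."""
    it = iter(y)
    return all(elem in it for elem in z)


def lcs_brute_force(x, y):
    """Enumerate every subsequence of x and keep the longest one that is
    also a subsequence of y. Examines all 2^m index subsets of x, so it is
    only usable for very short inputs."""
    best = []
    for k in range(len(x) + 1):
        # Each k-element subset of indices, taken in increasing order,
        # picks out one subsequence of x of length k.
        for idx in combinations(range(len(x)), k):
            candidate = [x[i] for i in idx]
            if len(candidate) > len(best) and is_subsequence(candidate, y):
                best = candidate
    return best


result = lcs_brute_force("ABCBDAB", "BDCABA")
print(len(result))  # 4
```

Even for the seven-element X of the running example the loop visits 2^7 = 128 index subsets, and the count doubles with every additional element of X.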

The LCS problem has an optimal-substructure property, however, as the following theorem shows. As we shall see, the natural classes of subproblems correspond to pairs of "prefixes" of the two input sequences. To be precise, given a sequence X = 〈x₁, x₂, ..., x_m〉, we define the ith prefix of X, for i = 0, 1, ..., m, as X_i = 〈x₁, x₂, ..., x_i〉. For example, if X = 〈A, B, C, B, D, A, B〉, then X₄ = 〈A, B, C, B〉 and X₀ is the empty sequence.

Theorem 15.1: (Optimal substructure of an LCS)

Let X = 〈x₁, x₂, ..., x_m〉 and Y = 〈y₁, y₂, ..., y_n〉 be sequences, and let Z = 〈z₁, z₂, ..., z_k〉 be any LCS of X and Y.

1. If x_m = y_n, then z_k = x_m = y_n and Z_{k-1} is an LCS of X_{m-1} and Y_{n-1}.
2. If x_m ≠ y_n, then z_k ≠ x_m implies that Z is an LCS of X_{m-1} and Y.
3. If x_m ≠ y_n, then z_k ≠ y_n implies that Z is an LCS of X and Y_{n-1}.

Proof: (1) If z_k ≠ x_m, then we could append x_m = y_n to Z to obtain a common subsequence of X and Y of length k + 1, contradicting the supposition that Z is a longest common subsequence of X and Y. Thus, we must have z_k = x_m = y_n. Now, the prefix Z_{k-1} is a length-(k - 1) common subsequence of X_{m-1} and Y_{n-1}. We wish to show that it is an LCS. Suppose for the purpose of contradiction that there is a common subsequence W of X_{m-1} and Y_{n-1} with length greater than k - 1. Then, appending x_m = y_n to W produces a common subsequence of X and Y whose length is greater than k, which is a contradiction.

(2) If z_k ≠ x_m, then Z is a common subsequence of X_{m-1} and Y. If there were a common subsequence W of X_{m-1} and Y with length greater than k, then W would also be a common subsequence of X_m and Y, contradicting the assumption that Z is an LCS of X and Y.

(3) The proof is symmetric to (2).

The characterization of Theorem 15.1 shows that an LCS of two sequences contains within it an LCS of prefixes of the two sequences. Thus, the LCS problem has an optimal-substructure property. A recursive solution also has the overlapping-subproblems property, as we shall see in a moment.
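To make the optimal-substructure property concrete, the three cases of Theorem 15.1 can be transcribed directly into a recursion on prefix lengths. The sketch below is an illustration under assumptions, not the method developed in the text: the names `lcs_length` and `solve` are hypothetical, it computes only the length of an LCS, and because it deliberately omits any table or memoization it can take exponential time.

```python
def lcs_length(x, y):
    """Length of an LCS of x and y, by naive recursion on prefixes.

    solve(m, n) handles the prefixes X_m and Y_n (their first m and n
    elements). No results are cached, so the same prefix pairs may be
    solved many times.
    """
    def solve(m, n):
        if m == 0 or n == 0:
            return 0  # an empty prefix has only the empty subsequence
        if x[m - 1] == y[n - 1]:
            # Case 1: the last elements match and end some LCS.
            return 1 + solve(m - 1, n - 1)
        # Cases 2 and 3: some LCS omits x_m or omits y_n; take the better.
        return max(solve(m - 1, n), solve(m, n - 1))

    return solve(len(x), len(y))


print(lcs_length("ABCBDAB", "BDCABA"))  # 4
```

When the last elements differ, both branches can reach the same pair of shorter prefixes, which is where the repeated work in this naive version comes from.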
