The Rabin-karp Algorithm

Figure 32.5: The Rabin-Karp algorithm. Each character is a decimal digit, and we compute values modulo 13. (a) A text string. A window of length 5 is shown shaded. The numerical value of the shaded number is computed modulo 13, yielding the value 7. (b) The same text string with values computed modulo 13 for each possible position of a length-5 window. Assuming the pattern P = 31415, we look for windows whose value modulo 13 is 7, since 31415 ≡ 7 (mod 13). Two such windows are found, shown shaded in the figure. The first, beginning at text position 7, is indeed an occurrence of the pattern, while the second, beginning at text position 13, is a spurious hit. (c) Computing the value for a window in constant time, given the value for the previous window. The first window has value 31415. Dropping the high-order digit 3, shifting left (multiplying by 10), and then adding in the low-order digit 2 gives us the new value 14152. All computations are performed modulo 13, however, so the value for the first window is 7, and the value computed for the new window is 8.

The solution of working modulo q is not perfect, however, since t_s ≡ p (mod q) does not imply that t_s = p. On the other hand, if t_s ≢ p (mod q), then we definitely have that t_s ≠ p, so that shift s is invalid. We can thus use the test t_s ≡ p (mod q) as a fast heuristic test to rule out invalid shifts s. Any shift s for which t_s ≡ p (mod q) must be tested further to see if s is really valid or we just have a spurious hit. This testing can be done by explicitly checking the condition P[1 ‥ m] = T [s 1 ‥ s m]. If q is large enough, then we can hope that spurious hits occur infrequently enough that the cost of the extra checking is low.

The following procedure makes these ideas precise. The inputs to the procedure are the text T , the pattern P, the radix d to use (which is typically taken to be |Σ|), and the prime q to use.

	RABIN-KARP-MATCHER(T, P, d, q)
 1 n ← length[T]
 2 m ← length[P]
 3 h ← dm-1 mod q
 4 p ← 0
 5 t0 ← 0
 6 for i ← 1 to m           ▹ Preprocessing.
 7     do p ← (dp   P[i]) mod q
 8        t0 ← (dt0   T[i]) mod q
 9 for s ← 0 to n - m       ▹ Matching.
10     do if p = t_s
11           then if P[1 ‥ m] = T [s   1 ‥ s   m]
12                   then print "Pattern occurs with shift" s
13        if s < n - m
14           then ts 1 ← (d(t_s - T[s   1]h)   T[s   m   1]) mod q

The procedure RABIN-KARP-MATCHER works as follows. All characters are interpreted as radix-d digits. The subscripts on t are provided only for clarity; the program works correctly if all the subscripts are dropped. Line 3 initializes h to the value of the high-order digit position of an m-digit window. Lines 4-8 compute p as the value of P[1 ‥ m] mod q and t₀ as the value of T [1 ‥ m] mod q. The for loop of lines 9-14 iterates through all possible shifts s, maintaining the following invariant:

Whenever line 10 is executed, t_s = T[s 1‥ s m] mod q.

If p = t_s in line 10 (a "hit"), then we check to see if P[1 ‥ m] = T [s 1 ‥ s m] in line 11 to rule out the possibility of a spurious hit. Any valid shifts found are printed out on line 12. If s < n - m (checked in line 13), then the for loop is to be executed at least one more time, and so line 14 is first executed to ensure that the loop invariant holds when line 10 is again reached. Line 14 computes the value of t_{s 1} mod q from the value of t_s mod q in constant time using equation (32.2) directly.

RABIN-KARP-MATCHER takes Θ(m) preprocessing time, and its matching time is Θ((n - m 1)m) in the worst case, since (like the naive string-matching algorithm) the Rabin-Karp algorithm explicitly verifies every valid shift. If P = a^m and T = aⁿ, then the verifications take time Θ((n - m 1)m), since each of the n - m 1 possible shifts is valid.

In many applications, we expect few valid shifts (perhaps some constant c of them); in these applications, the expected matching time of the algorithm is only O((n - m 1) cm) = O(n m), plus the time required to process spurious hits. We can base a heuristic analysis on the assumption that reducing values modulo q acts like a random mapping from Σ* to Z_q. (See the discussion on the use of division for hashing in division method. It is difficult to formalize and prove such an assumption, although one viable approach is to assume that q is chosen randomly from integers of the appropriate size. We shall not pursue this formalization here.) We can then expect that the number of spurious hits is O(n/q), since the chance that an arbitrary t_s will be equivalent to p, modulo q, can be estimated as 1/q. Since there are O(n) positions at which the test of line 10 fails and we spend O(m) time for each hit, the expected matching time taken by the Rabin-Karp algorithm is

O(n) O(m(v n/q)),

where v is the number of valid shifts. This running time is O(n) if v = O(1) and we choose q ≥ m. That is, if the expected number of valid shifts is small (O(1)) and the prime q is chosen to be larger than the length of the pattern, then we can expect the Rabin-Karp procedure to use only O(n m) matching time. Since m ≤ n, this expected matching time is O(n).

SKEDSOFT