SKEDSOFT

Data Mining & Data Warehousing

Introduction: Given a biological sequence, such as a DNA sequence or an amino acid (protein) sequence, biologists would like to analyze what that sequence represents. In general, given sequences of symbols from some alphabet, we would like to represent the structure or statistical regularities of classes of sequences. In this section, we discuss Markov chains and hidden Markov models—probabilistic models that are well suited for this type of task. Other areas of research, such as speech and pattern recognition, face similar sequence analysis tasks.

To illustrate our discussion of Markov chains and hidden Markov models, we use a classic problem in biological sequence analysis: that of finding CpG islands in a DNA sequence. Here, the alphabet consists of four nucleotides, namely, A (adenine), C (cytosine), G (guanine), and T (thymine). CpG denotes a pair (or subsequence) of nucleotides, where G appears immediately after C along a DNA strand. The C in a CpG pair is often modified by a process known as methylation (where the C is replaced by methyl-C, which tends to mutate to T). As a result, CpG pairs occur infrequently in the human genome. However, methylation is often suppressed around promoters or “start” regions of many genes. These areas contain a relatively high concentration of CpG pairs, collectively referred to along a chromosome as CpG islands, which typically vary in length from a few hundred to a few thousand nucleotides. CpG islands are very useful in genome mapping projects.

Two important questions that biologists have when studying DNA sequences are (1) given a short sequence, is it from a CpG island or not? and (2) given a long sequence, can we find all of the CpG islands within it? We start our exploration of these questions by introducing Markov chains.

Markov Chain

A Markov chain is a model that generates sequences in which the probability of a symbol depends only on the previous symbol. A Markov chain model is defined by (a) a set of states, Q, which emit symbols and (b) a set of transitions between states. States are represented by circles and transitions are represented by arrows. Each transition has an associated transition probability, aij, which represents the conditional probability of going to state j in the next step, given that the current state is i. The sum of all transition probabilities out of a given state must equal 1, that is, Σj∈Q aij = 1 for all states i ∈ Q. If an arc is not shown, it is assumed to have a probability of 0. The transition probabilities can also be written as a transition matrix, A = {aij}.
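The definition above can be sketched in code. The sketch below models the four-nucleotide alphabet as states, where each row of the transition matrix sums to 1; the probability values are illustrative placeholders, not estimates trained from real genomic data.

```python
import math

# Hypothetical transition matrix A = {a_ij} for a CpG-island model.
# Each row lists P(next symbol | current symbol) and sums to 1.
# The numbers are illustrative only, not trained from real sequences.
A_PLUS = {
    "A": {"A": 0.18, "C": 0.27, "G": 0.43, "T": 0.12},
    "C": {"A": 0.17, "C": 0.37, "G": 0.27, "T": 0.19},
    "G": {"A": 0.16, "C": 0.34, "G": 0.38, "T": 0.12},
    "T": {"A": 0.08, "C": 0.36, "G": 0.38, "T": 0.18},
}

def chain_log_prob(seq, A, start=None):
    """Log probability of a sequence under a first-order Markov chain.

    start: optional dict of initial-state probabilities; if omitted,
    the first symbol is conditioned on (treated as probability 1).
    """
    logp = math.log(start[seq[0]]) if start else 0.0
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(A[prev][cur])  # one a_ij factor per transition
    return logp
```

Working in log space avoids numerical underflow, since the product of many probabilities quickly becomes too small to represent as a float.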

Hidden Markov Model

Given a long DNA sequence, how can we find all CpG islands within it? We could try the Markov chain method above, using a sliding window. For each window, we could compute the log-odds ratio. CpG islands within intersecting windows could be merged to determine CpG islands within the long sequence. This approach has some difficulties: It is not clear what window size to use, and CpG islands tend to vary in length.
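The sliding-window approach can be sketched as follows, assuming two first-order chains have already been estimated: one for CpG islands ("+") and one for the background ("-"). The log-odds score of a window is the log of the ratio of its probabilities under the two models, so a positive score suggests the window came from a CpG island. The matrix values here are made up for demonstration.

```python
import math

# Illustrative "+" (CpG island) and "-" (background) transition matrices;
# the values are placeholders, not estimated from real sequences.
A_PLUS = {
    "A": {"A": 0.18, "C": 0.27, "G": 0.43, "T": 0.12},
    "C": {"A": 0.17, "C": 0.37, "G": 0.27, "T": 0.19},
    "G": {"A": 0.16, "C": 0.34, "G": 0.38, "T": 0.12},
    "T": {"A": 0.08, "C": 0.36, "G": 0.38, "T": 0.18},
}
A_MINUS = {
    "A": {"A": 0.30, "C": 0.20, "G": 0.29, "T": 0.21},
    "C": {"A": 0.32, "C": 0.30, "G": 0.08, "T": 0.30},
    "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    "T": {"A": 0.18, "C": 0.24, "G": 0.29, "T": 0.29},
}

def log_odds(seq, A_plus, A_minus):
    """Log-odds score: log P(x | +) - log P(x | -), summed per transition."""
    s = 0.0
    for p, c in zip(seq, seq[1:]):
        s += math.log(A_plus[p][c] / A_minus[p][c])
    return s

def window_scores(seq, k, A_plus, A_minus):
    """Score every length-k sliding window; positive scores suggest CpG islands."""
    return [(i, log_odds(seq[i:i + k], A_plus, A_minus))
            for i in range(len(seq) - k + 1)]
```

As the text notes, the hard part is not the scoring but choosing the window size k, since real CpG islands vary considerably in length.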

What if, instead, we merge the two Markov chains from above (for CpG islands and non-CpG islands, respectively) and add transition probabilities between the two chains? The result is a hidden Markov model, as shown in Figure 8.10. The states are renamed by adding “+” and “-” labels to distinguish them. For readability, only the transitions between “+” and “-” states are shown, in addition to those for the begin and end states. Let π = π1π2 . . . πL be a path of states that generates a sequence of symbols, x = x1x2 . . . xL. In a Markov chain, the path through the chain for x is unique. With a hidden Markov model, however, different paths can generate the same sequence. For example, the states C+ and C- both emit the symbol C. Therefore, we say the model is “hidden” in that we do not know for sure which states were visited in generating the sequence. The transition probabilities between the original two models can be determined using training sequences containing transitions between CpG islands and non-CpG islands.

A Hidden Markov Model (HMM) is defined by

  • a set of states, Q
  • a set of transitions, where transition probability akl = P(πi = l|πi-1 = k) is the probability of transitioning from state k to state l for k, l ∈ Q
  • an emission probability, ek(b) = P(xi = b | πi = k), for each state, k, and each symbol, b, where ek(b) is the probability of seeing symbol b in state k. The sum of all emission probabilities at a given state must equal 1, that is, Σb ek(b) = 1 for each state, k.
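The three ingredients above (states, transitions, emissions) can be sketched directly in code. The model below is a simplified two-state version of the CpG example, with one “+” and one “-” state rather than eight labeled nucleotide states; all probability values are illustrative placeholders. The function computes the joint probability P(x, π) of a sequence together with a particular state path, which makes concrete the point that different paths can generate the same sequence with different probabilities.

```python
import math

# Minimal two-state HMM sketch; probabilities are illustrative only.
# In the merged CpG model of the text, each state (e.g., C+) emits a
# single symbol with probability 1; here emissions are shown generally.
TRANS = {               # a_kl = P(pi_i = l | pi_{i-1} = k)
    "+": {"+": 0.9, "-": 0.1},
    "-": {"+": 0.2, "-": 0.8},
}
EMIT = {                # e_k(b) = P(x_i = b | pi_i = k)
    "+": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "-": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
START = {"+": 0.5, "-": 0.5}   # assumed initial-state distribution

def joint_log_prob(path, seq):
    """log P(x, pi): probability of emitting seq along the given state path."""
    logp = math.log(START[path[0]]) + math.log(EMIT[path[0]][seq[0]])
    for i in range(1, len(seq)):
        logp += math.log(TRANS[path[i - 1]][path[i]])  # a_kl term
        logp += math.log(EMIT[path[i]][seq[i]])        # e_k(b) term
    return logp
```

Evaluating joint_log_prob("++", "CG") versus joint_log_prob("+-", "CG") shows two distinct paths producing the same observed sequence, which is exactly why the state sequence is “hidden.”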