
Artificial Intelligence

Introduction: The key concepts in statistical learning are data and hypotheses. Here, the data are evidence, that is, instantiations of some or all of the random variables describing the domain. The hypotheses are probabilistic theories of how the domain works, including logical theories as a special case.

Let us consider a very simple example. Our favorite Surprise candy comes in two flavors: cherry (yum) and lime (ugh). The candy manufacturer has a peculiar sense of humor and wraps each piece of candy in the same opaque wrapper, regardless of flavor. The candy is sold in very large bags, of which there are known to be five kinds—again, indistinguishable from the outside:

h1: 100% cherry
h2: 75% cherry + 25% lime
h3: 50% cherry + 50% lime
h4: 25% cherry + 75% lime
h5: 100% lime

Given a new bag of candy, the random variable H (for hypothesis) denotes the type of the bag, with possible values h1 through h5. H is not directly observable, of course. As the pieces of candy are opened and inspected, data are revealed: D1, D2, ..., DN, where each Di is a random variable with possible values cherry and lime. The basic task faced by the agent is to predict the flavor of the next piece of candy. Despite its apparent triviality, this scenario serves to introduce many of the major issues. The agent really does need to infer a theory of its world, albeit a very simple one.
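
As a purely illustrative sketch of this setup in Python, the snippet below fixes the cherry fraction for each hypothesis, hides the true bag type, and generates flavor observations; the names CHERRY_FRACTION and open_candies are our own, not part of the text.

```python
import random

# Each hypothesis fixes the proportion of cherry candies in the bag.
CHERRY_FRACTION = {"h1": 1.00, "h2": 0.75, "h3": 0.50, "h4": 0.25, "h5": 0.00}

def open_candies(true_bag, n, rng=random):
    """Simulate unwrapping n candies from a bag whose (hidden) type is true_bag."""
    p_cherry = CHERRY_FRACTION[true_bag]
    return ["cherry" if rng.random() < p_cherry else "lime" for _ in range(n)]

# The agent never observes H directly; it only sees the flavors it unwraps.
observations = open_candies("h5", 10)
print(observations)  # an all-lime bag yields ten "lime" observations
```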

Bayesian learning simply calculates the probability of each hypothesis, given the data, and makes predictions on that basis. That is, the predictions are made by using all the hypotheses, weighted by their probabilities, rather than by using just a single “best” hypothesis. In this way, learning is reduced to probabilistic inference. Let D represent all the data, with observed value d; then the probability of each hypothesis is obtained by Bayes’ rule:

P(hi|d) = α P(d|hi) P(hi)          (1)
where α is a normalizing constant that makes the posterior probabilities sum to 1.
Now, suppose we want to make a prediction about an unknown quantity X. Then we have
P(X|d) = Σi P(X|d, hi) P(hi|d) = Σi P(X|hi) P(hi|d)          (2)
where we have assumed that each hypothesis determines a probability distribution over X.
This equation shows that predictions are weighted averages over the predictions of the individual hypotheses. The hypotheses themselves are essentially “intermediaries” between the raw data and the predictions. The key quantities in the Bayesian approach are the hypothesis prior, P(hi), and the likelihood of the data under each hypothesis, P(d|hi).
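These two equations translate almost line for line into code. The following minimal Python sketch (our own illustrative code; the function names posterior and predict are hypothetical) computes Equation (1) by normalizing likelihood times prior, and Equation (2) as a posterior-weighted average of the per-hypothesis predictions.

```python
def posterior(prior, likelihood):
    """Equation (1): P(hi | d) = alpha * P(d | hi) * P(hi)."""
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    alpha = 1.0 / sum(unnormalized.values())  # normalizing constant
    return {h: alpha * p for h, p in unnormalized.items()}

def predict(posterior_probs, p_x_given_h):
    """Equation (2): P(X | d) = sum over i of P(X | hi) * P(hi | d)."""
    return sum(p_x_given_h[h] * posterior_probs[h] for h in posterior_probs)
```

Here prior, likelihood, and p_x_given_h are dictionaries keyed by hypothesis; the candy example below supplies concrete values for them.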
For our candy example, we will assume for the time being that the prior distribution over h1, ..., h5 is given by ⟨0.1, 0.2, 0.4, 0.2, 0.1⟩, as advertised by the manufacturer. The likelihood of the data is calculated under the assumption that the observations are i.i.d. (independently and identically distributed), so that
P(d|hi) = Πj P(dj|hi)          (3)
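Concretely, under this assumption the likelihood of a flavor sequence is just the product of the per-candy probabilities. Here is a minimal sketch (our own code, with a hypothetical P_LIME table read off the five hypotheses above):

```python
import math

# P(lime | hi) for each bag type, read off the five hypotheses.
P_LIME = {"h1": 0.00, "h2": 0.25, "h3": 0.50, "h4": 0.75, "h5": 1.00}

def likelihood(observations, h):
    """Equation (3): P(d | hi) = product over j of P(dj | hi)."""
    return math.prod(P_LIME[h] if d == "lime" else 1.0 - P_LIME[h]
                     for d in observations)

print(likelihood(["lime"] * 10, "h3"))  # 0.5**10, roughly 0.001
```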
For example, suppose the bag is really an all-lime bag (h5) and the first 10 candies are all lime; then P(d|h3) is 0.5^10, because half the candies in an h3 bag are lime. Figure 20.1(a) shows how the posterior probabilities of the five hypotheses change as the sequence of 10 lime candies is observed. Notice that the probabilities start out at their prior values, so h3 is initially the most likely choice and remains so after 1 lime candy is unwrapped. After 2 lime candies are unwrapped, h4 is most likely; after 3 or more, h5 (the dreaded all-lime bag) is the most likely. After 10 in a row, we are fairly certain of our fate. Figure 20.1(b) shows the predicted probability that the next candy is lime, based on Equation (2). As we would expect, it increases monotonically toward 1.

Figure 20.1 (a) Posterior probabilities P(hi | d1, ..., dN) from Equation (1). The number of observations N ranges from 1 to 10, and each observation is of a lime candy. (b) Bayesian prediction P(dN+1 = lime | d1, ..., dN) from Equation (2).
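
Putting the pieces together, the short sketch below (our own code, no plotting) recomputes the numbers behind Figure 20.1: after each of the 10 lime candies it updates the posterior over h1, ..., h5 using Equations (1) and (3) and prints the Equation (2) prediction that the next candy is lime.

```python
P_LIME = {"h1": 0.00, "h2": 0.25, "h3": 0.50, "h4": 0.75, "h5": 1.00}
PRIOR  = {"h1": 0.10, "h2": 0.20, "h3": 0.40, "h4": 0.20, "h5": 0.10}

posterior = dict(PRIOR)
for n in range(1, 11):                       # unwrap 10 lime candies, one at a time
    # Incremental Bayesian update for one more lime observation.
    unnorm = {h: posterior[h] * P_LIME[h] for h in posterior}
    alpha = 1.0 / sum(unnorm.values())
    posterior = {h: alpha * p for h, p in unnorm.items()}
    # Equation (2): probability that candy n+1 is also lime.
    p_next_lime = sum(P_LIME[h] * posterior[h] for h in posterior)
    print(n, {h: round(p, 3) for h, p in posterior.items()}, round(p_next_lime, 3))
```

Running this reproduces the behavior described above: h3 leads after one lime candy, h4 after two, h5 from three onward, and the predicted probability of lime climbs toward 1.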

  • The example shows that the true hypothesis eventually dominates the Bayesian prediction. This is characteristic of Bayesian learning. For any fixed prior that does not rule out the true hypothesis, the posterior probability of any false hypothesis will eventually vanish, simply because the probability of generating “uncharacteristic” data indefinitely is vanishingly small. (This point is analogous to one made in the discussion of PAC learning in Chapter 18.) More importantly, the Bayesian prediction is optimal, whether the data set be small or large. Given the hypothesis prior, any other prediction will be correct less often.
  • The optimality of Bayesian learning comes at a price, of course. For real learning problems, the hypothesis space is usually very large or infinite. In some cases, the summation in Equation (2) (or integration, in the continuous case) can be carried out tractably, but in most cases we must resort to approximate or simplified methods.
  • A very common approximation, one that is usually adopted in science, is to make predictions based on a single most probable hypothesis, that is, an hi that maximizes P(hi|d). This is often called a maximum a posteriori or MAP (pronounced “em-ay-pee”) hypothesis. Predictions made according to an MAP hypothesis hMAP are approximately Bayesian to the extent that P(X|d) ≈ P(X|hMAP). In our candy example, hMAP = h5 after three lime candies in a row, so the MAP learner then predicts that the fourth candy is lime with probability 1.0, a much more dangerous prediction than the Bayesian prediction of 0.8 shown in Figure 20.1(b); see the sketch below. As more data arrive, the MAP and Bayesian predictions become closer, because the competitors to the MAP hypothesis become less and less probable. Although our example doesn’t show it, finding MAP hypotheses is often much easier than Bayesian learning, because it requires solving an optimization problem instead of a large summation (or integration) problem.
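
To make the contrast concrete, here is a small sketch (our own code, using the same hypothetical P_LIME and PRIOR tables as above) that compares the two predictions after three lime candies: the MAP learner commits to h5 and predicts lime with probability 1.0, while the full Bayesian mixture predicts about 0.8.

```python
P_LIME = {"h1": 0.00, "h2": 0.25, "h3": 0.50, "h4": 0.75, "h5": 1.00}
PRIOR  = {"h1": 0.10, "h2": 0.20, "h3": 0.40, "h4": 0.20, "h5": 0.10}

# Posterior after three lime candies, via Equations (1) and (3).
unnorm = {h: PRIOR[h] * P_LIME[h] ** 3 for h in PRIOR}
alpha = 1.0 / sum(unnorm.values())
post = {h: alpha * p for h, p in unnorm.items()}

h_map = max(post, key=post.get)              # MAP hypothesis: h5
map_prediction = P_LIME[h_map]               # predicts lime with probability 1.0
bayes_prediction = sum(P_LIME[h] * post[h] for h in post)  # about 0.8
print(h_map, map_prediction, round(bayes_prediction, 3))
```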