A supervised learning agent needs to be told the correct move for each position it encounters, but such feedback is seldom available. In the absence of feedback from a teacher, an agent can learn a transition model for its own moves and can perhaps learn to predict the opponent's moves, but without some feedback about what is good and what is bad, the agent will have no grounds for deciding which move to make. The agent needs to know that something good has happened when it (accidentally) checkmates the opponent, and that something bad has happened when it is checkmated (or vice versa, if the game is suicide chess). This kind of feedback is called a reward, or reinforcement.

In games like chess, the reinforcement is received only at the end of the game. In other environments, the rewards come more frequently: in ping-pong, each point scored can be considered a reward; when learning to crawl, any forward motion is an achievement. Our framework for agents regards the reward as part of the input percept, but the agent must be "hardwired" to recognize that part as a reward rather than as just another sensory input. Thus, animals seem to be hardwired to recognize pain and hunger as negative rewards and pleasure and food intake as positive rewards. Reinforcement has been carefully studied by animal psychologists for over 60 years.

An optimal policy is a policy that maximizes the expected total reward. The task of reinforcement learning is to use observed rewards to learn an optimal (or nearly optimal) policy for the environment, without prior knowledge of either the environment's transition model or the reward function. Imagine playing a new game whose rules you do not know; after a hundred or so moves, your opponent announces, "You lose." This is reinforcement learning in a nutshell.
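One standard way to make "expected total reward" precise (a sketch, assuming a discounted infinite-horizon formulation with reward function R and discount factor γ, 0 < γ < 1, neither of which is defined in the passage above) is

\[
\pi^{*} \;=\; \operatorname*{argmax}_{\pi}\; \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, R(S_t) \;\middle|\; \pi \right],
\]

where the expectation is taken over the state sequences produced by following the policy π.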

In many complex domains, reinforcement learning is the only feasible way to train a program to perform at high levels. For example, in game playing, it is very hard for a human to provide accurate and consistent evaluations of large numbers of positions, which would be needed to train an evaluation function directly from examples. Instead, the program can be told when it has won or lost, and it can use this information to learn an evaluation function that gives reasonably accurate estimates of the probability of winning from any given position. Similarly, it is extremely difficult to program an agent to fly a helicopter; yet given appropriate negative rewards for crashing, wobbling, or deviating from a set course, an agent can learn to fly by itself.

Reinforcement learning might be considered to encompass all of AI: an agent is placed in an environment and must learn to behave successfully therein. To keep the chapter manageable, we will concentrate on simple environments and simple agent designs. For the most part, we will assume a fully observable environment, so that the current state is supplied by each percept. On the other hand, we will assume that the agent does not know how the environment works or what its actions do, and we will allow for probabilistic action outcomes. Thus, the agent faces an unknown Markov decision process. We will consider three agent designs:
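To make the setting concrete, here is a minimal sketch of the interaction loop such an agent faces. The environment interface (reset, step, terminal) and the agent methods (choose, learn) are hypothetical names used only for illustration; the key point is that the agent receives only (state, reward) percepts and never sees the transition model itself.

    def run_episode(env, agent):
        """One episode of an agent acting in an unknown, fully observable MDP."""
        state, reward = env.reset()                 # initial percept: a state plus its reward
        while not env.terminal(state):
            action = agent.choose(state)            # decide using whatever has been learned so far
            next_state, reward = env.step(action)   # outcome is probabilistic and not known in advance
            agent.learn(state, action, reward, next_state)
            state = next_state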

  • A utility-based agent learns a utility function on states and uses it to select actions that maximize the expected outcome utility.
  • A Q-learning agent learns an action-utility function, or Q-function, giving the expected utility of taking a given action in a given state.
  • A reflex agent learns a policy that maps directly from states to actions; a short sketch contrasting these three designs follows the list.
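
The practical difference between the three designs shows up in how each one selects an action. The following sketch is illustrative only: the utility table U, the transition model (returning (successor, probability) pairs), the Q table, and the policy dictionary are assumed data structures, not anything defined in the text.

    def utility_based_action(state, U, model, actions):
        # Needs a transition model to evaluate the states its actions lead to.
        return max(actions,
                   key=lambda a: sum(p * U[s2] for s2, p in model(state, a)))

    def q_learning_action(state, Q, actions):
        # Compares action utilities Q(state, action) directly; no model required.
        return max(actions, key=lambda a: Q[(state, a)])

    def reflex_action(state, policy):
        # A learned policy maps states straight to actions.
        return policy[state]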

A utility-based agent must also have a model of the environment in order to make decisions, because it must know the states to which its actions will lead. For example, in order to make use of a backgammon evaluation function, a backgammon program must know what its legal moves are and how they affect the board position. Only in this way can it apply the utility function to the outcome states. A Q-learning agent, on the other hand, can compare the expected utilities for its available choices without needing to know their outcomes, so it does not need a model of the environment. Because they do not know where their actions lead, however, Q-learning agents cannot look ahead; this can seriously restrict their ability to learn, as we shall see.
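A standard illustration of this model-free property is the Q-learning update itself, sketched below; the learning rate alpha and discount factor gamma are assumed parameters, and Q is simply a table of estimates. The update uses only the observed transition (s, a, reward, s_next), never a transition model, which is exactly why such an agent can act without knowing where its actions lead.

    def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
        # Adjust Q(s, a) toward the observed reward plus the best estimated
        # value of the successor state; no model of the environment is used.
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])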