RL Notes 1: Basics
These notes are based on Siraj Raval’s reinforcement learning course. This week was about the basic concepts of Markov decision processes (MDPs).
First, what is reinforcement learning (RL)? It is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward (from Wikipedia). For example, an AI might learn to play chess by playing many games, and updating its strategy based on its experiences.
A state is a possible condition the agent can be in within the given environment.
Markov decision processes (MDPs) refer to settings where the current state s contains all the available information about the future from s. In other words, the future from s is conditionally independent of the past given that we know s.
For example, think of chess. How we arrived at a particular board position doesn’t matter; the current board position contains all we need to know to continue the game.
Actions are the decisions the agent can take. We can denote the set of actions in the state s as a(s).
Rewards are what the agent is trying to obtain. They are real-valued functions of a state, an action, or a state-action combination.
Toy Example: Gridworld
The states are the squares that Mario can be on. The actions are moving up, down, left, right. There are two states with rewards: the top right with +1, and the one below it with -1. To make the game end, we can regard these two as terminal states: the game ends if you land on one of these squares.
The example has the Markov property: how you end up on a square doesn’t influence the future of the game.
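To make the pieces concrete, here is a minimal Python sketch of such a gridworld. The notes do not state the grid size, so the 4 x 3 layout, the coordinate convention, the `step` helper, and the rule that bumping into the edge leaves the agent in place are all assumptions made for illustration.

```python
# Hypothetical layout: 4 columns x 3 rows (the notes do not give the grid size).
N_COLS, N_ROWS = 4, 3

# States are (column, row) squares, with row 0 at the top.
STATES = [(c, r) for r in range(N_ROWS) for c in range(N_COLS)]

# Actions are moves, stored as (d_column, d_row) offsets.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

# Terminal squares and the reward for landing on them:
# +1 in the top right, -1 directly below it.
TERMINAL_REWARDS = {(3, 0): 1.0, (3, 1): -1.0}


def step(state, action):
    """Deterministic transition: return (next_state, reward).

    Assumption: a move that would leave the grid keeps the agent in place.
    """
    dc, dr = ACTIONS[action]
    c, r = state[0] + dc, state[1] + dr
    next_state = (c, r) if 0 <= c < N_COLS and 0 <= r < N_ROWS else state
    reward = TERMINAL_REWARDS.get(next_state, 0.0)
    return next_state, reward
```

For example, `step((2, 0), "right")` returns `((3, 0), 1.0)`. Note that `step` only looks at the current state and the action, never at the path taken to get there, which is the Markov property in code form.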
Rewards and Policies
As the definition above mentioned, the agent will be trying to maximize ‘some notion of cumulative reward’. In some settings it makes sense to maximize the plain cumulative reward, but generally we also add discounting, trying to maximize the discounted cumulative future reward:
$$R = \sum_{t=0}^{\infty} \gamma^{t} r_t$$
where $\gamma$ is the discount factor (say, 0.9), and $r_t$ is the reward at time step $t$.
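As a small illustration of discounting, the sketch below sums a finite list of rewards, weighting the reward at step t by the discount factor raised to the power t. The function name `discounted_return` and the example reward sequence are made up for this note.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * rewards[t] over a finite episode."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# A reward of 1 that arrives two steps from now is worth about 0.81 today.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))
```

With a discount factor of 1 this reduces to the plain cumulative reward; the smaller the factor, the more the agent prefers rewards that arrive sooner.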
A policy is a way of acting. Denoted as $\pi$, it maps a state to an action. Think of it as a strategy. (This definition applies to deterministic policies, but can easily be extended to stochastic policies as well.)
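One simple way to picture the difference between deterministic and stochastic policies is as lookup tables. In the sketch below, the state names, the action names, and the `act` helper are all hypothetical; only `random.choices` comes from the Python standard library.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "right", "s1": "right", "s2": "up"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"right": 0.8, "up": 0.2},
    "s1": {"right": 0.5, "up": 0.5},
    "s2": {"up": 1.0},
}


def act(policy, state, rng=random):
    """Pick an action for `state` under either kind of policy."""
    choice = policy[state]
    if isinstance(choice, dict):
        # Stochastic case: sample an action according to its probability.
        actions = list(choice)
        weights = [choice[a] for a in actions]
        return rng.choices(actions, weights=weights, k=1)[0]
    # Deterministic case: the table entry is the action itself.
    return choice
```

Calling `act(deterministic_policy, "s0")` always gives `"right"`, while `act(stochastic_policy, "s0")` gives `"right"` most of the time and `"up"` occasionally.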
The optimal policy, $\pi^{*}$, is simply the one that maximizes R.
So how do we find the optimal policy? Value functions help us do that.
There are two types of value functions: state value functions and action value functions. Informally, these give the ‘value’ of being in a given state, or of doing a given action.
Mathematically, the state value function is
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ R \mid s_0 = s \right]$$
the expected (discounted cumulative future) reward from being in s under policy $\pi$.
Similarly, the action value function is
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ R \mid s_0 = s, a_0 = a \right]$$
In other words, it is the expected reward from being in s, doing a, and then following policy $\pi$. Note that both functions depend on the policy.
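To see how the two value functions relate, here is a sketch for a fully deterministic setting, where the expectation collapses to a single trajectory. The four-state corridor, the reward of +1 for entering the last state, and the function names are assumptions chosen to keep the example tiny.

```python
GAMMA = 0.9
TERMINAL = 3  # hypothetical corridor with states 0, 1, 2, 3; state 3 ends the episode


def step(state, action):
    """Deterministic move left or right; entering the terminal state pays +1."""
    move = 1 if action == "right" else -1
    next_state = min(max(state + move, 0), TERMINAL)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def q_value(state, action, policy, max_steps=100):
    """Action value: take `action` once, then follow `policy`."""
    next_state, reward = step(state, action)
    return reward + GAMMA * state_value(next_state, policy, max_steps - 1)


def state_value(state, policy, max_steps=100):
    """State value: discounted return from following `policy` starting in `state`."""
    if state == TERMINAL or max_steps <= 0:
        return 0.0  # no further reward after the episode ends (or after the cutoff)
    return q_value(state, policy[state], policy, max_steps)


always_right = {0: "right", 1: "right", 2: "right"}
print(state_value(0, always_right))        # value of state 0 under this policy
print(q_value(1, "left", always_right))    # step left first, then follow the policy
```

Swapping in a different policy changes both numbers, which is the sense in which the value functions depend on the policy. The `max_steps` cutoff is only there so that a policy that never reaches the terminal state still returns.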
A Simple Bellman Equation
So how do we compute these value functions? That’s what the Bellman equations are for. They allow us to express the value of a state (or action) in terms of the values of other states (or actions), which lets us iteratively solve for the optimal policy.
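In a deterministic environment, the Bellman equation for the state value function is
$$V^{*}(s) = \max_{a} \left[ r(s, a, s') + \gamma V^{*}(s') \right]$$
where $V^{*}$ is the value function under the optimal policy, $r(s, a, s')$ is the reward for taking action a in state s and ending up in s’, $\gamma$ is the discount factor, and $V^{*}(s')$ is the value of state s’.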
In other words, we take the action that maximizes our reward plus the discounted value of the state we end up in.
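A short worked example may help. Write $V^{*}$ for the optimal value, $r$ for the immediate reward and $\gamma$ for the discount factor, and let $s_1$ be a square next to a +1 terminal square, $s_2$ a square one step further away, and $s_3$ one step further still. The numbers below rest on assumptions: $\gamma = 0.9$, landing on the terminal square pays 1, every other move pays 0, a terminal state has value 0, and moving toward the terminal square is the best available action.

```latex
% Assumptions: gamma = 0.9; landing on the +1 terminal square pays r = 1;
% all other moves pay r = 0; a terminal state has value 0.
\begin{align*}
V^{*}(s_1) &= \max_{a}\left[ r(s_1, a, s') + \gamma V^{*}(s') \right] = 1 + 0.9 \cdot 0 = 1 \\
V^{*}(s_2) &= 0 + 0.9 \cdot V^{*}(s_1) = 0.9 \\
V^{*}(s_3) &= 0 + 0.9 \cdot V^{*}(s_2) = 0.81
\end{align*}
```

Each step away from the reward multiplies the value by the discount factor, so values fall off geometrically with distance.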
Let’s compute the (optimal) state value function for our toy example. Working iteratively backward from the terminal states (using the discount factor $\gamma$), we can fill in our grid.
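The same backward fill-in can be sketched in Python as repeated sweeps of the Bellman update until nothing changes. This is an illustration under assumptions, not the course's code: the 4 x 3 grid with no walls, the discount factor of 0.9, the rule that bumping into the edge leaves the agent in place, and the convention that terminal squares have value 0 (the reward is paid on landing) are all choices made here.

```python
GAMMA = 0.9                      # assumed discount factor
N_COLS, N_ROWS = 4, 3            # hypothetical grid size, row 0 at the top
MOVES = [(0, -1), (0, 1), (-1, 0), (1, 0)]          # up, down, left, right
TERMINAL_REWARDS = {(3, 0): 1.0, (3, 1): -1.0}      # +1 top right, -1 below it


def step(state, move):
    """Deterministic transition; moving off the grid keeps the agent in place."""
    c, r = state[0] + move[0], state[1] + move[1]
    next_state = (c, r) if 0 <= c < N_COLS and 0 <= r < N_ROWS else state
    return next_state, TERMINAL_REWARDS.get(next_state, 0.0)


def value_iteration(tol=1e-9):
    """Sweep the Bellman update over all squares until values stop changing."""
    values = {(c, r): 0.0 for c in range(N_COLS) for r in range(N_ROWS)}
    while True:
        delta = 0.0
        for state in values:
            if state in TERMINAL_REWARDS:
                continue  # the game is over here, so no future reward
            outcomes = [step(state, move) for move in MOVES]
            best = max(reward + GAMMA * values[nxt] for nxt, reward in outcomes)
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            return values


values = value_iteration()
for r in range(N_ROWS):
    print(" ".join(f"{values[(c, r)]:5.2f}" for c in range(N_COLS)))
```

Under these assumptions the square to the left of the +1 terminal gets value 1, and each additional step away multiplies the value by 0.9. The terminal squares print as 0 here because their reward is counted when the agent lands on them; a diagram that writes +1 and -1 inside those squares is showing the same information in a different convention.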
Obviously, this kind of brute-force approach won’t work in more complicated situations where we can’t go over all state-action combinations. It’s still a useful way to organize our thinking.