RL Notes 2: Policy and Value Iteration

- 9 mins

Previous week’s notes

(You can find the notebook containing the code here, and a tutorial I found really useful is here)

This week was all about two tools to solve MDPs: policy iteration and value iteration. Both of them are applications of the Bellman equation. Here I will try to explain both without any math, and without any Bellman equations.

Frozen Lake

The example I’ll use is the frozen lake environment of Openai.

env = gym.make('FrozenLake-v0')


It is a similar gridworld to the one covered last week. We start at the top left (at the X), and try to move to the orange square in the bottom right, where we get our reward of + 1 and the game ends. If we move to a black square, we fall into a hole: the game ends without a reward.

There is one important complication: as we move on the blue tiles, we might slip to the side, and end up in a tile different from the one we intended. There is a 33% chance that we move correctly, and a 33% chance for landing in either side. We never slip to the opposite direction of our action though.

As mentioned above, there two ways to solve this game.

Value Iteration

Our goal here is to find the state value function under optimal policy. That means each square will be assigned a number, representing its ‘value’. Once we have that, finding the optimal policy will be easy.

How do we find the state value function? We work iteratively. We first initialize all values to 0. We then find then optimal action for each state. The optimal action is the one that maximizes the expected return (the immediate reward + discounted future value). Once we have that, we assign to the state the maximum expected return (among all actions). We iterate until the state value function converges.

To make our agent prefer shorter paths (i.e. finish the game earlier than later), we set a discount factor of 99%.

def getExpectedActionValue(s, a, V, env, g): 
    '''Gives expected return of action "a" in state "s" (immediate reward + discounted future value)'''
    return sum(prob*(reward + g*V[state]) for prob, state, reward, _ in env.env.P[s][a])

def getOptimalFromState(s, V, env, g):
    '''Gives the best possible return in state "s" from all actions'''
    vals = np.array([getExpectedActionValue(s, a, V, env, g) for a in env.env.P[s].keys()])
    return (vals.max(), vals.argmax())

def getNextV(V, env, g, pi = None):
    '''Gives the next iteration of the value function.'''
    res = np.zeros(V.shape)
    for s in range(env.env.nS):
        if pi is None:
            res[s], _ = getOptimalFromState(s, V, env, g)
            res[s] = getExpectedActionValue(s, pi[s], V, env, g)
    return res

For example, here are the first three iterations of our state value function:


Note how the values ‘spread’ from the bottom right corner.

We do this process until our value function converges.

def ValueIter(env, g, pi = None, eps = 1e-10, maxiter = 10000):
    '''Applies value iteration if pi = None. If pi (policy) is given, it applies policy evaluation'''
    V = np.zeros(env.env.nS); flag_success = False 
    for i in range(maxiter):
        V_old = V.copy()
        V = getNextV(V, env, g, pi)
        if np.max(np.abs(V - V_old)) < eps:
            flag_success = True
    assert flag_success
    return (V, i)


Let’s do a check. Suppose we are on third row, second column (value 0.64). What’s our best action? To move down, obviously. With 33% we end up left or right. What is our return? . It checks out.

Now, let’s get our optimal policy. As a reminder, a policy is an action for every square. It’s easy to find it: for every square let’s look at what would maximize our return according to the state value function we derived.

def getPath(V, env, g):
    '''Gets the optimal path according to state-value function V.'''
    path = np.zeros(env.env.nS)
    for s in range(env.env.nS):
        _, path[s] = getOptimalFromState(s, V, env, g)
    return path


(For those wondering why we move ‘left’ at the start: because if we do, the only movement we can take is ‘slipping’ down.)

Policy Iteration

A second way to solve MDPs is policy iteration.

We start with an arbitrary policy. In this example, I’ll choose to always move right.

Next we do a policy evaluation step: calculating the state value function under the given policy. As a reminder, the value of a state under a policy is the immediate reward + discounted value of the state we end up in. We iteratively update our value function until it converges, just as in the previous section. The key difference: we choose our action based on the given policy (not the optimal one).

Policy evaluation gives us a state value function. For example, here is the value function for always moving to the right:


Now we change our policy according to this value function. That gives us a new policy.


To summarize, our process looks like this:

initial policy -> policy evaluation -> policy improvement -> policy evaluation -> policy improvement -> …

We keep on repeating this process until convergence.

def policyIter(env, g, maxiter = 1000):
    '''Finds optimal policy using policy iteration.'''
    # Start by always going to the right
    pi = np.array([2 for i in range(env.env.nS)])
    flag_success = False
    for i in range(maxiter):
        pi_old = pi.copy()
        V, _ = ValueIter(env, g, pi)  # Policy evaluation step
        pi = getPath(V, env, g)  # Policy Improvement step
        if np.all(pi == pi_old):
            flag_success = True
    assert flag_success
    return (pi, V)  


Comparing Value and Policy Iteration

Do they give the same policy? Yes, they do.

What about runtime? In our example, policy iteration is somewhat faster: 170 ms vs 360 ms.

What reward do would our policy get in the game? Since not all games are won, the outcome is random. On average, our reward is about 0.75, which is equivalent to getting to the target cell about 75% of the time.

def runPolicy(policy, maxiter = 1000, do_render = True, env_name = 'FrozenLake-v0'):
    temp_env = gym.make(env_name)
    s = temp_env.reset()
    cum_reward = 0
    for k in range(maxiter):
        if do_render: 
        s, r, done, _ = temp_env.step(int(policy[s]))
        cum_reward += r
        if done:
    if do_render: 
    return cum_reward

def simAvgReward(policy, iterations = 1000, env_name = 'FrozenLake-v0'):
    totalRewards = 0
    for i in range(iterations):
        totalRewards += runPolicy(policy, do_render=False, env_name=env_name)
    print(f'Average Return from {iterations} simulations is {totalRewards / iterations}')
    return None

Average Return from 1000 simulations is 0.754
Krisztian Kovacs

Krisztian Kovacs

Learning to teach computers to learn.

comments powered by Disqus
rss facebook twitter github gitlab youtube mail spotify lastfm instagram linkedin google google-plus pinterest medium vimeo stackoverflow reddit quora quora