# RL Notes 3: Exploration vs Exploitation

(You can find the notebook containing the code here.)

Until now, we used environments where we had perfect information: we knew the rewards and the transition probabilities. In those settings, we could use policy or value iteration to find the optimal policy.

Of course, the interesting cases are when we don’t have any information whatsoever. We don’t know the transition probabilities, and we may not even know the rewards.

Let’s briefly review our setting. We’re playing a game. There are certain states we can be in, and actions we can take. We don’t know the result of an action. Sometimes we get a reward. We don’t know when this happens, but we want more of it.

How do we solve situations like this? By repeatedly simulating the environment and using the results to estimate the unknown quantities (the fancy name for this technique is Monte Carlo methods).

I discuss some concrete strategies below, but one theme will be common to them all:

### The exploration-exploitation dilemma

At any point we have two options available to us: we can **explore** the possibilities we have, or we can **exploit** the policy we judged so far to be the best. Most strategies will start with exploration in the beginning so that we can gather information. As we do so, we gradually shift towards exploitation of the strategy we judged to be the best.

If we keep on exploring for too long, we waste valuable opportunities to collect rewards.

If we exploit too early, we may be locked into a sub-optimal solution – had we gathered more information, we could have discovered a better policy.

### Strategies illustrated here

In this post, I will discuss the following, very common policies:

- Greedy
- Explore First, then Greedy,
- Epsilon-Greedy
- Optimistic Initialization
- Optimism in the Face of Uncertainty
- Thompson Sampling

### The Q Function

Many policies will refer in one way or another to the Q function. This function maps state-action combinations to average returns. It is basically a big lookup table.
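
As a minimal sketch of this idea (the sizes and names here are hypothetical, not the notebook's actual code), the table can be a 2-D array of running average returns:

```
import numpy as np

nS, nA = 1, 38          # hypothetical: one state, 38 actions (numbers 0-36 plus 'leave')
Q = np.zeros((nS, nA))  # running average return per state-action pair
N = np.zeros((nS, nA))  # visit counts

def update_q(s, a, G):
    # Incrementally update the average return of (s, a) with the new return G.
    N[s, a] += 1
    Q[s, a] += (G - Q[s, a]) / N[s, a]

update_q(0, 5, 1.0)
update_q(0, 5, -1.0)
# Q[0, 5] now holds the average of the two returns: 0.0
```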

## The Environment: Cheating Roulette

### OpenAI’s simple roulette

OpenAI Gym contains a simplified roulette environment. The rules are as follows:

- You have a roulette wheel with numbers 0-36.
- Every round you can select a number, or you can leave the roulette table, ending the game.
- If you select 0, and it comes up, you receive $36. If it doesn’t come up, you lose $1.
- If you select a non-zero number, you receive $1 if the parity of your number matches the parity of the roll (0 doesn’t count). Otherwise, you lose $1. Example: if you select 5 and 17 is rolled, you receive $1.
- Each number has the same probability of showing up.

Simple math will tell you that this roulette has a negative expected value (just as in real life): parity bets lose money on average, and even betting 0 is only break-even. The best the agent can do is simply leave the table (again, as in real life).
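
To make the "simple math" concrete, here is a quick check of the per-round expected values implied by the rules above (my own sketch, not the notebook's code):

```
# Fair wheel: 37 numbers (0-36), each with probability 1/37.
p_zero = 1 / 37

# Betting 0: win $36 with probability 1/37, lose $1 otherwise.
ev_zero = 36 * p_zero - 1 * (1 - p_zero)   # exactly break-even

# Betting a non-zero number: 18 of the 36 other numbers match your
# parity and pay $1; the other 18, and 0 itself, lose $1.
ev_parity = 1 * (18 / 37) - 1 * (19 / 37)  # about -0.027 per round
```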

### Cheating roulette modification

That’s not a very interesting setting, however, so we will modify this roulette. In particular, we will change it so that

- 0 has a 0.033 probability of showing up (instead of ~0.027).
- Every other number is equally likely.
- The payoff is the same as in the original game.

The effect of this change is that playing **0** has now positive expected value. However, it is extremely difficult to find out about this feature. After all, even with the modification, 0 will only be rolled once in 30 rounds (on average).

Moreover, without knowing the **0** cheat (i.e., playing numbers at random), the expected value is still negative. That makes leaving the table still attractive, although in this case suboptimal.
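
A quick sanity check of both claims, using the probabilities from the modification above (again a sketch, not the notebook's code):

```
p_zero = 0.033               # cheating wheel
p_other = (1 - p_zero) / 36  # remaining probability, split evenly over 1-36

# Betting 0: win $36 when 0 comes up, lose $1 otherwise.
ev_zero = 36 * p_zero - (1 - p_zero)
# 100 * ev_zero is about 22.1: roughly $22 per 100-round game.

# Betting a non-zero number: 18 matching-parity numbers win $1;
# the 18 opposite-parity numbers and 0 itself lose $1.
ev_parity = 18 * p_other - (18 * p_other + p_zero)
# 100 * ev_parity is -3.3: non-zero bets still lose money on average.
```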

Basically, I have set up an environment where it is easy for an agent to get trapped in a sub-optimal solution (leaving the table).

### The Simulations

I will consider a game of roulette to be 100 rounds. The aim is to maximize the money the agent makes in these 100 rounds.

Each strategy will play 10 000 games (each 100 rounds). After each game, the agent will be able to update its policy. (Why not update after each round? Because that would increase computation time, as some strategies rely on the entire history of the game.)
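
The overall simulation loop looks roughly like this sketch (the `Policy`/`env` interface mirrors the snippets below, but the `update` hook and the loop itself are my reconstruction, not the notebook's exact code):

```
def run_policy(policy, env, n_games=10_000, n_rounds=100):
    # Play n_games games of n_rounds each; let the policy update after each game.
    profits = []
    for game in range(n_games):
        env.reset()
        total = 0
        for _ in range(n_rounds):
            action = policy.pi()
            _, reward, done, _ = env.step(action)
            total += reward
            if done:              # action 37 leaves the table, ending the game
                break
        policy.update(total)      # hypothetical per-game update hook
        profits.append(total)
    return profits
```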

## The Policies

### Basic Benchmark Policies

Before we discuss the strategies listed above, let’s have some useful benchmarks.

The first option is to always leave the table; never play. Obviously, that will have a return of 0.

```
class DontPlay(Policy):
    def pi(self): return 37  # This action is leaving the table
```

More interestingly, what if we play in ‘God mode’, knowing the **0** cheat? In that case, we will earn, on average, about $22 per game (per 100 rounds).

Below is a plot of the money we earn in each simulated game. Note that even knowing the cheat, we still lose money in many games, as we need at least three zeros per game (3 × $36 = $108 against at most $97 of losses) to turn a profit. Our positive average is due to the few games where zeros came up many times.

```
class OptimalPlay(Policy):
    def pi(self): return 0  # Always bet on 0
```

Just to check, what happens if we always select a random number? The return is somewhat negative: we lose, on average, about $1 per game (per 100 rounds).

```
class RandPolicy(Policy):
    def pi(self): return self.env.action_space.sample()
```

### Greedy Policy

Our simplest policy is a greedy policy. A greedy policy simply plays the number with the highest Q value (the highest historical average return). Generally, it’s not a good choice, as it may lock into a sub-optimal solution. (There is a chance that the greedy policy stumbles upon the optimal solution, but not a reliably large one.)

Benefits

- Easy to implement, easy to explain.
- No hyperparameters to tune.

Drawbacks

- No exploration, often gets stuck at sub-optimal play.

```
class Greedy(Policy):
    def pi(self): return self.getMaxQ()
```

As we can see, for the first couple of rounds our policy followed some random choices, as some numbers, just by chance, got positive rewards. However, very quickly our policy concludes that it’s best to leave the game - and earns 0 rewards afterward.

Below I plot the actions taken in the first and last 1000 rounds.

## Explore First Then Greedy

Since our greedy policy had the drawback of not doing enough exploration, let’s add explicit exploration rounds.

We will let our policy follow random actions for the first **k** games. After **k** games, it will follow a greedy policy.

Benefits

- Simple to implement and explain.
- Includes both exploration and exploitation.

Drawbacks

- Need to choose k manually.
- It is a priori unclear how much exploration/exploitation one should do in a given game.
- Exploration isn’t done intelligently, it just chooses random actions.

```
class ExpFirst(Policy):
    def __init__(self, exp_num):
        super().__init__()
        self.exp_num = exp_num

    def pi(self):
        if self.episode <= self.exp_num:
            return self.env.action_space.sample()
        else:
            return self.getMaxQ()
```

In this example, we explore for the first 1000 games. In this particular case, the policy managed to find the best strategy afterwards. Note, however, that there is still luck involved, and there is no guarantee that if we replay the same policy it will find the optimum again (it depends on how often 0 lands in the first 1000 games).

The actions taken in the first and last 1000 rounds:

Similar results with 2000 games of exploration.

Another pitfall: exploring too long. Here we explore for 5000 games (half the time) before exploiting. This makes us more likely to find the optimal policy, but we reduce our exploitation opportunities: our average profit is little more than half of that in the first case.

## Epsilon-Greedy

Another way to modify the pure greedy strategy is to add some noise to it. Specifically, follow the greedy strategy with probability 1 − ε, and take a random action with probability ε.

Over time, we want to decrease ε, so that we gradually shift from exploration to exploitation. One way to parameterize:

ε = k / n

where n is the number of games we played so far, and k is a hyperparameter.

Benefits

- Still relatively simple to implement and explain.
- Even if it gets stuck at a suboptimal solution, it has the chance to ‘break out’.
- May adapt to changing environments.

Drawbacks

- Unintelligent exploration
- Need to choose hyper-parameters well.

```
class EpsGreedy(Policy):
    def __init__(self, k):
        super().__init__()
        self.k = k

    def pi(self):
        epsilon = self.k / max(self.episode, 1)
        if np.random.rand() <= epsilon:
            return self.env.action_space.sample()
        else:
            return self.getMaxQ()
```

A demonstration of the luck involved: even though we explore twice as long as in the previous case, we were unlucky and didn’t find the optimal solution.

More exploration. We find the optimum, but reduced our profit by exploring too long.

## Optimistic Initialization

This is not so much a stand-alone strategy, as it is a hack on our greedy policy. We set the initial Q values to be unrealistically high, then follow a greedy strategy. The effect of this hack is that our policy will try out all available actions first, as our initialized Q values are higher than what the policy can achieve.

While strictly better than a pure greedy strategy, it otherwise suffers from the same drawbacks.

Note however that this hack is available for almost any policy, not just greedy. If we want to explore all actions before committing to our policy, we initialize all Q’s to be unrealistically high. If the number of possible actions isn’t too high, this is a reasonable thing to do.

```
class optInitPolicy(Policy):
    def __init__(self):
        super().__init__()
        self.Q = 100 * np.ones((self.nS, self.nA))  # unrealistically high initial Q values

    def pi(self): return self.getMaxQ()
```

In this case, the hack didn’t make a difference relative to the greedy policy.

## Optimism in the Face of Uncertainty

Our policies so far haven’t implemented a clever way to explore our action space. They just followed random actions.

We can do better. To give some intuition, consider the following two action histories:

- Number 7 yielded $1 once, and $-1 twice.
- Number 8 yielded $1 100 times, and $-1 200 times.

The average return is the same in both cases (−0.33). However, we’re a lot more confident that number 8 has a negative expected reward, since we played it 100 times as often. Number 7 could just have had a few unlucky spins.

Thus, we will favor uncertain actions. In the above example, we favor 7. If 7 is indeed a bad number, successive trials will prove so.
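
To see how an uncertainty bonus captures this intuition, here is the Hoeffding-style bonus used by the policy below, evaluated for both histories (the choice p = 0.05 is mine, for illustration):

```
import numpy as np

def hoeffding_bonus(n, p=0.05):
    # With probability 1 - p, the true mean lies within this distance
    # of the empirical mean (Hoeffding bound for n samples in [0, 1]).
    return np.sqrt(-np.log(p) / (2 * n))

q = -1 / 3                      # empirical average return, same for both numbers
bonus_7 = hoeffding_bonus(3)    # number 7: played 3 times, bonus ~0.71
bonus_8 = hoeffding_bonus(300)  # number 8: played 300 times, bonus ~0.07

# q + bonus_7 is about +0.37, while q + bonus_8 is about -0.26:
# the policy prefers the uncertain number 7.
```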

We select a **p** such that an action not chosen is at most **p** percent likely to be the optimal one. We then use Hoeffding’s Inequality and choose our action according to

argmax_a [ Q(s, a) + H(s, a) ]

where H is a measure of uncertainty. In the case of Bernoulli random variables, it is equal to

H(s, a) = sqrt( −ln(p) / (2 n(s, a)) )

where n(s, a) is the number of times the given state-action pair was visited.

In our case, the formula doesn’t quite work, as the payoff from betting **0** isn’t a [0, 1] (or [−1, 1]) variable. Our policy will reduce the uncertainty in the probability that a number appears, but it won’t reduce the uncertainty in the expected value (which is what we care about).

Usually, this strategy is not a bad approach (unless the payoff distribution is extremely skewed, as in this case).

Benefits

- Explores the action space intelligently, by favoring actions with uncertain payoffs.
- Hyperparameter has a meaningful interpretation.

Drawbacks

- Formula is only exact for [0, 1] binary variables.
- Doesn’t handle highly skewed payoffs well.

```
class optUnc(Policy):
    def __init__(self, p = 0.05):
        super().__init__()
        self.H = np.ones(self.Q.shape)  # uncertainty bonus per state-action pair
        self.H[0, 37] = 0               # leaving the table carries no uncertainty
        self.p = p

    def pi(self):
        Q = self.Q[self.state, :]
        H = self.H[self.state, :]
        return np.argmax(Q + H)

    def updateH(self):
        for s in range(self.nS):
            for a in range(self.nA):
                if self.returns[(s, a)] != [] and a != 37:
                    n = len(self.returns[(s, a)])
                    self.H[s, a] = np.sqrt(-np.log(self.p) / (2 * n))

    def callOnEnd(self): self.updateH()
```

Voila: after only a few games of exploring, we find the best policy.

## Thompson Sampling

Here is another way to intelligently explore: let’s sample actions by the probability that they are optimal. This way, actions that had good historical returns will be favored, but we will also occasionally sample actions that didn’t have good returns.

Specifically, we will assume a prior distribution for the expected payoff for each number. In this example, we use a Beta distribution with parameters (1, 1). Then, as we collect rewards, we update these parameters.

In each round, we take a sample from the distribution of each action. We choose the action with the highest sampled value.
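
As a toy illustration of this sampling step (the history below is made up), with a Beta posterior per action:

```
import numpy as np

rng = np.random.default_rng(0)

# Beta(1, 1) priors updated with an imaginary history:
# action A won 3 times and lost once; action B won once and lost 3 times.
alphas = np.array([1 + 3, 1 + 1])
betas  = np.array([1 + 1, 1 + 3])

# One Thompson step: sample each action's posterior, play the argmax.
samples = rng.beta(alphas, betas)
action = np.argmax(samples)   # usually 0 (action A), but not always
```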

Benefits

- Intelligently explores the action space.

Drawbacks

- Doesn’t handle data skew well (at least not with a beta prior).

```
class ThompSamp(Policy):
    def __init__(self):
        super().__init__()
        self.alphas = np.ones(self.Q.shape)
        self.betas = np.ones(self.Q.shape)

    def pi(self):
        s = self.state
        n = self.alphas[s, :].shape[0]
        sampl = np.ones(n) / 2  # the 'leave' action keeps a fixed sample of 0.5
        for i in range(n - 1):
            sampl[i] = np.random.beta(self.alphas[s, i], self.betas[s, i])
        self.current_action = np.argmax(sampl)
        return self.current_action

    def callOnEnd(self):
        for state, action, G in self.states_actions_returns:
            if G > 0:
                self.alphas[state, action] += G
            elif G < 0:
                self.betas[state, action] += -G
```

This graph is interesting. Only towards the second half does the agent swing to the optimal policy. The likely reason: until then, zeros didn’t have a good track record (remember, they only come up once in 30 rounds on average), which made them less likely to be selected.

However, thanks to (weighted) random sampling, every now and then we did select a zero. And one success caused a major update to the distribution weights. Zeros got selected more and more often, and a positive cascade built up.