<p>Krisztian Kovacs (krisztiankovacs.com): Learning to teach computers to learn.</p>
<h1>RL Notes 7: Genetic Algorithms (2018-11-23)</h1>
<p><a href="/rl-notes-6-experience-replay/">Previous week’s notes</a></p>
<p>(You can find the notebook teaching a 2D robot to walk <a href="https://github.com/kk1694/rl_course/blob/master/Week_7.ipynb">here</a>)</p>
<p>The previous couple of posts were about optimizing RL agents, whether with <a href="/rl-notes-5-augmented-random-search/">augmented random search</a> or <a href="/rl-notes-6-experience-replay/">an experience replay buffer</a>. Now let’s add one more method to the list: genetic algorithms.</p>
<p>Before explaining genetic algorithms, there is one other major change. Last week I was learning the Q (action-value) function, then acting based on that. However, if the action space is vast (or infinite), that method wouldn’t work. After all, we would need to learn an action value for every action! Instead, I’ll be learning the policy, the mapping from states to actions, directly. For each state, I want to output the best action.</p>
<p>So what are genetic algorithms? They take inspiration from natural selection. We start with a population of different parameters, select the best performing ones (discard the rest), create cross-overs (children), mutate a subset of children, and repeat.</p>
<p>The main benefit of genetic algorithms is that they can be used for any optimization problem. We don’t need derivatives. Our objective function doesn’t even have to be continuous! Therefore, we can use this method for both parameter optimization and hyperparameter search!</p>
<p>The specific steps:</p>
<ol>
<li>Start with an initial population of different settings.</li>
<li>Calculate the fitness of each member of the population.</li>
<li>Keep the top n performing members, discard the rest.</li>
<li>Optionally, create cross-overs between selected members. Sample A and B from the population, and randomly mix their parameters to create a cross-over (their ‘child’). Create k cross-overs, and add them to the population.</li>
<li>Mutate some members. To mutate a member, add random noise to its parameters.</li>
<li>Repeat from step 2.</li>
</ol>
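<p>The steps above can be sketched in a few lines of numpy. The fitness function below is a made-up stand-in for simulating an episode and summing its rewards, and every hyper-parameter value is illustrative, not taken from the notebook:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(params):
    # Stand-in for an episode's total reward; the optimum is at zero.
    return -np.sum(params ** 2)

def evolve(pop, n_top=5, n_children=10, mutate_prob=0.3, noise=0.1):
    # Steps 2-3: score every member and keep the top performers.
    scores = np.array([fitness(p) for p in pop])
    top = [pop[i] for i in np.argsort(scores)[::-1][:n_top]]
    # Step 4: cross-overs mix the parameters of two random parents.
    children = []
    for _ in range(n_children):
        a, b = rng.choice(len(top), size=2, replace=False)
        mask = rng.random(top[0].shape) < 0.5
        children.append(np.where(mask, top[a], top[b]))
    new_pop = top + children
    # Step 5: mutate some members by adding random noise.
    for i in range(len(new_pop)):
        if rng.random() < mutate_prob:
            new_pop[i] = new_pop[i] + rng.normal(0, noise, new_pop[i].shape)
    return new_pop

pop = [rng.normal(0, 1, 4) for _ in range(15)]  # Step 1: initial population
for _ in range(50):                             # Step 6: repeat
    pop = evolve(pop)
best = max(pop, key=fitness)
```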
<p>Optionally, at step 3 we can also select some non-top-performing members. Including them will create more interesting cross-overs.</p>
<h1>RL Notes 6: Experience Replay (2018-11-20)</h1>
<p><a href="/rl-notes-5-augmented-random-search/">Previous week’s notes</a></p>
<p>(You can find the notebook learning a simple 2D game <a href="https://github.com/kk1694/rl_course/blob/master/week6.ipynb">here</a>)</p>
<p>Our next step in making an RL agent more intelligent: let’s replace the shallow network that calculates the Q function with a deep one. Not a very surprising move.</p>
<p>We could update the weights of such a network as we did <a href="/rl-notes-5-augmented-random-search/">last week</a> using augmented random search. Let’s discuss a different way.</p>
<p>We can transform our RL problem into a (quasi) supervised learning problem. Suppose that for each state and action combination we knew the Q (action-value) function. Then we could create a neural network that connects the inputs (the state) to the output (the Q function for every action). After defining a proper loss function (say, MSE) and running gradient descent for a few epochs, we have basically solved our RL problem. After all, we have fitted a Q function that gives the value of each action in each state. Thus, we know how to act: choose the action that maximizes Q!</p>
<p>Of course, it’s not that simple, as we don’t have a dependent (Q) variable for such a supervised learning task. But just as with our <a href="/rl-notes-2/">grid-world</a> example early on, we can work iteratively.</p>
<p>We start with a randomly initialized Q function. We play a couple of rounds with that Q function, and save the game history: the states, actions, rewards, next states at each point of time. Now we set up our supervised learning problem. Our independent variables are our states and actions. Our dependent variable is constructed: it is the immediate reward, plus the discounted maximum Q value of the next state.</p>
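<p>To make the constructed dependent variable concrete, here is a minimal sketch; the tabular Q array, the discount factor, and the tiny hand-made batch are placeholders for illustration:</p>

```python
import numpy as np

gamma = 0.9  # discount factor (illustrative value)

# Tabular stand-in for the Q network: Q[state] gives a value per action.
Q = np.zeros((5, 2))

# A tiny replay buffer of (state, action, reward, next_state) transitions.
batch = [(0, 1, 1.0, 2), (2, 0, 0.0, 3)]
rewards = np.array([t[2] for t in batch])
next_states = np.array([t[3] for t in batch])

# Dependent variable: immediate reward plus the discounted
# maximum Q value of the next state.
targets = rewards + gamma * Q[next_states].max(axis=1)
```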
<p>Our Q function is thus doing two jobs. It is used for evaluating the next state, thus providing the dependent variable for our supervised learning problem. And it is also the object that is trained. Surprisingly, such an iterative strategy works, and our Q function converges. For stability, it is better to keep two separate copies of the same network: one for evaluation, one for training. We then periodically copy the weights we learned over to our evaluation network.</p>
<p>Note that we should also do some exploration, and not just follow our Q function. An <a href="/rl-notes-3-exporation-vs-exploitation/">epsilon-greedy</a> strategy works: most of the time, we do the (so-far-judged-to-be) optimal action, but every now and then we take a random action instead.</p>
<p>Step-by-step our strategy is the following:</p>
<ol>
<li>Initialize Q with random weights.</li>
<li>Make a copy of Q to serve as evaluation, call it <script type="math/tex">Q'</script>.</li>
<li>Play the game for a few rounds using Q and an epsilon-greedy strategy. Save (state, action, reward, next state) in a buffer.</li>
<li>Randomly sample mini-batches from the buffer. Construct the target to be <script type="math/tex">y_t = r_t + \gamma \max_{a} Q'(s_{t+1}, a)</script></li>
<li>Using a loss function and gradient descent, update Q.</li>
<li>Periodically, set <script type="math/tex">Q'</script> equal to <script type="math/tex">Q</script>.</li>
</ol>
<h1>RL Notes 5: Augmented Random Search (2018-11-08)</h1>
<p><a href="/rl-notes-4-temporal-differences/">Previous week’s notes</a></p>
<p>(You can find the notebook teaching a 2D robot to walk <a href="https://github.com/kk1694/rl_course/blob/master/Midterm.ipynb">here</a>)</p>
<p>Suppose we want to teach a robot how to walk. Each time-step, we have to tell it how much to rotate each joint, with what velocity, etc. In other words, we have to give it a vector that controls its joint movements. Each moment we also receive some information from the environment: where our robot is at, what speed it is going, etc.</p>
<p>A shallow neural net is a simple architecture that maps the input vector to the necessary outputs. However, we have to choose appropriate weights.</p>
<p>Normally we train a neural net with gradient descent. But gradient descent may not always be possible: we might not have a gradient of our loss function, or it may simply be computationally expensive. Augmented random search is an alternative.</p>
<p>The basic algorithm for augmented random search is similar to finite differences:</p>
<ol>
<li>Create a random perturbation <script type="math/tex">\delta</script> of the same shape as our parameter matrix <script type="math/tex">\theta</script> (small positive or negative random amounts).</li>
<li>Make two copies of our parameter matrix, one in which we add <script type="math/tex">\delta</script>, one in which we subtract it (resulting in <script type="math/tex">\theta^+</script> and <script type="math/tex">\theta^-</script>).</li>
<li>Simulate our agent with these two new matrices, and record the rewards (<script type="math/tex">r^+</script> and <script type="math/tex">r^-</script>).</li>
<li>Update our parameter matrix by <script type="math/tex">\theta_{new} = \theta + \alpha(r^+ - r^-)\delta</script>, where <script type="math/tex">\alpha</script> is the learning rate.</li>
</ol>
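<p>The four steps can be sketched as follows. The reward function here is a made-up stand-in for simulating the agent, and the learning rate and noise scale are illustrative values:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5  # learning rate (illustrative value)

def reward(theta):
    # Stand-in for running an episode with parameters theta;
    # the best policy here is theta equal to the target vector.
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((theta - target) ** 2)

theta = np.zeros(3)
for _ in range(500):
    delta = rng.normal(0, 0.1, theta.shape)  # step 1: random perturbation
    r_plus = reward(theta + delta)           # steps 2-3: simulate theta + delta
    r_minus = reward(theta - delta)          # ... and theta - delta
    theta = theta + alpha * (r_plus - r_minus) * delta  # step 4: update
```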
<p>The algorithm is quite simple. To make it more effective, we can take a couple of additional steps:</p>
<ul>
<li>Normalizing inputs before feeding them into the neural net.</li>
<li>Instead of simulating one perturbation, simulating <em>n</em>, then keeping the top <em>k</em> performing ones and averaging their updates.</li>
</ul>
<h1>Tips and Tricks from Andrew Ng’s class (2018-10-18)</h1>
<p>Andrew Ng mentioned lots of tips and tricks during his <a href="/review-blitzing-through-the-coursera-dl-specialization/">Coursera class</a>. Below I list a miscellaneous subset. Many are not even DL specific. They are in no particular order; some people might find them obvious (especially with hindsight). I found them helpful and interesting.</p>
<h3 id="random-search-beats-grid-search">Random Search beats Grid Search</h3>
<p>We often have to optimize multiple hyperparameters. For example, suppose you have a standard feed-forward neural net, with 1 hidden layer. You are trying to decide on the number of hidden units, and the amount of dropout. You have resources to fit multiple models, and compare their results on your validation set.</p>
<p>One way to search the hyperparameter space is grid-search: run all combinations of a discrete subset of parameters. For example, you might consider a network with 50, 100, 150, and 200 hidden units; and dropout rates of 0.1, 0.2, … 0.5. Take all combinations of these values, and you get 20 different settings to run your network with.</p>
<p>Or you can just take 20 random samples from the relevant parameter ranges: 50-200 for hidden units, and 0.1-0.5 for dropout.</p>
<p>Why is the second approach better? Because not all hyperparameters have the same impact. Suppose that in this case dropout doesn’t make a big difference, but the number of hidden units matters. With grid-search, you have 20 runs, but only sample 4 distinct values of the parameter that matters. With random search, you take 20 samples for the number of hidden units as well.</p>
<p>This effect gets even more pronounced in higher dimensions. The number of runs required for grid-search explodes. Why waste runtime on parameters that might not even matter?</p>
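<p>A sketch of the random-search setup, using the ranges from the example above (the actual training function is left out):</p>

```python
import random

random.seed(0)

def sample_config():
    # One random draw per hyperparameter, instead of a fixed grid.
    return {
        "hidden_units": random.randint(50, 200),
        "dropout": random.uniform(0.1, 0.5),
    }

# 20 runs now cover up to 20 distinct values of *each* hyperparameter;
# a 4 x 5 grid would only ever try 4 distinct hidden-unit counts.
configs = [sample_config() for _ in range(20)]
```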
<h3 id="multiple-minima-are-not-an-issue-in-deep-learning">Multiple minima are not an issue in deep learning</h3>
<p>We often illustrate gradient descent with a 2 dimensional picture. In two dimensions it’s easy to picture a curve with multiple minima where gradient descent (or any other optimization algorithm) finds a sub-optimal local minimum. This intuition doesn’t carry over to higher dimensions.</p>
<p>To have a local minimum, your directional derivative has to be 0 in all directions. In very high dimensions, the probability of that happening is fairly small. If you find a local minimum, chances are you hit the global one.</p>
<p>However, your surface will be full of (relatively) flat areas and saddle points (where your gradient is 0 in some directions, but not all). Flat surfaces and saddle points are problematic because they make your update steps tiny, and your optimization algorithm may not find a minimum in a reasonable amount of time.</p>
<h3 id="dont-use-rank-1-arrays">Don’t use rank 1 arrays</h3>
<p>If you sum an (n, n) matrix in numpy, the result is an (n,) rank 1 array. This can lead to bugs (what if you summed over the wrong axis?), and makes broadcasting behavior unclear. It’s better to avoid these rank 1 arrays for clarity.</p>
<p>Instead, we should sum to a (n, 1) or (1, n) vector. That makes it explicit over which dimension we are summing. We just need to use the option ‘keepdims = True’.</p>
<p>Also, we should pepper our code with assertions that check the shape of our arrays. (Putting ‘assert W.shape == (x, y)’ into our code). The computational overhead is minimal, and the assertions help us find errors quickly.</p>
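<p>Both points in a quick numpy illustration:</p>

```python
import numpy as np

A = np.ones((3, 4))

s = A.sum(axis=1)                   # rank 1 array, shape (3,)
col = A.sum(axis=1, keepdims=True)  # column vector, shape (3, 1)
row = A.sum(axis=0, keepdims=True)  # row vector, shape (1, 4)

# Cheap shape assertions catch summing over the wrong axis early.
assert col.shape == (3, 1)
assert row.shape == (1, 4)
```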
<h3 id="build-something-quickly-iterate-fast">Build something quickly, iterate fast</h3>
<p>Andrew’s recurring suggestion is to quickly iterate. Instead of trying to build a hyper-complex model from scratch, build a simple one, then look at where the largest margins of improvement are.</p>
<p>To facilitate quick iteration, a single number evaluation metric is extremely helpful. It’s hard to iterate if we have multiple metrics, each preferring a different model. To deal with multiple metrics we have two strategies.</p>
<p>First, we can combine them. For example, we can turn precision and recall into an f1 (or f2) score.</p>
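<p>The combination is a standard formula, shown here for illustration:</p>

```python
def f_beta(precision, recall, beta=1.0):
    # Weighted harmonic mean of precision and recall;
    # beta = 1 gives f1, beta = 2 weighs recall more heavily (f2).
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(0.8, 0.6)          # ~0.686
f2 = f_beta(0.8, 0.6, beta=2)  # ~0.632, recall counts for more
```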
<p>Alternatively, we can designate satisficing metrics. Satisficing metrics only have to be over (or under) a certain threshold. For example, we might say that the memory requirement of our model is a satisficing metric. We want our model to fit into memory, but beyond that we don’t care about its size. We would have our optimizing metric, say the f1 score, with the constraint that our model fits into memory.</p>
<h3 id="changing-proportions-of-traindevtest-sets-for-big-data">Changing proportions of train/dev/test sets for big data</h3>
<p>60/20/20 was a typical train/validation/test split proportion for machine learning in the pre-deep-learning era. However, with tens of millions of examples, this split doesn’t make sense anymore. After all, the validation set is only used for monitoring performance and optimizing hyper-parameters, and the test set only serves to calculate our error. Often, 1% of our data is enough for them.</p>
<h3 id="human-error-helping-with-biasvariance">Human error helping with bias/variance</h3>
<p>For many deep learning tasks it is helpful to know, at least approximately, what human error is. For naturally occurring tasks, such as image classification, human error is plausibly close to the <a href="https://en.wikipedia.org/wiki/Bayes_error_rate">Bayes optimal error</a>. Knowing human/Bayes error tells us how much our model can still improve.</p>
<p>Moreover, knowing human error helps us determine if our algorithm suffers from high bias. We normally treat the training set error as a measure of our algorithm’s bias, and the difference between our training and validation error as the algorithm’s variance.</p>
<p>That’s not fair though, as some of the training set error might be unavoidable. For example, when doing image classification, some of our training examples might be mislabeled. Even a perfect model would have training set error.</p>
<p>Instead, we should focus on the avoidable bias. Knowing human error comes in handy here. If our model has a 2% training error on an image classification task, but humans also have a 2% error (maybe because of mislabeled examples), we know that reducing bias is not the best margin for improvement. However, if humans have a ~0% error, it is worth trying bias reducing techniques.</p>
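<p>The bookkeeping is just two subtractions; the error values below are made up to mirror the example:</p>

```python
human_error = 0.02  # proxy for Bayes error (e.g. due to mislabeled images)
train_error = 0.05
dev_error = 0.10

avoidable_bias = train_error - human_error  # 0.03: bias reduction may help
variance = dev_error - train_error          # 0.05: variance is the bigger gap
```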
<h3 id="incorrectly-labeled-training-examples">Incorrectly labeled training examples</h3>
<p>Not necessarily a problem, as long as there aren’t too many of them, and the mislabeling isn’t systematic.</p>
<h3 id="what-to-do-if-traindev-sets-have-different-distributions">What to do if train/dev sets have different distributions</h3>
<p>Validation and test sets should always have the same distributions, but sometimes our training set can have a different one.</p>
<p>For example, suppose we are building an app that classifies pictures into cats vs. non-cats. Our app will be used for images recorded on a smartphone. We have 10000 such images that we can split into a validation and test set.</p>
<p>Our training set is slightly different. We have 1 million pictures downloaded from the internet. These aren’t exactly the same as the ones recorded on the phone: they might be higher resolution, the cat is usually in the middle, etc. But we have a lot more of them.</p>
<p>We fit a model. It has 2% training error, but the error on our validation set is a whopping 8%. Does our model suffer from high variance, or is it just our training and validation sets being different?</p>
<p>To answer this question, we can create a separate ‘train-dev’ set. This is data of the same distribution as our training set that isn’t used for training.</p>
<p>In our example, we would set aside a subset of the 1 million images from the internet. We use the rest for training. Now we can compare errors.</p>
<p>If our model has an 8% error on this train-dev set, we know that it suffers from high variance. We could try variance reducing techniques, such as increasing regularization. On the other hand, if it scores 2% on the train-dev set (same as the training error), we know that the problem is the differing distributions for training and validation. We need to collect more mobile phone data so we can train our model on that.</p>
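<p>This decision logic can be summarized in a tiny helper (the 2% gap threshold is arbitrary, chosen only for illustration):</p>

```python
def diagnose(train_err, train_dev_err, dev_err, gap=0.02):
    # Train-dev data comes from the training distribution but was held
    # out of training, so it isolates variance from data mismatch.
    if train_dev_err - train_err > gap:
        return "high variance"
    if dev_err - train_dev_err > gap:
        return "data mismatch"
    return "neither"

a = diagnose(0.02, 0.08, 0.08)  # jump happens already on train-dev
b = diagnose(0.02, 0.02, 0.08)  # jump only on differently-distributed dev
```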
<h3 id="few-shot-learning-with-siamese-network">Few shot learning with Siamese network</h3>
<p>Siamese networks look <a href="https://www.quora.com/What-are-Siamese-neural-networks-what-applications-are-they-good-for-and-why">awesome</a>.</p>
<h3 id="word-embeddings-dont-need-a-complex-model">Word embeddings don’t need a complex model</h3>
<p>A simple model like skip-grams suffices. To learn embeddings efficiently (avoiding the massive computation needed for a softmax classifier) we can use algorithms such as negative sampling or GloVe.</p>
<h1>Review: Blitzing Through the Coursera DL Specialization (2018-10-14)</h1>
<p>I recently completed Andrew Ng’s <a href="https://www.coursera.org/specializations/deep-learning">deep learning specialization</a> on Coursera.</p>
<p><strong>TLDR;</strong> it’s a great resource. If you are familiar with ‘vanilla’ machine learning and would like to understand deep learning, it’s the perfect place to start. The course teaches both basic DL theory, and tips and tricks on how to optimize your workflow. The lectures are really easy to understand and follow (great), and the coding assignments are also extremely simple (not so great). Coursera says that the course takes between 4-5 months to complete, but if you’re dedicated it should take less than one month.</p>
<h2 id="overview">Overview</h2>
<p>The specialization consists of 5 courses.</p>
<ol>
<li><strong>Neural Networks and Deep Learning.</strong> Covers basic feedforward networks. Explains shallow & deep networks, parameter initialization, gradient descent, backpropagation, but doesn’t dwell too much on the math and proofs.</li>
<li><strong>Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization.</strong> This module explains the ideas that make NNs work well in practice. One such topic is regularization: weight decay, dropout, batch norm. Enhanced optimization algorithms are also covered: momentum, RMSprop, ADAM. Plus some miscellaneous topics: hyperparameter tuning, exploding/vanishing gradients, gradient checking, etc.</li>
<li><strong>Structuring Machine Learning Projects.</strong> A short module about overall DL project strategy and practical tips and tricks. Topics cover different aspects of what to focus on and how to analyze the errors of your algorithm.</li>
<li><strong>Convolutional Neural Networks.</strong> A big overview of computer vision. Goes over building blocks like convolutions, padding, strides, pooling; covers common CNN architectures; and also explains topics such as object-detection, Siamese networks, or neural style transfer.</li>
<li><strong>Sequence Models.</strong> Exactly what you would expect: RNNs, GRUs, LSTMs. This course also goes into the different ways to create word embeddings. At the end, it explains attention models.</li>
</ol>
<h2 id="prerequisites">Prerequisites</h2>
<p>The first course assumes no background, but to get most out of it, you should know some machine learning and linear algebra.</p>
<p>Each subsequent course builds on the knowledge of the previous one, so it’s best to take them as a sequence.</p>
<h2 id="lectures">Lectures</h2>
<p>I don’t have much to say about them; they are fantastic. Andrew’s explanations are simple, even for somewhat advanced concepts. He doesn’t shy away from using math, but limits it to where it makes a practical difference (we are spared the proof of backpropagation). Sometimes he discusses code, but not too often; that is left to the assignments. The lectures are made up of ~10min long videos - the perfect length.</p>
<h2 id="programming">Programming</h2>
<p>Programming assignments start with numpy in the first course; no DL frameworks just yet. I find that great: coding networks up from scratch really makes you understand the building blocks.</p>
<p>Starting with course 2, you are introduced first to tensorflow, then to keras. You will still need to write numpy code from time to time, but coding in the frameworks takes up more and more of the emphasis. I think the timing of the transition is well-placed.</p>
<p>I found the assignment topics also interesting. For example, classifying cats vs non-cats (of course), transferring the impressionistic style of Monet to a picture of the Louvre, generating new dinosaur names, generating jazz music, and more.</p>
<p>I do have a beef with the assignments: they are way too easy, involving too much hand-holding. Usually, you are given an empty function that you have to fill in based on instructions and pseudo-code. Very often you only have to write a few lines of code, and much of the assignments can be completed by someone with no knowledge of DL who just reads the pseudo-code carefully.</p>
<h2 id="blitzing-through-the-course">Blitzing through the Course</h2>
<p>You can audit the course for free. Auditing means you can watch all the videos, but can’t take the quizzes and programming assignments. If you want to do those, you have to sign up. You get 7 days for free, after that you’ll have to pay.</p>
<p>I was inclined to finish the course quickly, so I decided to first audit the class and watch most of the lectures. Then I signed up, and aimed to complete quizzes plus programming in a week. It wasn’t too hard to do that, most assignments take 2 hours tops.</p>
<p>Once you complete everything you get a lovely certificate:</p>
<p><img src="/assets/img/dl_certificate.jpg" alt="png" /></p>
<p>I don’t think the certificate is worth much, but it feels good to have completed this course. Like Andrew’s (now somewhat aged) machine learning course, I expect it to become the standard go-to beginner’s tutorial.</p>
<h1>RL Notes 4: Temporal Differences (2018-10-07)</h1>
<p><a href="/rl-notes-3-exporation-vs-exploitation/">Previous week’s notes</a></p>
<p>(You can find the notebook containing the code <a href="https://github.com/kk1694/rl_course/blob/master/rl_notes_4.ipynb">here</a>)</p>
<p>One drawback of the MC methods covered <a href="https://krisztiankovacs.com/reinforcement_learning/2018/09/30/rl-notes-3-exporation-vs-exploitation.html">last week</a> is that they need to simulate complete episodes of an environment (start to finish) before updating our policy. That’s fine if individual episodes are short. However, usually we don’t want to wait until the end of an episode to update our policy. Also, some environments are continuous.</p>
<p>Why not update the policy as we go through the task? That’s the essence of temporal difference methods (TD). TD is to Monte Carlo what stochastic gradient descent is to batch gradient descent. Update after each round, don’t wait until the end. One TD method is Q-learning. It is really simple, but is often described in an overcomplicated way.</p>
<p>Reminder: the Q function is a mapping that gives the value of state-action combinations. To obtain the best action in a state, look up the maximum Q value in that state (among available actions).</p>
<p>We start with randomly initializing Q. After every action, we update it according to:</p>
<script type="math/tex; mode=display">Q(s_t, a_t) = (1 - \alpha)Q(s_t, a_t) + \alpha (r_t + \gamma * \max_{a} Q(s_{t+1}, a) )</script>
<p>Or, equivalently:</p>
<script type="math/tex; mode=display">Q(s_t, a_t) = Q(s_t, a_t) + \alpha (r_t + \gamma * \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t))</script>
<p>where</p>
<ul>
<li><script type="math/tex">Q(., .)</script> is the value of an action in a given state,</li>
<li><script type="math/tex">\alpha</script> is the learning rate (how quickly we update values),</li>
<li><script type="math/tex">r_t</script> is the immediate reward at time <em>t</em>,</li>
<li><script type="math/tex">\gamma</script> is the discount factor</li>
</ul>
<p>In English: after every action we update our Q function. We update it with the immediate reward, plus the discounted value of the best action in the next time-step.</p>
<p>The learning rate controls to what degree we update vs. keep the old value. A learning rate of 0 means no updating at all, a learning rate of 1 means perfect replacement. We don’t want perfect replacement, as the reward can be stochastic.</p>
<p>One trick: we should decrease the learning rate over time. The more often we take an action in a given state, the more certain we are about its reward distribution. In the example below, I apply the following learning rate schedule:</p>
<script type="math/tex; mode=display">\alpha_{t} = \frac{\alpha_0}{1 + \alpha_{taper} * count(s_t, a_t)}</script>
<p>where <script type="math/tex">count(s, a)</script> gives the number of times the agent did action <em>a</em> in state <em>s</em>. The hyper-parameters <script type="math/tex">\alpha_0</script> and <script type="math/tex">\alpha_{taper}</script> give the initial learning rate, and how fast I taper it to 0.</p>
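<p>Putting the update rule and the schedule together in a tabular sketch (the environment sizes and hyper-parameter values below are illustrative, not taken from the notebook):</p>

```python
import numpy as np

n_states, n_actions = 4, 2
gamma = 0.9                      # discount factor
alpha0, alpha_taper = 0.5, 0.01  # schedule hyper-parameters

Q = np.zeros((n_states, n_actions))
counts = np.zeros((n_states, n_actions))

def update(s, a, r, s_next):
    # Learning rate shrinks the more often (s, a) has been visited.
    alpha = alpha0 / (1 + alpha_taper * counts[s, a])
    counts[s, a] += 1
    # TD update: move Q(s, a) toward r + gamma * max_a' Q(s_next, a').
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

update(0, 1, 1.0, 2)  # Q[0, 1] becomes 0.5 * (1.0 + 0) = 0.5
```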
<h2 id="cheating-roulette-again">Cheating Roulette again</h2>
<p>I’m going to use the same <a href="https://github.com/kk1694/rl_course/blob/master/rl_notes_3.ipynb">cheating roulette</a> environment as last week.</p>
<p>I’ll only implement the <em><script type="math/tex">\epsilon</script>-greedy</em> policy, but the other ones are easy to adapt as well.</p>
<p><img src="/assets/img/rl_notes_4/output_12_1.png" alt="png" /></p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Actions taken by agent in different rounds
</code></pre></div></div>
<p><img src="/assets/img/rl_notes_4/output_13_1.png" alt="png" /></p>
<p>The agent learns remarkably fast. That’s no surprise, as we update after each action instead of waiting for the end of the episode.</p>
<h1>Deep Learning Journey 1 Month Review (2018-10-03)</h1>
<p>It has been a busy month. After planning out <a href="/dl-journey-overview/">how to approach deep learning</a>, I dove straight into it.</p>
<p>I started with reviewing math and Python syntax. For the needed math, I skimmed through the first part of <a href="http://www.deeplearningbook.org/">Goodfellow’s deep learning book</a>, and for Python syntax I went through <a href="https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp">Udemy’s Python for data science</a> course.</p>
<p>After that I started completing the excellent fast.ai course (part 1). For each topic, I tried to recreate the notebook from the lectures on a different dataset (such as <a href="/harry-potter-image-classification/">this</a> Harry Potter image classification example). I was progressing faster than scheduled, having almost finished the course by the end of week 2.</p>
<p>Making good progress, I decided to commit to additional projects that popped up right around this time.</p>
<p>The first is the <a href="https://www.kaggle.com/c/inclusive-images-challenge">Inclusive Images</a> Kaggle competition. Besides being a giant image classification challenge, it has the interesting property of evaluating the submitted models on geographically distinct test data (relative to the training set). I’m looking forward to testing out some of my ideas on how to make image models generalize better.</p>
<p>Google generously donated $1000 worth of computing credits to all participants in this challenge. I obviously wanted to take advantage of this offer, so I spent a good amount of time navigating their cloud options to set up a virtual machine. Getting the machine ready took a surprisingly long amount of time (setting up the VM, configuring SSH, figuring out their network settings to access Jupyter notebooks, downloading the 500+ GB dataset, etc.).</p>
<p>A second unexpected development I decided to take advantage of is Siraj Raval’s <a href="https://www.youtube.com/watch?v=fRmZck1Dakc">Move 37</a> course on reinforcement learning (RL). Originally, I planned to study RL sometime between week 5 to 10, but since this new course just started I thought I would front-load learning the topic.</p>
<p>I am also changing some of my plans going forward. Jeremy Howard recently announced a fresh version of their fast.ai course (the one I just completed) starting in October. A lot of the content is expected to change, so I’ll definitely tag along. However, I don’t plan to commit too much to part 2 of their current version (as the library will change), and will wait until the next iteration. I plan to watch the lectures for ideas though.</p>
<p>Overall, I’m quite happy with how this month turned out. My greatest improvement has been with pytorch - a library I knew nothing about a month ago, and by now I feel quite confident using it to implement most of the basic DL models.</p>
<h1>RL Notes 3: Exploration vs Exploitation (2018-09-30)</h1>
<p><a href="/rl-notes-2/">Previous week’s notes</a></p>
<p>(You can find the notebook containing the code <a href="https://github.com/kk1694/rl_course/blob/master/rl_notes_3.ipynb">here</a>)</p>
<p>Until now, we used environments where we had perfect information: we knew the rewards and transition probabilities. In those contexts, we could use policy or value iteration to find our optimal policy.</p>
<p>Of course, the interesting cases are when we don’t have any information whatsoever. We don’t know the transition probabilities, and we may not even know the rewards.</p>
<p>Let’s briefly review our setting. We’re playing a game. There are certain states we can be in, and actions we can take. We don’t know the result of an action. Sometimes we get a reward. We don’t know when this happens, but we want more of it.</p>
<p>How do we solve situations like this? We solve them by repeatedly simulating the environment. We use our results to estimate the unknown quantities (the fancy name for this technique is Monte Carlo methods).</p>
<p>I discuss some concrete strategies below, but one theme will be common to them all:</p>
<h3 id="the-exploration-exploitation-dilemma">The exploration-exploitation dilemma</h3>
<p>At any point we have two options available to us: we can <strong>explore</strong> the possibilities we have, or we can <strong>exploit</strong> the policy we judged so far to be the best. Most strategies will start with exploration in the beginning so that we can gather information. As we do so, we gradually shift towards exploitation of the strategy we judged to be the best.</p>
<p>If we keep on exploring for too long, we waste valuable opportunities to collect rewards.</p>
<p>If we exploit too early, we may be locked into a sub-optimal solution – had we gathered more information, we could have discovered a better policy.</p>
<h3 id="strategies-illustrated-here">Strategies illustrated here</h3>
<p>In this post, I will discuss the following, very common policies:</p>
<ul>
<li>Greedy</li>
<li>Explore First, then Greedy,</li>
<li>Epsilon-Greedy</li>
<li>Optimistic Initialization</li>
<li>Optimism in the Face of Uncertainty</li>
<li>Thompson Sampling</li>
</ul>
<h3 id="the-q-function">The Q Function</h3>
<p>Many policies will refer in one way or another to the Q function. This function maps state-action combinations to average returns. It is basically a big lookup table.</p>
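<p>A minimal sketch of such a lookup table (the class name and layout here are hypothetical, not the notebook’s actual <code>Policy</code> base class): each (state, action) cell holds a running average of the returns observed so far, updated incrementally.</p>

```python
from collections import defaultdict

class QTable:
    """Hypothetical tabular Q: maps (state, action) to a running average return."""
    def __init__(self):
        self.Q = defaultdict(float)   # average return observed so far
        self.n = defaultdict(int)     # number of samples per (state, action)

    def update(self, s, a, ret):
        # Incremental mean: avg += (x - avg) / n, so no history needs storing
        self.n[(s, a)] += 1
        self.Q[(s, a)] += (ret - self.Q[(s, a)]) / self.n[(s, a)]

q = QTable()
for r in [1.0, -1.0, -1.0]:
    q.update(0, 7, r)   # three plays of number 7
# q.Q[(0, 7)] is now the average of the three returns, -1/3
```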
<h2 id="the-environment-cheating-roulette">The Environment: Cheating Roulette</h2>
<h3 id="openais-simple-roulette">Openai’s simple roulette</h3>
<p><img src="/assets/img/rl_notes_3/images.jpeg" alt="jpeg" /></p>
<p>Openai contains a simplified roulette environment. The rules are as follows:</p>
<ul>
<li>You have a roulette wheel with numbers 0-36.</li>
<li>Every round you can select a number, or you can leave the roulette table, ending the game.</li>
<li>If you select 0, and it comes up, you receive $36. If it doesn’t come up, you lose $1.</li>
<li>If you select a non-zero number, you receive $1 if the parity of your number matches the parity of the roll (0 doesn’t count). Otherwise, you lose $1. Example: if you select 5 and 17 is rolled, you receive $1.</li>
<li>Each number has the same probability of showing up.</li>
</ul>
<p>Simple math will tell you that this roulette has a negative expected value (just as in real life). The best action the agent can take is simply leave the table (again, as in real life).</p>
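<p>The “simple math” can be checked directly from the payoff rules as stated above (a sketch, not the notebook’s code):</p>

```python
# Expected value per round on the fair wheel: 37 equally likely numbers.
p = 1 / 37

# Betting 0: win $36 with probability 1/37, otherwise lose $1.
ev_zero = 36 * p - 1 * (1 - p)          # exactly break-even

# Betting a non-zero number: 18 of the 37 outcomes match its parity
# (0 counts as a loss), so win $1 w.p. 18/37 and lose $1 w.p. 19/37.
ev_parity = 1 * (18 / 37) - 1 * (19 / 37)   # about -$0.027 per round
```

So under these payoff rules a parity bet loses about $0.027 per round, and no bet beats simply leaving the table.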
<h3 id="cheating-roulette-modification">Cheating roulette modification</h3>
<p>That’s not a very interesting setting, however. So we will modify this roulette. In particular, we will change it so that</p>
<ul>
<li>0 has a 0.033 probability of showing up (instead of ~0.027).</li>
<li>Every other number is equally likely.</li>
<li>The payoff is the same as in the original game.</li>
</ul>
<p>The effect of this change is that playing <strong>0</strong> has now positive expected value. However, it is extremely difficult to find out about this feature. After all, even with the modification, 0 will only be rolled once in 30 rounds (on average).</p>
<p>Moreover, an agent that doesn’t know about the <strong>0</strong> cheat (and plays numbers randomly) still faces a negative expected value. That makes leaving the table attractive, although in this case suboptimal.</p>
<p>Basically, I have set up an environment where it is easy for an agent to get trapped in a sub-optimal solution (leaving the table).</p>
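<p>Plugging the boosted probability into the same payoff rule shows why <strong>0</strong> is now the best action (a sketch using the 0.033 stated above):</p>

```python
# Expected value per round of betting 0 on the cheating wheel.
p_zero = 0.033                    # boosted from the fair ~1/37 ≈ 0.027
ev_cheat = 36 * p_zero - 1 * (1 - p_zero)
# = 1.188 - 0.967 = 0.221 per round, i.e. about +$22 over a 100-round game
```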
<h3 id="the-simulations">The Simulations</h3>
<p>I will consider a game of roulette to be 100 rounds. The aim is to maximize the money an agent makes in these 100 rounds.</p>
<p>Each strategy will play 10 000 games (each 100 rounds). After each game, the agent will be able to update its policy. (Why not update after each round? Because that would increase computation time, as some strategies rely on the entire history of the game.)</p>
<h2 id="the-policies">The Policies</h2>
<h3 id="basic-benchmark-policies">Basic Benchmark Policies</h3>
<p>Before we discuss the strategies listed above, let’s have some useful benchmarks.</p>
<p>The first option is to always leave the table; never play. Obviously, that will have a return of 0.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DontPlay</span><span class="p">(</span><span class="n">Policy</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">pi</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="mi">37</span> <span class="c"># This action is leaving the table</span>
</code></pre></div></div>
<p>More interestingly, what if we play in ‘God mode’, knowing the <strong>0</strong> cheat? In that case, we will earn, on average, about $22 per game (per 100 rounds).</p>
<p>Below is a plot of the money we earn in each simulated game. Note that even knowing the cheat, in most games we still lose money, as we need 4 zeros per game to get a positive profit. Our positive average is due to the few games where zeros came up many times.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">OptimalPlay</span><span class="p">(</span><span class="n">Policy</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">pi</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="mi">0</span>
</code></pre></div></div>
<p><img src="/assets/img/rl_notes_3/output_17_1.png" alt="png" /></p>
<p>Just to check, what happens if we always select a random number? The return is somewhat negative: we lose money on average, about $1 per game (per 100 rounds).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">RandPolicy</span><span class="p">(</span><span class="n">Policy</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">pi</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">action_space</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/assets/img/rl_notes_3/output_20_1.png" alt="png" /></p>
<h3 id="greedy-policy">Greedy Policy</h3>
<p>Our simplest policy is a greedy policy. A greedy policy simply plays the number with the highest Q value (the highest historical return). Generally, it’s not a good choice, as it may lock into a sub-optimal solution. (There is a chance that the greedy policy finds the optimal solution, just not a reliably large one.)</p>
<p>Benefits</p>
<ul>
<li>Easy to implement, easy to explain.</li>
<li>No hyperparameters to tune.</li>
</ul>
<p>Drawbacks</p>
<ul>
<li>No exploration, often gets stuck at sub-optimal play.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Greedy</span><span class="p">(</span><span class="n">Policy</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">pi</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">getMaxQ</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/assets/img/rl_notes_3/output_23_1.png" alt="png" /></p>
<p>As we can see, for the first couple of rounds our policy follows essentially random choices, as some numbers, just by chance, got positive rewards. Very quickly, however, the policy concludes that it’s best to leave the game - and earns 0 reward afterward.</p>
<p>Below I plot the actions taken in the first and last 1000 rounds.</p>
<p><img src="/assets/img/rl_notes_3/output_25_1.png" alt="png" /></p>
<h2 id="explore-first-then-greedy">Explore First Then Greedy</h2>
<p>Since our greedy policy had the drawback of not doing enough exploration, let’s add explicit exploration rounds.</p>
<p>We will let our policy follow random actions for the first <strong>k</strong> games. After <strong>k</strong> games, it will follow a greedy policy.</p>
<p>Benefits</p>
<ul>
<li>Simple to implement and explain.</li>
<li>Includes both exploration and exploitation.</li>
</ul>
<p>Drawbacks</p>
<ul>
<li>Need to choose k manually.</li>
<li>It is a priori unclear how much exploration/exploitation one should do in a given game.</li>
<li>Exploration isn’t done intelligently, it just chooses random actions.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ExpFirst</span><span class="p">(</span><span class="n">Policy</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">exp_num</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">exp_num</span> <span class="o">=</span> <span class="n">exp_num</span>
    <span class="k">def</span> <span class="nf">pi</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">episode</span> <span class="o"><=</span> <span class="bp">self</span><span class="o">.</span><span class="n">exp_num</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">action_space</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">getMaxQ</span><span class="p">()</span>
</code></pre></div></div>
<p>In this example, we explore for the first 1000 games. In this particular case, the policy managed to find the optimum afterwards. Note, however, that there is still luck involved: there is no guarantee that replaying the same policy would find the optimum again (it depends on how often 0 lands in the first 1000 games).</p>
<p><img src="/assets/img/rl_notes_3/output_30_1.png" alt="png" /></p>
<p>The actions taken in the first and last 1000 rounds:</p>
<p><img src="/assets/img/rl_notes_3/output_31_1.png" alt="png" /></p>
<p>Similar results with 2000 games of exploration.</p>
<p><img src="/assets/img/rl_notes_3/output_33_1.png" alt="png" /></p>
<p>Another pitfall: exploring too long. Here we explore for 5000 games (half the time) before exploiting. This makes us more likely to find the optimal policy, but it reduces our exploitation opportunities: our average profit is little more than half of that in the first case.</p>
<p><img src="/assets/img/rl_notes_3/output_37_1.png" alt="png" /></p>
<h2 id="epsilon-greedy">Epsilon-Greedy</h2>
<p>Another way to modify the pure greedy strategy is to add some noise to it. Specifically, follow the greedy strategy with probability <script type="math/tex">(1-\epsilon)</script>, and do a random action with probability <script type="math/tex">\epsilon</script>.</p>
<p>Over time, we want to decrease <script type="math/tex">\epsilon</script>, so that we gradually shift from exploration to exploitation. One way to parameterize:</p>
<script type="math/tex; mode=display">\epsilon = \frac{k}{n}</script>
<p>where n is the number of games we have played so far and k is a constant we choose.</p>
<p>Benefits</p>
<ul>
<li>Still relatively simple to implement and explain.</li>
<li>Even if it gets stuck at a suboptimal solution, it has the chance to ‘break out’.</li>
<li>May adapt to changing environments.</li>
</ul>
<p>Drawbacks</p>
<ul>
<li>Unintelligent exploration</li>
<li>Need to choose hyper-parameters well.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">EpsGreedy</span><span class="p">(</span><span class="n">Policy</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">k</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">k</span> <span class="o">=</span> <span class="n">k</span>
    <span class="k">def</span> <span class="nf">pi</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">epsilon</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">k</span> <span class="o">/</span> <span class="nb">max</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">episode</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">()</span> <span class="o"><=</span> <span class="n">epsilon</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">action_space</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">getMaxQ</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/assets/img/rl_notes_3/output_42_1.png" alt="png" /></p>
<p><img src="/assets/img/rl_notes_3/output_43_1.png" alt="png" /></p>
<p>A demonstration of the luck involved: even though we explored twice as long as in the previous case, we were unlucky and didn’t find the optimal solution.</p>
<p><img src="/assets/img/rl_notes_3/output_46_1.png" alt="png" /></p>
<p><img src="/assets/img/rl_notes_3/output_47_1.png" alt="png" /></p>
<p>With more exploration we find the optimum, but we reduced our profit by exploring for too long.</p>
<p><img src="/assets/img/rl_notes_3/output_50_1.png" alt="png" /></p>
<p><img src="/assets/img/rl_notes_3/output_51_1.png" alt="png" /></p>
<h2 id="optimistic-initialization">Optimistic Initialization</h2>
<p>This is not so much a stand-alone strategy, as it is a hack on our greedy policy. We set the initial Q values to be unrealistically high, then follow a greedy strategy. The effect of this hack is that our policy will try out all available actions first, as our initialized Q values are higher than what the policy can achieve.</p>
<p>While strictly better than a pure greedy strategy, it suffers from all its drawbacks.</p>
<p>Note however that this hack is available for almost any policy, not just greedy. If we want to explore all actions before committing to our policy, we initialize all Q’s to be unrealistically high. If the number of possible actions isn’t too high, this is a reasonable thing to do.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">optInitPolicy</span><span class="p">(</span><span class="n">Policy</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">Q</span> <span class="o">=</span> <span class="mi">100</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="bp">self</span><span class="o">.</span><span class="n">nS</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">nA</span><span class="p">))</span>
    <span class="k">def</span> <span class="nf">pi</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">getMaxQ</span><span class="p">()</span>
</code></pre></div></div>
<p>In this case, the hack didn’t make a difference relative to the greedy policy.</p>
<p><img src="/assets/img/rl_notes_3/output_56_1.png" alt="png" /></p>
<p><img src="/assets/img/rl_notes_3/output_57_1.png" alt="png" /></p>
<h2 id="optimism-in--the-face-of-uncertainty">Optimism in the Face of Uncertainty</h2>
<p>Our policies so far haven’t explored the action space in a clever way. They just followed random actions.</p>
<p>We can do better. To give some intuition, consider the following two action histories:</p>
<ul>
<li>Number 7 yielded $1 once, and $-1 twice.</li>
<li>Number 8 yielded $1 100 times, and $-1 200 times.</li>
</ul>
<p>The average return is the same in both cases (-0.33). However, we’re a lot more confident that number 8 has a negative expected return, since we played it 100 times more often. Number 7 could have just had a few unlucky spins.</p>
<p>Thus, we will favor uncertain actions. In the above example, we favor 7. If 7 is indeed a bad number, successive trials will prove so.</p>
<p>We select a <strong>p</strong> such that an action not chosen has at most probability <strong>p</strong> of being the optimal one. We then use <a href="https://en.wikipedia.org/wiki/Hoeffding%27s_inequality#Special_case_of_Bernoulli_random_variables">Hoeffding’s Inequality</a> to choose our action according to</p>
<script type="math/tex; mode=display">\underset{a}{\operatorname{argmax}} \left( Q(s, a) + H(s, a) \right)</script>
<p>where H is a measure of uncertainty. In the case of Bernoulli random variables, it is equal to</p>
<script type="math/tex; mode=display">H(s, a) = \sqrt{\frac{-\log p}{2 n_{sa}}}</script>
<p>where <script type="math/tex">n_{sa}</script> is the number of times the given state-action pair has been tried.</p>
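<p>Applied to the 7-vs-8 example above (a quick numerical sketch with p = 0.05; the counts are the illustrative ones from that example):</p>

```python
import math

p = 0.05                                          # tolerated miss probability
H = lambda n: math.sqrt(-math.log(p) / (2 * n))   # uncertainty bonus

Q7, n7 = -1/3, 3      # number 7: played 3 times, average return -1/3
Q8, n8 = -1/3, 300    # number 8: played 300 times, same average

# Same Q, but 7's bonus is sqrt(100) = 10x larger, so 7 is explored first.
score7 = Q7 + H(n7)   # ≈  0.373
score8 = Q8 + H(n8)   # ≈ -0.263
```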
<p>In our case, the formula doesn’t quite work, as the payoff of playing <strong>0</strong> isn’t a [0, 1] (or [-1, 1]) variable. Our policy will reduce the uncertainty in the probability that a number appears, but it won’t reduce the uncertainty in the expected value (which is what we care about).</p>
<p>Usually, this strategy is not a bad approach (unless the payoff distribution is extremely skewed, as in this case).</p>
<p>Benefits</p>
<ul>
<li>Explores the action space intelligently, by favoring actions with uncertain payoffs.</li>
<li>Hyperparameter has a meaningful interpretation.</li>
</ul>
<p>Drawbacks</p>
<ul>
<li>Formula is only exact for [0, 1] binary variables.</li>
<li>Doesn’t handle highly skewed payoffs well.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">optUnc</span><span class="p">(</span><span class="n">Policy</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">p</span> <span class="o">=</span> <span class="mf">0.05</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">H</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">Q</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">H</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">37</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">p</span> <span class="o">=</span> <span class="n">p</span>
    <span class="k">def</span> <span class="nf">pi</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">Q</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">Q</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">state</span><span class="p">,</span> <span class="p">:]</span>
        <span class="n">H</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">H</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">state</span><span class="p">,</span> <span class="p">:]</span>
        <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">Q</span><span class="o">+</span><span class="n">H</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">updateH</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">nS</span><span class="p">):</span>
            <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">nA</span><span class="p">):</span>
                <span class="k">if</span> <span class="p">(</span><span class="ow">not</span> <span class="bp">self</span><span class="o">.</span><span class="n">returns</span><span class="p">[(</span><span class="n">s</span><span class="p">,</span> <span class="n">a</span><span class="p">)]</span> <span class="o">==</span> <span class="p">[])</span> <span class="o">&</span> <span class="p">(</span><span class="n">a</span> <span class="o">!=</span> <span class="mi">37</span><span class="p">):</span>
                    <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">returns</span><span class="p">[(</span><span class="n">s</span><span class="p">,</span> <span class="n">a</span><span class="p">)])</span>
                    <span class="bp">self</span><span class="o">.</span><span class="n">H</span><span class="p">[</span><span class="n">s</span><span class="p">,</span> <span class="n">a</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">p</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">n</span><span class="p">))</span>
    <span class="k">def</span> <span class="nf">callOnEnd</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="o">.</span><span class="n">updateH</span><span class="p">()</span>
</code></pre></div></div>
<p>Voila: after only a few rounds of exploring, we find the best policy.</p>
<p><img src="/assets/img/rl_notes_3/output_62_1.png" alt="png" /></p>
<h2 id="thompson-sampling">Thompson Sampling</h2>
<p>Here is another way to intelligently explore: let’s sample actions by the probability that they are optimal. This way, actions that had good historical returns will be favored, but we will also occasionally sample actions that didn’t have good returns.</p>
<p>Specifically, we will assume a prior distribution for the expected payoff for each number. In this example, we use a Beta distribution with parameters (1, 1). Then, as we collect rewards, we update these parameters.</p>
<p>In each round, we take a sample from the distribution of each action. We choose the action with the highest sampled value.</p>
<p>Benefits</p>
<ul>
<li>Intelligently explores the action space.</li>
</ul>
<p>Drawbacks</p>
<ul>
<li>Doesn’t handle data skew well (at least not with a beta prior).</li>
</ul>
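<p>Before the full policy class, the core idea can be sketched on a standalone two-armed Bernoulli bandit (the arm probabilities here are made up for illustration): keep a Beta posterior per arm, sample once from each, play the argmax, and update the winner’s posterior.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = [0.2, 0.8]              # hidden win probabilities of the two arms
alphas = np.ones(2)              # Beta(1, 1) priors over each arm's payoff
betas = np.ones(2)
pulls = np.zeros(2, dtype=int)

for _ in range(2000):
    # One posterior sample per arm; play the arm with the highest sample.
    samples = rng.beta(alphas, betas)
    a = int(np.argmax(samples))
    reward = rng.random() < p_true[a]
    # Posterior update: a success bumps alpha, a failure bumps beta.
    alphas[a] += reward
    betas[a] += 1 - reward
    pulls[a] += 1

# The better arm ends up being played the vast majority of the time,
# yet the worse arm still gets sampled occasionally early on.
```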
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ThompSamp</span><span class="p">(</span><span class="n">Policy</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">alphas</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">Q</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">betas</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">Q</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">pi</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">s</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">state</span>
        <span class="n">n</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">alphas</span><span class="p">[</span><span class="n">s</span><span class="p">,</span> <span class="p">:]</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">sampl</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="o">/</span><span class="mi">2</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span>
            <span class="n">sampl</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">beta</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">alphas</span><span class="p">[</span><span class="n">s</span><span class="p">,</span> <span class="n">i</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">betas</span><span class="p">[</span><span class="n">s</span><span class="p">,</span> <span class="n">i</span><span class="p">])</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">current_action</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">sampl</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">current_action</span>
    <span class="k">def</span> <span class="nf">callOnEnd</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">state</span><span class="p">,</span> <span class="n">action</span><span class="p">,</span> <span class="n">G</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">states_actions_returns</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">G</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">alphas</span><span class="p">[</span><span class="n">state</span><span class="p">,</span> <span class="n">action</span><span class="p">]</span> <span class="o">+=</span> <span class="n">G</span>
            <span class="k">elif</span> <span class="n">G</span> <span class="o"><</span> <span class="mi">0</span><span class="p">:</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">betas</span><span class="p">[</span><span class="n">state</span><span class="p">,</span> <span class="n">action</span><span class="p">]</span> <span class="o">+=</span> <span class="o">-</span><span class="n">G</span>
</code></pre></div></div>
<p>This graph is interesting. Only towards the second half does the agent swing to the optimal policy. The likely reason: until then zeros didn’t have a good track record (remember, they only come up once in 30 turns on average). Not having a good track record made it less likely that they got selected.</p>
<p>However, by (weighted) random sampling, every now and then we did select a zero. And a single success caused a major update to the distribution weights. Zeros got selected more and more often, and a positive cascade built up.</p>
<p><img src="/assets/img/rl_notes_3/output_66_1.png" alt="png" /></p>
<p><img src="/assets/img/rl_notes_3/output_67_1.png" alt="png" /></p>
<h1 id="rl-notes-2-policy-and-value-iteration">RL Notes 2: Policy and Value Iteration (2018-09-21)</h1>
<p><a href="/rl-notes-1/">Previous week’s notes</a></p>
<p>(You can find the notebook containing the code <a href="https://github.com/kk1694/rl_course/blob/master/rl_notes_2.ipynb">here</a>, and a tutorial I found really useful is <a href="https://towardsdatascience.com/reinforcement-learning-demystified-solving-mdps-with-dynamic-programming-b52c8093c919">here</a>)</p>
<p>This week was all about two tools to solve MDPs: <strong>policy iteration</strong> and <strong>value iteration</strong>. Both of them are applications of the Bellman equation. Here I will try to explain both without any math, and without any Bellman equations.</p>
<h2 id="frozen-lake">Frozen Lake</h2>
<p>The example I’ll use is the frozen lake environment of Openai.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">env</span> <span class="o">=</span> <span class="n">gym</span><span class="o">.</span><span class="n">make</span><span class="p">(</span><span class="s">'FrozenLake-v0'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/img/rl_notes_2/output_2_0.png" alt="png" /></p>
<p>It is a similar gridworld to the one covered <a href="https://krisztiankovacs.com/reinforcement_learning/2018/09/19/rl-notes-1.html">last week</a>. We start at the top left (at the X), and try to move to the orange square in the bottom right, where we get our reward of +1 and the game ends. If we move to a black square, we fall into a hole: the game ends without a reward.</p>
<p>There is one important complication: as we move on the blue tiles, we might slip to the side and end up in a tile different from the one we intended. There is a 33% chance that we move as intended, and a 33% chance of slipping to each side. We never slip in the direction opposite to our action, though.</p>
<p>As mentioned above, there are two ways to solve this game.</p>
<h1 id="value-iteration">Value Iteration</h1>
<p>Our goal here is to find the state value function under optimal policy. That means each square will be assigned a number, representing its ‘value’. Once we have that, finding the optimal policy will be easy.</p>
<p>How do we find the state value function? We work iteratively. We first initialize all values to 0. We then find the optimal action for each state. The optimal action is the one that maximizes the expected return (the immediate reward + the discounted future value). Once we have that, we assign to the state the maximum expected return (among all actions). We iterate until the state value function converges.</p>
<p>To make our agent prefer shorter paths (i.e. finish the game sooner rather than later), we set a discount factor of 0.99.</p>
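<p>A quick numeric illustration of why discounting favors shorter paths (the step counts here are made up for the example): the same +1 reward is worth less the later it arrives.</p>

```python
# With discount factor g, a reward of +1 collected after n steps is worth g**n now.
g = 0.99
short_path = g ** 5   # reach the goal in 5 steps
long_path = g ** 20   # reach the same goal in 20 steps
assert short_path > long_path  # the shorter path has the higher present value
print(round(short_path, 3), round(long_path, 3))  # 0.951 0.818
```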
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">getExpectedActionValue</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">V</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">):</span>
    <span class="s">'''Gives expected return of action "a" in state "s" (immediate reward + discounted future value)'''</span>
    <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="n">prob</span><span class="o">*</span><span class="p">(</span><span class="n">reward</span> <span class="o">+</span> <span class="n">g</span><span class="o">*</span><span class="n">V</span><span class="p">[</span><span class="n">state</span><span class="p">])</span> <span class="k">for</span> <span class="n">prob</span><span class="p">,</span> <span class="n">state</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">env</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">P</span><span class="p">[</span><span class="n">s</span><span class="p">][</span><span class="n">a</span><span class="p">])</span>

<span class="k">def</span> <span class="nf">getOptimalFromState</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">V</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">):</span>
    <span class="s">'''Gives the best possible return in state "s" from all actions'''</span>
    <span class="n">vals</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">getExpectedActionValue</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">V</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">)</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">env</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">P</span><span class="p">[</span><span class="n">s</span><span class="p">]</span><span class="o">.</span><span class="n">keys</span><span class="p">()])</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">vals</span><span class="o">.</span><span class="nb">max</span><span class="p">(),</span> <span class="n">vals</span><span class="o">.</span><span class="n">argmax</span><span class="p">())</span>

<span class="k">def</span> <span class="nf">getNextV</span><span class="p">(</span><span class="n">V</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">pi</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
    <span class="s">'''Gives the next iteration of the value function.'''</span>
    <span class="n">res</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">V</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">env</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">nS</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">pi</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">res</span><span class="p">[</span><span class="n">s</span><span class="p">],</span> <span class="n">_</span> <span class="o">=</span> <span class="n">getOptimalFromState</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">V</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">res</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="o">=</span> <span class="n">getExpectedActionValue</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">pi</span><span class="p">[</span><span class="n">s</span><span class="p">],</span> <span class="n">V</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">res</span>
</code></pre></div></div>
<p>For example, here are the first three iterations of our state value function:</p>
<p><img src="/assets/img/rl_notes_2/output_8_0.png" alt="png" /></p>
<p>Note how the values ‘spread’ from the bottom right corner.</p>
<p>We repeat this process until our value function converges.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">ValueIter</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">pi</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="n">eps</span> <span class="o">=</span> <span class="mf">1e-10</span><span class="p">,</span> <span class="n">maxiter</span> <span class="o">=</span> <span class="mi">10000</span><span class="p">):</span>
    <span class="s">'''Applies value iteration if pi = None. If pi (policy) is given, it applies policy evaluation'''</span>
    <span class="n">V</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">env</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">nS</span><span class="p">);</span> <span class="n">flag_success</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">maxiter</span><span class="p">):</span>
        <span class="n">V_old</span> <span class="o">=</span> <span class="n">V</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
        <span class="n">V</span> <span class="o">=</span> <span class="n">getNextV</span><span class="p">(</span><span class="n">V</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">pi</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">V</span> <span class="o">-</span> <span class="n">V_old</span><span class="p">))</span> <span class="o"><</span> <span class="n">eps</span><span class="p">:</span>
            <span class="n">flag_success</span> <span class="o">=</span> <span class="bp">True</span>
            <span class="k">break</span>
    <span class="k">assert</span> <span class="n">flag_success</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">V</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/img/rl_notes_2/output_10_0.png" alt="png" /></p>
<p>Let’s do a check. Suppose we are on the third row, second column (value 0.64). What’s our best action? To move down, obviously. With probability 1/3 each, we move down or slip to either side. What is our return? <script type="math/tex">0.99 \cdot \frac{1}{3}(0.62 + 0.74 + 0.59) \approx 0.64</script>. It checks out.</p>
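<p>That arithmetic is easy to reproduce (the three values are read off the converged value-function plot above):</p>

```python
# Moving down from the 0.64 square: with probability 1/3 each we land on
# squares worth 0.62, 0.74 and 0.59; the future value is discounted by 0.99.
g = 0.99
expected = g * (0.62 + 0.74 + 0.59) / 3
print(round(expected, 2))  # 0.64
```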
<p>Now, let’s get our optimal policy. As a reminder, a policy is an action for every square. It’s easy to find: for every square, we pick the action that maximizes our return according to the state value function we derived.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">getPath</span><span class="p">(</span><span class="n">V</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">):</span>
    <span class="s">'''Gets the optimal path according to state-value function V.'''</span>
    <span class="n">path</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">env</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">nS</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">env</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">nS</span><span class="p">):</span>
        <span class="n">_</span><span class="p">,</span> <span class="n">path</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="o">=</span> <span class="n">getOptimalFromState</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">V</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">path</span>
</code></pre></div></div>
<p><img src="/assets/img/rl_notes_2/output_12_0.png" alt="png" /></p>
<p>(For those wondering why we move ‘left’ at the start: because if we do, the only movement we can take is ‘slipping’ down.)</p>
<h2 id="policy-iteration">Policy Iteration</h2>
<p>A second way to solve MDPs is policy iteration.</p>
<p>We start with an arbitrary policy. In this example, I’ll choose to always move right.</p>
<p>Next we do a <strong>policy evaluation</strong> step: calculating the state value function under the given policy. As a reminder, the value of a state under a policy is the expected immediate reward plus the discounted value of the state we end up in. We iteratively update our value function until it converges, just as in the previous section. The key difference: we choose our action based on the given policy (not the optimal one).</p>
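<p>Policy evaluation can be illustrated on a tiny hand-made MDP, a hypothetical two-state chain invented for this sketch (the transition format mirrors Gym’s <code>env.env.P</code>):</p>

```python
import numpy as np

# A hypothetical 2-state chain: under the fixed policy, state 0 moves to
# state 1 and collects reward 1; state 1 is terminal.
# Each entry mirrors Gym's format: [(prob, next_state, reward, done), ...]
P_pi = {0: [(1.0, 1, 1.0, True)],
        1: [(1.0, 1, 0.0, True)]}

def policy_evaluation(P_pi, g=0.99, eps=1e-10):
    '''Iterate the fixed-policy Bellman update until the values stop changing.'''
    V = np.zeros(len(P_pi))
    while True:
        V_old = V.copy()
        for s, transitions in P_pi.items():
            # expected immediate reward + discounted value of the successor
            V[s] = sum(p * (r + g * V_old[s2] * (not done))
                       for p, s2, r, done in transitions)
        if np.max(np.abs(V - V_old)) < eps:
            return V

V_pi = policy_evaluation(P_pi)  # V_pi[0] = 1.0, V_pi[1] = 0.0
```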
<p>Policy evaluation gives us a state value function. For example, here is the value function for always moving to the right:</p>
<p><img src="/assets/img/rl_notes_2/output_15_0.png" alt="png" /></p>
<p>Now we improve our policy according to this value function: in every square, we pick the action with the highest expected return. That gives us a new policy.</p>
<p><img src="/assets/img/rl_notes_2/output_17_0.png" alt="png" /></p>
<p>To summarize, our process looks like this:</p>
<p>initial policy -> policy evaluation -> policy improvement -> policy evaluation -> policy improvement -> …</p>
<p>We keep on repeating this process until convergence.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">policyIter</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">maxiter</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">):</span>
    <span class="s">'''Finds optimal policy using policy iteration.'''</span>
    <span class="c"># Start by always going to the right</span>
    <span class="n">pi</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">2</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">env</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">nS</span><span class="p">)])</span>
    <span class="n">flag_success</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">maxiter</span><span class="p">):</span>
        <span class="n">pi_old</span> <span class="o">=</span> <span class="n">pi</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
        <span class="n">V</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">ValueIter</span><span class="p">(</span><span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">pi</span><span class="p">)</span> <span class="c"># Policy evaluation step</span>
        <span class="n">pi</span> <span class="o">=</span> <span class="n">getPath</span><span class="p">(</span><span class="n">V</span><span class="p">,</span> <span class="n">env</span><span class="p">,</span> <span class="n">g</span><span class="p">)</span> <span class="c"># Policy Improvement step</span>
        <span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="nb">all</span><span class="p">(</span><span class="n">pi</span> <span class="o">==</span> <span class="n">pi_old</span><span class="p">):</span>
            <span class="n">flag_success</span> <span class="o">=</span> <span class="bp">True</span>
            <span class="k">break</span>
    <span class="k">assert</span> <span class="n">flag_success</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">pi</span><span class="p">,</span> <span class="n">V</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/img/rl_notes_2/output_20_0.png" alt="png" /></p>
<h2 id="comparing-value-and-policy-iteration">Comparing Value and Policy Iteration</h2>
<p>Do they give the same policy? Yes, they do.</p>
<p>What about runtime? In our example, policy iteration is somewhat faster: 170 ms vs 360 ms.</p>
<p>What reward would our policy get in the game? Since not all games are won, the outcome is random. On average, our reward is about 0.75, which is equivalent to reaching the target cell about 75% of the time.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">runPolicy</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">maxiter</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">,</span> <span class="n">do_render</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span> <span class="n">env_name</span> <span class="o">=</span> <span class="s">'FrozenLake-v0'</span><span class="p">):</span>
    <span class="n">temp_env</span> <span class="o">=</span> <span class="n">gym</span><span class="o">.</span><span class="n">make</span><span class="p">(</span><span class="n">env_name</span><span class="p">)</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">temp_env</span><span class="o">.</span><span class="n">reset</span><span class="p">()</span>
    <span class="n">cum_reward</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">maxiter</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">do_render</span><span class="p">:</span>
            <span class="n">temp_env</span><span class="o">.</span><span class="n">render</span><span class="p">()</span>
        <span class="n">s</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">temp_env</span><span class="o">.</span><span class="n">step</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">policy</span><span class="p">[</span><span class="n">s</span><span class="p">]))</span>
        <span class="n">cum_reward</span> <span class="o">+=</span> <span class="n">r</span>
        <span class="k">if</span> <span class="n">done</span><span class="p">:</span>
            <span class="k">break</span>
    <span class="k">if</span> <span class="n">do_render</span><span class="p">:</span>
        <span class="n">temp_env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">cum_reward</span>

<span class="k">def</span> <span class="nf">simAvgReward</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">iterations</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">,</span> <span class="n">env_name</span> <span class="o">=</span> <span class="s">'FrozenLake-v0'</span><span class="p">):</span>
    <span class="n">totalRewards</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">iterations</span><span class="p">):</span>
        <span class="n">totalRewards</span> <span class="o">+=</span> <span class="n">runPolicy</span><span class="p">(</span><span class="n">policy</span><span class="p">,</span> <span class="n">do_render</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">env_name</span><span class="o">=</span><span class="n">env_name</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s">'Average Return from {iterations} simulations is {totalRewards / iterations}'</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">None</span>

<span class="n">simAvgReward</span><span class="p">(</span><span class="n">path_value_iter</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Average Return from 1000 simulations is 0.754
</code></pre></div></div>
<p>RL Notes 1: Basics (2018-09-19, https://krisztiankovacs.com/rl-notes-1)</p>
<p>These notes are based on <a href="https://www.youtube.com/watch?v=fRmZck1Dakc">Siraj Raval’s reinforcement learning course</a>. This week was about the basic concepts of Markov decision processes (MDPs).</p>
<p>First, what is reinforcement learning (RL)? It is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward (from <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Wikipedia</a>). For example, an AI might learn to play chess by playing many games, and updating its strategy based on its experiences.</p>
<h2 id="basic-concepts">Basic Concepts</h2>
<p>A <strong>state</strong> is a possible condition the agent can be in within the given environment.</p>
<p><strong>Markov decision processes (MDPs)</strong> refer to settings where the current state <em>s</em> contains all the available information about the future. In other words, given <em>s</em>, the future is conditionally independent of the past.</p>
<p>For example, think of chess. How we arrived at a particular board position doesn’t matter; the current board position contains all we need to know to continue the game.</p>
<p><strong>Actions</strong> are the decisions the agent can take. We can denote the set of actions in the state <em>s</em> as <em>a(s)</em>.</p>
<p><strong>Rewards</strong>, what the agent is trying to obtain, are real-valued functions. They can be functions of a given state, an action, or a state-action pair.</p>
<h2 id="toy-example-gridworld">Toy Example: Gridworld</h2>
<p><img src="/assets/img/rl_mario" alt="" width="500" /></p>
<p>The states are the squares that Mario can be on. The actions are moving up, down, left, and right. There are two states with rewards: the top right with +1, and the one below it with -1. To make the game end, we can regard these two as terminal states: the game ends if you land on either of these squares.</p>
<p>The example has the Markov property: how you end up on a square doesn’t influence the future of the game.</p>
<h2 id="rewards-and-policies">Rewards and Policies</h2>
<p>As the definition above mentioned, the agent will be trying to maximize ‘some notion of cumulative reward’. In some settings, it makes sense to maximize the cumulative reward, but generally we also add discounting, trying to maximize the discounted cumulative future reward:</p>
<p><script type="math/tex">R_t = \sum_{i=t}^{\infty}\gamma^{i-t} r_i</script>,</p>
<p>where <script type="math/tex">\gamma</script> is the discount factor (say, 0.9), and <script type="math/tex">r_i</script> is the reward at a given time step.</p>
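<p>The discounted return is a one-liner over a reward sequence; a minimal sketch:</p>

```python
def discounted_return(rewards, gamma=0.9):
    '''R_t = sum over i of gamma**i * r_(t+i) for the rewards from time t on.'''
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# A reward of 1 arriving three steps in the future is worth gamma**3 today.
R = discounted_return([0, 0, 0, 1])  # 0.9**3, roughly 0.729
```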
<p>A <strong>policy</strong> is a way of acting. Denoted as <script type="math/tex">\pi(s)</script>, it maps a state to an action. Think of it as a strategy. (This definition applies to deterministic policies, but can easily be extended to stochastic policies as well.)</p>
<p>The optimal policy, <script type="math/tex">\pi^*</script>, is simply the one that maximizes <em>R</em>.</p>
<h2 id="value-functions">Value Functions</h2>
<p>So how do we find the optimal policy? Value functions help us do that.</p>
<p>There are two types of value functions: <em>state value functions</em> and <em>action value functions</em>. Informally, these give the ‘value’ of being in a given state, or of doing a given action.</p>
<p>Mathematically, the state value function is</p>
<script type="math/tex; mode=display">V^{\pi}(s) = E[R_t | s_t = s]</script>
<p>the expected (discounted cumulative future) reward from being in <em>s</em> under policy <script type="math/tex">\pi</script>.</p>
<p>Similarly, the action value function is</p>
<script type="math/tex; mode=display">Q^{\pi}(a, s) = E[R_t | s_t = s, a_t = a]</script>
<p>In other words, the expected reward from being in <em>s</em>, doing <em>a</em>, under policy <script type="math/tex">\pi</script>. Note that both functions are dependent on the policy.</p>
<h2 id="a-simple-bellman-equation">A Simple Bellman Equation</h2>
<p>So how do we compute these value functions? That’s what the Bellman equations are for. They allow us to express the value of a state (or action) in terms of the values of other states (or actions), allowing us to iteratively solve for the optimal policy.</p>
<p>In a deterministic environment, the Bellman equation for the state value function is</p>
<script type="math/tex; mode=display">V^*(s) = \max_a \left( r(s, a, s') + \gamma V^*(s') \right)</script>
<p>where <script type="math/tex">V^*</script> is the value function under optimal policy, <script type="math/tex">r(s, a, s')</script> is the reward for taking action <em>a</em> in state <em>s</em> and ending up in <em>s’</em>, and <script type="math/tex">V^*(s')</script> is the value of state <em>s’</em>.</p>
<p>In other words, we take the action that maximizes our reward plus the discounted value of the state we end up in.</p>
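<p>Here is that update applied to a hypothetical deterministic 4-state corridor (made up for this sketch): from each square we may stay put or step right, and entering the last square pays +1.</p>

```python
# Deterministic Bellman optimality update on a 4-state corridor (illustrative).
gamma = 0.9
n = 4

def reward(s, s2):
    '''+1 for entering the final state, 0 otherwise.'''
    return 1.0 if s2 == n - 1 and s != n - 1 else 0.0

V = [0.0] * n
for _ in range(50):  # repeat the backup until the values settle
    V = [max(reward(s, s2) + gamma * V[s2] for s2 in {s, min(s + 1, n - 1)})
         for s in range(n)]
# V converges to about [0.81, 0.9, 1.0, 0.0]:
# each extra step away from the goal costs one factor of gamma.
```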
<h2 id="gridworld-again">Gridworld Again</h2>
<p>Let’s compute the (optimal) state value function for our toy example. Working iteratively backward (using <script type="math/tex">\gamma = 0.9</script>), we can fill in our grid:</p>
<p><img src="/assets/img/rl_mario_sol" alt="" width="500" /></p>
<p>Obviously, this kind of brute-force approach won’t work in more complicated situations where we can’t go over all state-action combinations. It’s still a useful way to organize our thinking.</p>
<h2 id="resources">Resources</h2>
<p>Besides the lecture material, I relied on <a href="https://joshgreaves.com/reinforcement-learning/understanding-rl-the-bellman-equations/">the following guide</a> to the Bellman equations and the <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Wikipedia</a> entry on RL.</p>