How Reinforcement Learning concepts can help us with our New Year goals

Article - 11.01.2022

Reinforcement Learning, or RL, is a subfield of Artificial Intelligence. The goal of RL is to maximise some reward within a given environment. What this translates to, essentially, is making algorithms work out how to win at games without being told how to play, only what the score is. RL is not a new concept by any means, but it has exploded in recent years. It even entered the public consciousness back in 2016 with AlphaGo, an RL-based program which beat one of the world's strongest Go players.

When we want a new RL program to learn a game, we run into something called the Credit Assignment Problem. This is a familiar part of the human experience which had to be formalised for use in RL. To explain it, and how it fits into our New Year goals, let's go over a simple example with the cart-pole game, as excellently described in Hands-On Machine Learning by Aurélien Géron:

[GIF: the cart-pole environment, a pole balancing on a moving cart.]

As above, the environment is a simple simulation of a pole which pivots on a cart. The actions our program can take are to move the cart left or right. The reward is one point for every second the pole stays upright whilst the cart remains within some boundary. At each step, say every second, the program gets observations about the system: the angle and angular velocity of the pole, and the position and velocity of the cart. The program then takes an action, accelerating the cart to the left or right, and which action it takes is determined by its current policy.
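
The observe/act/reward loop described above can be sketched in a few lines of Python. This is not the real cart-pole physics (libraries such as Gymnasium provide the full simulation); it is a deliberately crude stand-in, with made-up constants, just to show the shape of the cycle:

```python
import math
import random

def step(state, action, dt=0.02):
    """Very simplified cart-pole dynamics: gravity tips the pole,
    pushing the cart left or right counteracts (or worsens) the tilt."""
    x, v, theta, omega = state
    force = 10.0 if action == 1 else -10.0      # action 1 = push right, 0 = push left
    omega += (9.8 * math.sin(theta) - force * math.cos(theta) * 0.05) * dt
    theta += omega * dt
    v += force * 0.1 * dt
    x += v * dt
    return (x, v, theta, omega)

def episode(policy, max_steps=200):
    """Play one game: observe, act, collect a point per surviving step."""
    state = (0.0, 0.0, random.uniform(-0.05, 0.05), 0.0)  # pole starts nearly upright
    reward = 0
    for _ in range(max_steps):
        action = policy(state)                  # the policy maps observations to an action
        state = step(state, action)
        x, _, theta, _ = state
        if abs(theta) > 0.21 or abs(x) > 2.4:   # pole fell over, or cart left the boundary
            break
        reward += 1
    return reward

def random_policy(state):
    return random.choice([0, 1])
```

A brand-new program has no better policy than `random_policy`, which is exactly where learning starts.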

At first, the program is terrible at balancing the pole. It's not like us humans, who intuitively understand physics and move the cart in the direction the pole is falling to get back underneath it. The RL program tries literally random actions at first. It does get better, though, and eventually balances the pole so well that it reaches the time limit, maximising the reward. How does it do this? Every time it plays the game, it records the action taken at every step. It runs a batch of games and analyses these to see which were most successful. It then tweaks its policy to better match those successful games. Repeat this process for a hundred or so batches and it learns how to play.
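
That run-a-batch, keep-what-worked, tweak-the-policy loop can be sketched as below. The `play` function is a stand-in for a full game: in reality it would run the cart-pole simulation with the given policy parameters and return the score, but here it simply scores parameters by closeness to a made-up optimum of (0.6, -0.3), an assumption purely for illustration:

```python
import random

def play(params):
    """Stand-in for one game. Pretend the best policy (unknown to the
    learner) is (0.6, -0.3); score is higher the closer we get to it."""
    target = (0.6, -0.3)
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def train(batches=100, batch_size=30, elite=5):
    policy = [0.0, 0.0]     # start with no idea how to play
    spread = 1.0            # how far we explore around the current policy
    for _ in range(batches):
        # play a batch of games, each with a randomly tweaked policy
        batch = [[random.gauss(p, spread) for p in policy]
                 for _ in range(batch_size)]
        batch.sort(key=play, reverse=True)
        best = batch[:elite]                    # the most successful games
        # tweak the policy to better match what the best games did
        policy = [sum(b[i] for b in best) / elite for i in range(len(policy))]
        spread *= 0.9                           # explore less as play improves
    return policy

learned = train()
```

Despite starting with purely random behaviour, a hundred batches of "keep the best, average them, repeat" homes in on a strong policy.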

This process is a lot easier said than done. In particular, if a game ends in failure, how do you work out which action at which step was to blame? It would be unfair to put all the blame on the very last action: if the pole is 99% fallen over by that step, it hardly matters whether the last action was correct or not. The outcome is really the product of a series of past actions. However, if the game failed at 00:59, it would be silly to weight the actions at 00:01 the same as those at 00:58. The problem is how to assign weights to past steps so that the program can learn from these experiences. This is known as the "Credit Assignment Problem", and we can apply it to our own New Year resolutions.
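
One standard answer is discounting: each action is credited with its own reward plus a geometrically decaying share of every reward that followed it. A minimal sketch, using a game like the one above that earns a point per second and then fails at the end (the -50 failure penalty and the 0.95 discount rate are made-up numbers for illustration):

```python
def discounted_credit(rewards, gamma=0.95):
    """Assign credit to each step: its own reward plus a discounted
    share of everything that came after it."""
    credit = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):     # work backwards from the end
        running = rewards[t] + gamma * running
        credit[t] = running
    return credit

# A 60-step game: one point per second survived, then failure at 00:59.
rewards = [1.0] * 59 + [-50.0]
credit = discounted_credit(rewards)
```

The last few actions end up with large negative credit, while the action at 00:01 stays positive, matching the intuition that the step at 00:58 deserves far more of the blame than the one at the start.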

Take an example New Year's goal of running more. When we don't go for a run as intended, it is easy to blame our actions at the last step: "I just don't have the motivation for running". This is something of a goal-killer because, unlike our program, we can't tweak our policy instantly. However, we could use the RL lens to review the observations at that last step and see that it was a cold, rainy, pre-sunrise morning. It would have taken an incredible policy to get you out of the door under those conditions, so placing the blame entirely on this step is simply unfair.

Instead, we could review the previous steps and see if their weights need tweaking. 48 hours before the (intended) run, did you check the weather forecast for the week ahead? Did you compare the forecasted dry spells against the times you could actually run? If not, maybe the weights for these steps should be increased. If you knew rain was forecast for the run, did you check your running jacket was clean 24 hours before? Was it laid out ready for the morning 12 hours before? Again, it might be that these weights need tweaking. Whilst these actions take place up to two days before the final run/don't-run step, they have a tremendous influence on the outcome. It is far easier to go for a lunchtime run when conditions outside are dry and bright!

Of course, there are many other factors which affect the success of New Year goals, and I am not going to sit here and say that Reinforcement Learning has all the answers to learning Spanish/being a better parent/quitting smoking/taking up smoking. Completely unexpected things can enter our real-world environment, and we don't get thousands of no-cost attempts. All I am saying is it might be worth evaluating the little actions many steps back, not just the final one.

Rory Morrison