Welcome to the new Community Conundrum! This week you’ll train your first deep reinforcement learning agent to jump over walls.
If you don’t know anything about reinforcement learning, don’t worry: this is a beginner-friendly conundrum. And good news, you won’t need a GPU.
Our goal is to train our agent (the blue cube) to reach the green tile.
However, there are three situations:
No Wall situation
Small Wall situation
Big Wall situation
We’ll learn two different policies (behaviors) depending on the height of the wall.
The reward system is:
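In the standard ML-Agents Wall Jump environment (take this as a reference; our conundrum setup should match it):
+1 if the agent reaches the green tile.
-1 if the agent falls off the platform.
-0.0005 existence penalty at every step, to encourage the agent to reach the goal quickly.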
In terms of observations, we don’t use normal vision (frames) but 14 raycasts that can each detect 4 possible object types. Think of raycasts as lasers that detect any object they pass through.
We also use the global position of the agent and whether or not it is grounded.
Source: Unity ML-Agents Documentation
The action space is discrete with 4 branches:
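In the standard ML-Agents Wall Jump environment, the four branches are (again, a reference that our setup should mirror):
Forward motion: forward, backward, or no action.
Rotation: rotate left, rotate right, or no action.
Side motion: strafe left, strafe right, or no action.
Jump: jump or no action.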
Our goal is to hit the benchmark with a mean reward of 0.8.
We’ll use Deep Reinforcement Learning to solve this problem.
What is Deep Reinforcement Learning? This article covers what you need to know to tackle this conundrum: https://academy.dataiku.com/reinforcement-learning-open/513305
More precisely, we’ll use a deep reinforcement learning algorithm called PPO (Proximal Policy Optimization), and your goal will be to tune its hyperparameters to build a smart agent.
PPO is an Actor-Critic algorithm. Actor-Critic is a clever method: imagine you play a video game with a friend who gives you feedback. You’re the Actor and your friend is the Critic.
At the beginning, you don’t know how to play, so you try some actions randomly. Your friend, the Critic, observes your actions and provides feedback.
Learning from this feedback, you’ll update your policy and get better at playing the game.
On the other hand, your friend (the Critic) will also update their own way of providing feedback so it can be better next time.
As we can see, the idea of Actor-Critic is to have two neural networks, so we estimate both:
ACTOR: a policy function, which controls how our agent acts.
CRITIC: a value function, which measures how good those actions are.
Both run in parallel.
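To make this concrete, here is a minimal sketch of an actor-critic pair in PyTorch. This is an illustration only: ML-Agents builds and trains these networks for you, and the sizes below are placeholders, not the real Wall Jump dimensions.

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    # A policy head (Actor) and a value head (Critic) sharing one body.
    def __init__(self, obs_size, num_actions, hidden_units=256, num_layers=2):
        super().__init__()
        # Shared body: num_layers hidden layers of hidden_units each,
        # mirroring the two hyperparameters discussed below.
        layers, in_size = [], obs_size
        for _ in range(num_layers):
            layers += [nn.Linear(in_size, hidden_units), nn.ReLU()]
            in_size = hidden_units
        self.body = nn.Sequential(*layers)
        # Actor: logits over discrete actions (ML-Agents actually keeps one
        # set of logits per action branch; we flatten that detail here).
        self.actor = nn.Linear(hidden_units, num_actions)
        # Critic: a single estimate of how good the current state is.
        self.critic = nn.Linear(hidden_units, 1)

    def forward(self, obs):
        h = self.body(obs)
        return self.actor(h), self.critic(h)

# Placeholder sizes: 74 observation values, 11 = 3 + 3 + 3 + 2 actions.
model = ActorCritic(obs_size=74, num_actions=11)
action_logits, state_value = model(torch.randn(1, 74))

The design point to notice is that both heads run in parallel off the same observations: the Actor is updated using the Critic’s feedback, and the Critic is updated from the rewards actually observed.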
Our config file looks like this:
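The exact values are for you to tune, but a representative ML-Agents PPO config looks like this (a sketch only: your conundrum_config.yaml may use a different behavior name and different defaults):

WallJump:
    trainer: ppo
    max_steps: 5.0e5      # total environment steps before training ends
    batch_size: 128       # experiences per gradient-descent iteration
    buffer_size: 2048     # experiences collected before each policy update
    hidden_units: 256     # units in each hidden layer
    num_layers: 2         # hidden layers after the observation input
    gamma: 0.99           # discount factor for future rewards
    learning_rate: 3.0e-4
    time_horizon: 128
    normalize: false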
For this exercise you’ll have to think about:
max_steps: the total number of steps (observations collected and actions taken) that must be taken in the environment before the training process ends.
Hint: we used 300k training steps to reach the 0.8 baseline.
batch_size: the number of experiences (observation collected, action taken, reward, and next state) fed into each iteration of gradient descent.
Hint: the typical range is 32 to 512, always a power of two (32, 64, 128, 256, 512).
hidden_units: the number of units in the hidden layers of the neural network.
This number should grow when the correct action is a complex interaction between the observation variables.
Hint: the typical range is 64 to 512.
num_layers: the number of hidden layers in the neural network, i.e. how many hidden layers are present after the observation input (or after the CNN encoding of a visual observation).
For simple problems, fewer layers are likely to train faster and more efficiently. More layers may be necessary for more complex control problems.
Hint: the typical range is 2 to 4.
gamma: the discount factor for future rewards coming from the environment.
This can be thought of as how far into the future the agent should care about possible rewards.
When the agent should act in the present to prepare for rewards in the distant future, this value should be large; when rewards are more immediate, it can be smaller. For example, with gamma = 0.99, a reward received 100 steps from now is worth 0.99^100 ≈ 0.37 of an immediate one.
Typical range: 0.8 to 0.995.
You need to modify the config file values based on your hypotheses. Remember that the best way to learn is to be active by experimenting, so make some hypotheses and verify them.
You’re now ready to launch the training. Type in the terminal:
On Windows:
mlagents-learn ./config/conundrum_config.yaml --env=./WINDOWS/train/WallJump --run-id=run --train
On Mac:
mlagents-learn ./config/conundrum_config.yaml --env=./MAC/train/WallJump --run-id=run --train
This will launch the game so you can watch your agent perform; in the terminal you can follow the training info.
When the mean reward reaches 0.8 you can stop the training with Ctrl+C; ML-Agents will save the trained model to ./models.
That’s all for today! You’ve just trained an agent that learns to jump over walls. Awesome!
If you would like to share your answer, please upload your saved models and your config file here!
I hope you enjoyed this introductory deep reinforcement learning conundrum. If you want to dive deeper into Reinforcement Learning, you can check out these two articles:
Keep learning, stay awesome!