This project was the coursework for the Reinforcement Learning (RL) course I took as part of my MSc Artificial Intelligence at the University of Edinburgh. I implemented eight RL algorithms across a mix of simple stochastic environments and the (now deprecated) LARG/HFO half-field offense soccer environment.
For the following algorithms, I followed pseudocode from Sutton and Barto (2018):
- Dynamic Programming: Value iteration (page 83).
- Monte Carlo: On-policy first-visit MC control for \(\epsilon\)-soft policies (page 101).
- SARSA: On-policy temporal-difference control (page 130).
- Q-Learning: Off-policy temporal-difference control (page 131).
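As an illustration of the tabular methods above, here is a minimal sketch of the Q-learning update loop against a generic Gymnasium-style discrete environment. This is not the coursework base code; the function name, environment API, and hyperparameters are placeholders I chose for the example.

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular off-policy TD control; assumes discrete observation and action spaces."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            a = int(rng.integers(Q.shape[1])) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # greedy (off-policy) bootstrap target
            target = r + (0.0 if terminated else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

With Gymnasium installed, something like `q_learning(gymnasium.make("FrozenLake-v1"))` would exercise it; SARSA differs only in bootstrapping from the action actually taken next rather than the greedy one.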
The following algorithm was the most difficult, because it involved training a single network with multiple parallel agents, each playing in its own copy of the HFO environment:
- Deep Q-Learning: Asynchronous 1-step Q-learning with function approximation, from Mnih et al. (2016); this also involved implementing Hogwild! lock-free parallelized stochastic gradient descent from Recht et al. (2011).
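The core idea is that several worker processes apply gradient updates to one set of shared network parameters without locking. Below is a minimal PyTorch sketch of that structure only: the network size, hyperparameters, and random placeholder transitions are mine, each worker would really act in its own copy of the HFO environment, and the target network and \(\epsilon\)-greedy exploration from Mnih et al. (2016) are omitted for brevity.

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim

class QNet(nn.Module):
    """Small state-action value network shared by all workers."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

def worker(shared_net, obs_dim, n_actions, steps, gamma, lr, seed):
    """One asynchronous learner applying lock-free (Hogwild!-style) SGD updates
    to the shared parameters; the transition here is a random placeholder."""
    torch.manual_seed(seed)
    opt = optim.SGD(shared_net.parameters(), lr=lr)
    for _ in range(steps):
        s = torch.randn(obs_dim)                 # placeholder state
        a = int(torch.randint(n_actions, (1,)))  # placeholder action
        r = float(torch.randn(()))               # placeholder reward
        s_next = torch.randn(obs_dim)            # placeholder next state
        with torch.no_grad():
            target = r + gamma * shared_net(s_next).max()  # 1-step Q target
        loss = (shared_net(s)[a] - target) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()  # writes straight into the shared parameters, no locks

if __name__ == "__main__":
    net = QNet(obs_dim=8, n_actions=4)
    net.share_memory()  # place parameters in shared memory for all processes
    procs = [mp.Process(target=worker, args=(net, 8, 4, 200, 0.99, 1e-3, i))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```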
Finally, the following algorithms were implemented in a two-agent setting:
- Independent Q-Learning: The same algorithm as Q-learning, but with a joint state representation for the two agents.
- Joint-Action Learning: Based on Table 4 from Bowling and Veloso (2001a).
- WoLF-PHC: Based on Tables 1 and 2 from Bowling and Veloso (2001b).
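To give a flavour of the multi-agent methods, here is a sketch of a single WoLF-PHC update for one agent, written from the published description rather than from my coursework code. The function and argument names are mine, and the clip-and-renormalise step at the end is a simplification: the paper instead bounds each probability decrement so the policy stays valid.

```python
import numpy as np

def wolf_phc_step(Q, pi, avg_pi, counts, s, a, r, s_next,
                  alpha=0.1, gamma=0.95, delta_win=0.01, delta_lose=0.04):
    """One WoLF-PHC update; Q, pi, avg_pi are arrays indexed [state, action]."""
    n_actions = Q.shape[1]

    # Standard Q-learning update on the agent's own value table.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

    # Maintain a running average of the policy for this state.
    counts[s] += 1
    avg_pi[s] += (pi[s] - avg_pi[s]) / counts[s]

    # Win or Learn Fast: small step while "winning" (current policy scores
    # better than the average policy under Q), large step while "losing".
    delta = delta_win if pi[s] @ Q[s] > avg_pi[s] @ Q[s] else delta_lose

    # Policy hill climbing: move probability mass towards the greedy action.
    greedy = int(Q[s].argmax())
    for b in range(n_actions):
        if b == greedy:
            pi[s, b] += delta
        else:
            pi[s, b] -= delta / (n_actions - 1)
    # Simplified projection back onto the probability simplex.
    pi[s] = np.clip(pi[s], 1e-6, None)
    pi[s] /= pi[s].sum()
```

In this sketch `Q` and `counts` would start at zero, while `pi` and `avg_pi` would start as separate uniform policies, e.g. `np.full((n_states, n_actions), 1 / n_actions)`.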
I built my implementations on top of this provided base code, but I can't share my own code. Most of my implementations reached the highest performance threshold defined by the coursework markers.