This project was the coursework for the Reinforcement Learning (RL) course I took as part of my MSc in Artificial Intelligence at the University of Edinburgh. I implemented eight reinforcement learning algorithms across a mix of simple stochastic environments and the (now-deprecated) LARG/HFO half-field offense soccer environment.

For the following algorithms, I followed pseudocode from Sutton and Barto (2018):

- **Dynamic Programming**: Value iteration (page 83).
- **Monte Carlo**: On-policy first-visit MC control for \(\epsilon\)-soft policies (page 101).
- **SARSA**: On-policy temporal-difference control (page 130).
- **Q-Learning**: Off-policy temporal-difference control (page 131).
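Since I can't share the coursework code, here is a generic sketch of what the tabular flavor of these methods looks like: a minimal Q-learning loop (following the Sutton and Barto pseudocode) on a toy chain MDP. The environment, function name, and hyperparameters are all illustrative choices of mine, not the coursework's.

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.95,
                     eps=0.1, seed=0):
    """Tabular Q-learning on a toy chain MDP: states 0..n-1, actions
    0 (left) / 1 (right); reaching the right end yields reward 1."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:            # right end is terminal
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # off-policy update: bootstrap off the greedy next action
            target = r if s2 == n_states - 1 else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

Q = q_learning_chain()
# greedy policy: expect "right" (1) from every non-terminal state
print([0 if q[0] > q[1] else 1 for q in Q[:-1]])
```

SARSA differs only in the target: it bootstraps off the action actually taken next rather than the greedy maximum.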

The following algorithm was the most challenging, as it involved training a single network using multiple parallel agents, each playing in its own copy of the HFO environment:

**Deep Q-Learning**: Asynchronous 1-step Q-learning with function approximation, from Mnih et al. (2016); this also involved implementing Hogwild! lock-free parallelized stochastic gradient descent from Recht et al. (2011).
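The core idea of Hogwild! is that worker threads update one shared parameter vector without any locking; Recht et al. show this is safe when updates are sparse and mostly non-conflicting. The sketch below is my own minimal illustration of that pattern on a tiny linear-regression problem (not the coursework's network or environment):

```python
import threading, random

def hogwild_sgd(data, dim=2, n_threads=4, steps=2000, lr=0.01):
    """Hogwild!-style SGD sketch: every worker thread reads and writes
    the same parameter list w with no lock whatsoever."""
    w = [0.0] * dim                      # shared parameters, no lock
    def worker(seed):
        rng = random.Random(seed)
        for _ in range(steps):
            x, y = rng.choice(data)      # sample one training example
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            for i in range(dim):         # in-place, lock-free update
                w[i] -= lr * err * x[i]
    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return w

# noise-free samples of y = 2*x0 + 3*x1, so SGD can recover (2, 3)
data = [((x0, x1), 2 * x0 + 3 * x1)
        for x0 in (0.0, 0.5, 1.0) for x1 in (0.0, 0.5, 1.0)]
w = hogwild_sgd(data)
```

In CPython the GIL serializes the individual bytecode operations, so this sketch only demonstrates the programming pattern; the asynchronous-methods paper applies the same idea to gradient updates on a shared Q-network.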

Finally, these algorithms were in a two-agent setting:

- **Independent Q-Learning**: The same algorithm as Q-learning, but with a joint state representation for the two agents.
- **Joint-Action Learning**: Based on Table 4 from Bowling and Veloso (2001a).
- **WoLF-PHC**: Based on Tables 1 and 2 from Bowling and Veloso (2001b).
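As a rough illustration of WoLF-PHC's "win or learn fast" idea, here is a single-agent, stateless update step in the spirit of Bowling and Veloso's tables, followed by self-play on matching pennies. All function and variable names, hyperparameters, and the game itself are my own illustrative choices:

```python
import random

def wolf_phc_update(Q, pi, pi_avg, count, a, r, alpha=0.1,
                    delta_w=0.01, delta_l=0.02):
    """One WoLF-PHC step for a stateless agent. delta_w < delta_l is
    the WoLF rule: adapt slowly when winning, fast when losing."""
    n = len(Q)
    # 1. Q-learning update (stateless game, so no bootstrap term)
    Q[a] += alpha * (r - Q[a])
    # 2. incremental update of the average policy
    count[0] += 1
    for i in range(n):
        pi_avg[i] += (pi[i] - pi_avg[i]) / count[0]
    # 3. "winning" if the current policy beats the average policy under Q
    winning = (sum(p * q for p, q in zip(pi, Q))
               > sum(p * q for p, q in zip(pi_avg, Q)))
    delta = delta_w if winning else delta_l
    # 4. hill-climb toward the greedy action, staying on the simplex
    best = max(range(n), key=lambda i: Q[i])
    for i in range(n):
        if i != best:
            step = min(pi[i], delta / (n - 1))
            pi[i] -= step
            pi[best] += step

# self-play on matching pennies: agent 0 wants to match, agent 1 to mismatch
payoff = [[1, -1], [-1, 1]]  # agent 0's reward
rng = random.Random(0)
agents = [dict(Q=[0.0, 0.0], pi=[0.5, 0.5], pi_avg=[0.5, 0.5], count=[0])
          for _ in range(2)]
for _ in range(5000):
    a0 = 0 if rng.random() < agents[0]["pi"][0] else 1
    a1 = 0 if rng.random() < agents[1]["pi"][0] else 1
    r0 = payoff[a0][a1]
    wolf_phc_update(agents[0]["Q"], agents[0]["pi"], agents[0]["pi_avg"],
                    agents[0]["count"], a0, r0)
    wolf_phc_update(agents[1]["Q"], agents[1]["pi"], agents[1]["pi_avg"],
                    agents[1]["count"], a1, -r0)
```

In the full algorithm these tables are indexed by state; the variable learning rate is what lets WoLF-PHC converge toward mixed equilibria where plain PHC oscillates.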

I built my implementations on top of this provided base code, but I can't share my own code. Most of my implementations reached the highest performance threshold defined by the coursework markers.