Exploration of Neural Network Learning Algorithms and Regularization

For the first coursework of the Machine Learning Practical (MLP) course I took as part of my MSc Artificial Intelligence at the University of Edinburgh, we experimented with different ways of training a convolutional neural network to recognize handwritten digits in the EMNIST dataset.

I implemented and experimented with the following variations from the baseline:

Two to five hidden layers. I kept this constant at three for other experiments.
Different learning rules: standard stochastic gradient descent, RMSProp (slide 26 in Tieleman and Hinton (2012)), and Adam (algorithm 1 in Kingma and Ba (2015)).
Different learning rate schedules: a constant learning rate and cosine annealing with and without warm restarts (equation 7 in Loshchilov and Hutter (2017)).
Different regularization methods: L2 regularization and weight decay.
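
To make the learning rate schedules above more concrete, here is a minimal sketch of cosine annealing with and without warm restarts, following the form of equation 7 in Loshchilov and Hutter (2017). It is not the code from my coursework: the function name, parameter names, and default values are all assumptions chosen for illustration.

```python
import math


def cosine_annealing(epoch, eta_min=1e-4, eta_max=1e-2, period=10,
                     period_mult=1, warm_restarts=True):
    """Return the learning rate for a given epoch (illustrative sketch).

    eta_min, eta_max: lower and upper bounds of the learning rate.
    period: number of epochs in the first annealing cycle (T_i).
    period_mult: factor (>= 1) by which the cycle grows after each restart.
    All names and defaults here are hypothetical, not from the report.
    """
    t_cur, t_i = epoch, period
    if warm_restarts:
        # Find the position inside the current cycle; each restart
        # resets t_cur to 0 and optionally lengthens the cycle.
        while t_cur >= t_i:
            t_cur -= t_i
            t_i *= period_mult
    else:
        # No restarts: anneal once, then stay at eta_min.
        t_cur = min(t_cur, t_i)
    cosine = math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + cosine)


if __name__ == "__main__":
    for epoch in (0, 5, 9, 10, 15):
        with_restarts = cosine_annealing(epoch, warm_restarts=True)
        without = cosine_annealing(epoch, warm_restarts=False)
        print(f"epoch {epoch:2d}: restarts={with_restarts:.5f} "
              f"no restarts={without:.5f}")
```

With warm restarts the rate jumps back to `eta_max` at the start of every cycle, while without restarts it decays once and then stays at `eta_min`.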

I also experimented with different hyperparameter settings for different combinations of these variations. From my experiments, I concluded that the Adam rule worked best for a three-layer model, and that L2 regularization helped speed up learning.
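
L2 regularization and weight decay are easy to confuse, so here is a small sketch of how they differ once an adaptive rule such as Adam is used. This is an illustration under my own assumptions (the function name, the `l2` and `weight_decay` parameters, and the toy weights are hypothetical), not the implementation from my report. With L2 regularization the penalty is added to the gradient and is therefore rescaled by Adam, whereas weight decay shrinks the weights directly, in proportion to their size.

```python
import numpy as np


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=0.0, weight_decay=0.0):
    """One Adam update with either an L2 penalty or decoupled weight decay.

    Illustrative sketch only; parameter names are assumptions.
    """
    # L2 regularization: the penalty gradient l2 * w joins the loss
    # gradient, so it passes through Adam's moment estimates.
    g = grad + l2 * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    adaptive_update = m_hat / (np.sqrt(v_hat) + eps)
    # Weight decay: applied to the weights directly, outside the
    # adaptive rescaling.
    w_new = w - lr * (adaptive_update + weight_decay * w)
    return w_new, m, v


if __name__ == "__main__":
    w = np.array([1.0, 10.0])
    zero = np.zeros_like(w)  # zero loss gradient isolates the regularizer
    w_l2, _, _ = adam_step(w, zero, zero, zero, t=1, l2=0.1)
    w_wd, _, _ = adam_step(w, zero, zero, zero, t=1, weight_decay=0.1)
    print("shrinkage with L2:          ", w - w_l2)
    print("shrinkage with weight decay:", w - w_wd)
```

In this toy case the L2 penalty shrinks both weights by roughly the same absolute amount, because Adam normalizes the gradient magnitude, while weight decay shrinks the larger weight ten times more.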

For more details, including training curves and results tables, see my full report below (or download it as PDF).

One thing I learned from this coursework is that I should have plotted curves for both the training and validation sets, since training set curves alone do not say much about whether the model is still learning generalizable information. Another is that I should report only final performance on the test set, not its curves over the course of training.

Please note that my report is reproduced as-is, and does not incorporate these lessons. My next coursework for MLP does incorporate them.