Sam Stites

Thoughts about Hyperparameters in DL

There are really two classes of these: optimizer hyperparameters (learning rate, number of epochs, minibatch size, choice of optimizer), and model hyperparameters (number and size of layers).

From what I’ve seen, optimizer hyperparameters are the ones that algorithmic solutions will one day remove from your model. Learning rate finders and cyclical learning rates (Cyclical Learning Rates for Training Neural Networks, Smith), adaptive batch sizes (AdaBatch, Devarakonda et al.), and convergence properties (really just borrowing from the reinforcement learning literature, but Super-Convergence, Smith and Topin, seems to include a proof for convergence) are lines of research with varying success at removing these. Optimizer hyperparameters can also be integrated directly into the optimizer itself, as can be seen with Adam and Adagrad for learning rates. On the other hand, neural architecture search (NAS)-based techniques are becoming more popular, with packages like AutoKeras, for eliminating model hyperparameters.
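To make the cyclical learning rate idea concrete, here is a minimal sketch of the triangular schedule as I understand it from Smith's paper. The function name and the default values for `base_lr`, `max_lr`, and `step_size` are my own illustrative choices, not recommendations from the paper or from any library.

```python
import math


def triangular_lr(iteration, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Triangular cyclical learning rate (sketch; defaults are assumptions).

    The rate climbs linearly from base_lr to max_lr over step_size
    iterations, then falls back to base_lr over the next step_size.
    """
    # Which cycle we are in (1-indexed); one cycle is 2 * step_size iterations.
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to [0, 1].
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


if __name__ == "__main__":
    # Print the rate at a few points across one and a half cycles.
    for it in (0, 1000, 2000, 3000, 4000, 5000):
        print(it, triangular_lr(it))
```

This doesn't remove the hyperparameter so much as trade one learning rate for a range plus a step size, but the range is usually easier to pick (for example, with a learning rate finder).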

Other notes:

Also, also, also, also, also, also.