I recall Jeremy saying that it might be useful to have a scheduler for batch size, dropout, etc. as well. I was wondering whether there were any experiments that showed intuitively (similar to how Jeremy showed that batchnorm made the loss smoother) why the LR finder and schedulers (such as one cycle) work better, or was it all through empirical tests? I'd be curious to run the same tests for other hyperparameters of the network.
The papers above have lots of experimental results.
I went to a public discussion between Jeremy and Leslie Smith. Leslie Smith is a very experiment-driven researcher, so generally speaking most of his work comes out of experiments. I believe he said something along the lines of: run experiments first, then come up with a theoretical explanation for the paper after getting results.
It sucks that there's no way to get an intuitive understanding of whether something might help improve training NNs, like what Jeremy did when he explored the mean and sd of the NN's outputs to improve weight initialization. I suppose going the experimentation route is the best (albeit most time-consuming) way to go. Thanks for your help!
In general, I believe a lot of papers look at the standard deviation of a neural net, so that will actually carry you far. Heatmaps from part 1 are also used. Still, there is really no replacement for lots of experiments.
When you talk about the standard deviation of the NN, do you mean running an experiment multiple times and looking at the standard deviation of the results (i.e. std dev of accuracy on CIFAR-10)?
Sorry, standard deviation of the weights.
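For what it's worth, here is a rough sketch of what that kind of check can look like. It is plain standard-library Python rather than PyTorch or fastai, and the helper names (`make_layer`, `forward`, `layer_stats`) and the sizes (width 64, depth 5) are made up for illustration. It builds a small stack of linear + ReLU layers and prints the std of each layer's weights alongside the std of its outputs, once with unit-variance weights and once with Kaiming-style scaling (`sqrt(2 / fan_in)`), which is my assumption about the sort of comparison the initialization lesson was making.

```python
import math
import random
import statistics


def make_layer(n_in, n_out, scale, rng):
    """Return an n_out x n_in weight matrix with entries drawn from N(0, scale^2)."""
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def forward(layer, x):
    """Apply a linear layer (no bias) followed by ReLU to the input vector x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in layer]


def layer_stats(width=64, depth=5, init="kaiming", seed=0):
    """Return a list of (weight std, activation std) pairs, one per layer."""
    rng = random.Random(seed)
    # Assumption: "kaiming" here means std = sqrt(2 / fan_in); anything else
    # falls back to unit-variance weights as a naive baseline.
    scale = math.sqrt(2.0 / width) if init == "kaiming" else 1.0
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    stats = []
    for _ in range(depth):
        layer = make_layer(width, width, scale, rng)
        x = forward(layer, x)
        weights = [w for row in layer for w in row]
        stats.append((statistics.pstdev(weights), statistics.pstdev(x)))
    return stats


for init in ("naive", "kaiming"):
    print(init)
    for i, (w_std, a_std) in enumerate(layer_stats(init=init)):
        print(f"  layer {i}: weight std {w_std:.3f}, activation std {a_std:.3g}")
```

With the naive init the activation std should grow by a large factor at every layer, while the scaled init keeps it at roughly the same order of magnitude. In a real network you would collect the same numbers from the model's actual layers during training rather than from a toy like this.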