I wanted to bring this recent paper to @jeremy's attention. It isn't presenting state-of-the-art results, but what I really liked about it is the strong intuition it helped me build about why SGD works, how it interacts with the learning rate and batch size, and why the fastai approach of SGD with restarts (`cycle_len`, `cycle_mult`) works so well. I think there are some graphs and concepts in here that could help explain SGD to future students of the course.
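For anyone who hasn't played with restarts yet, here's a minimal sketch of the idea using PyTorch's built-in `CosineAnnealingWarmRestarts` scheduler rather than the fastai internals; its `T_0` and `T_mult` arguments play roughly the same roles as `cycle_len` and `cycle_mult` (the model and numbers below are just placeholders):

```python
# Minimal sketch of SGD with warm restarts using PyTorch's built-in scheduler.
# T_0 ~ cycle_len (epochs per initial cycle), T_mult ~ cycle_mult (cycle growth).
import torch

model = torch.nn.Linear(10, 2)  # stand-in model, for illustration only
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Anneal the LR from 0.1 toward 0 within each cycle, then "restart" it back up.
# Cycles last 1, 2, 4, ... epochs because T_mult=2 doubles each cycle length.
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=1, T_mult=2)

epochs, steps_per_epoch = 7, 100
for epoch in range(epochs):
    for step in range(steps_per_epoch):
        opt.step()  # forward/backward would precede this in real training
        sched.step(epoch + step / steps_per_epoch)  # supports fractional epochs
```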
Two example concepts from the paper that really helped me (I've added a toy sketch of the first one below the quotes):
“This suggests that the noise from a small mini-batch size facilitates exploration that may lead to better minima and that this is hard to achieve by changing the learning rate.”
"On the other hand, we find that the learning rate controls the height from the valley floor at which the optimization oscillates along the valley walls which is important for avoiding barriers along SGD’s path. "