A Walk with SGD: Yoshua Bengio et al. Paper That Helps Explain Why/How SGD Works Well for DL

(Will) #1

A Walk with SGD

I wanted to bring this recent paper to @jeremy's attention. It doesn’t explain state-of-the-art results, but what I really liked was the strong intuition it helped me build about why SGD works, how it interacts with the learning rate (LR) and batch size, and why the fastai approach of SGD with restarts (Cycle Len, Cycle Mult) works so well. I think some of the graphs and concepts in here could help explain SGD to future students of the course.
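For anyone who hasn’t seen the restart schedule written out, here is a minimal sketch of what I mean by SGD with restarts. It assumes cosine annealing between restarts (as in the SGDR paper). The function name `sgdr_learning_rates` and its arguments are my own illustration, not the fastai API; `cycle_len` and `cycle_mult` just mirror the Cycle Len and Cycle Mult settings mentioned above.

```python
import math


def sgdr_learning_rates(lr_max, lr_min, cycle_len, cycle_mult, n_cycles, steps_per_epoch):
    """Return one learning rate per training step (hypothetical helper).

    Each cycle anneals from lr_max towards lr_min along a half cosine,
    then restarts at lr_max. The first cycle lasts cycle_len epochs and
    every later cycle is cycle_mult times longer than the previous one.
    """
    rates = []
    epochs_in_cycle = cycle_len
    for _ in range(n_cycles):
        total_steps = epochs_in_cycle * steps_per_epoch
        for step in range(total_steps):
            progress = step / total_steps  # 0.0 right after a restart, close to 1.0 at the end
            lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
            rates.append(lr)
        epochs_in_cycle *= cycle_mult  # lengthen the next cycle
    return rates


# Illustrative values only: 3 cycles of 1, 2 and 4 epochs, 10 steps per epoch.
schedule = sgdr_learning_rates(
    lr_max=0.1, lr_min=0.0, cycle_len=1, cycle_mult=2, n_cycles=3, steps_per_epoch=10
)
```

With these made-up numbers the learning rate jumps back to `lr_max` at steps 10 and 30, which is the "restart". In between, it decays slowly, so the model gets progressively longer stretches at low learning rates.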

Two example concepts from the paper that really helped me:

“This suggests that the noise from a small mini-batch size facilitates exploration that may lead to better minima and that this is hard to achieve by changing the learning rate.”

“On the other hand, we find that the learning rate controls the height from the valley floor at which the optimization oscillates along the valley walls which is important for avoiding barriers along SGD’s path.”

(Jason McGhee) #2

Just a friendly note, and please correct me if I’m speaking out of turn here, but I believe Jeremy has asked not to be tagged unless you have a pressing question specifically for him.

He’s very good at replying, so I’m sure he’d see this post without the tag.

(Will) #3

Not out of turn at all. I hadn’t seen him request that, but it makes perfect sense. It’s easy to get swamped by notifications on the internet. My apologies, Jeremy. @jsonm, let me know if you think it’s better to remove the tag or leave it since it’s already done.