Different implementations of differential learning rate (discriminative fine-tuning)


(Alex Lee) #1

In the course we pass an array of learning rates to fine-tune layer groups at different depths, but Howard and Ruder (2018) says:

[Image: screenshot of the relevant passage from Howard and Ruder (2018)]

Is this newer method preferable to the former? Also, \eta^{l} is the learning rate we find with lr_find(), right?
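
To make the comparison concrete, here is a rough sketch in plain Python of how I understand the two schemes. It assumes the paper's rule is \eta^{l-1} = \eta^{l} / 2.6, with the last layer's rate chosen first. The helper name `paper_rates`, the number of layer groups, and all numeric values are made up for illustration and are not taken from the course library.

```python
def paper_rates(top_lr, n_groups, factor=2.6):
    """Return one learning rate per layer group, lowest group first.

    Assumes eta^{l-1} = eta^{l} / factor, with top_lr used for the last group.
    """
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]


# Course style: a hand-picked array, one rate per layer group (illustrative values).
course_lrs = [1e-4, 1e-3, 1e-2]

# Paper style: only the last group's rate is chosen; the others follow from it.
paper_lrs = paper_rates(top_lr=1e-2, n_groups=3)
print(paper_lrs)
```

With a fixed factor, only one rate has to be chosen, whereas the array approach leaves every group's rate as a separate choice.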