Different implementations of differential learning rate (discriminative fine-tuning)

alwc · May 16, 2018, 9:42am

In the course we pass an array of learning rates to fine-tune layers at different depths, but Howard and Ruder (2018) says:

[Screenshot of a passage from the paper; image not available]

Is this newly employed method preferable to the former? Also, \eta^{l} is the learning rate we found by using lr_find(), right?
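
To make the comparison concrete, here is a minimal sketch of the two approaches as I understand them. It assumes the paper's heuristic is to pick the last layer's learning rate first and then divide by 2.6 for each lower layer, i.e. \eta^{l-1} = \eta^{l} / 2.6. The helper name `discriminative_lrs`, the three layer groups, and the example values are my own illustration, not anything from the course library.

```python
def discriminative_lrs(last_layer_lr, n_groups, factor=2.6):
    """Return one learning rate per layer group, lowest layer first.

    Hypothetical helper: each lower group gets the rate of the group
    above it divided by `factor` (2.6 is the value I read in the paper).
    """
    return [last_layer_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]


# Course style: an array of rates chosen by hand (illustrative values).
course_lrs = [1e-4, 1e-3, 1e-2]

# Paper style: derive the lower rates from the last layer's rate.
paper_lrs = discriminative_lrs(last_layer_lr=1e-2, n_groups=3)

for group, (a, b) in enumerate(zip(course_lrs, paper_lrs)):
    print(f"group {group}: hand-picked {a:.2e}, divide-by-2.6 {b:.2e}")
```

Either list could be passed wherever the course code expects the array of learning rates, so the difference seems to be only in how the per-layer values are chosen.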