From the fastai code, it seems that momentum-based SGD is used to train the AWD_LSTM language model. But AFAIK the paper recommends against using momentum and instead proposes the NT-ASGD optimiser. Is my understanding correct? If so, why is NT-ASGD not used in the fastai code?
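For context, here is a minimal Python sketch of the non-monotonic trigger that NT-ASGD uses, as I understand it from the paper: training runs plain SGD, and averaging starts once the latest validation loss is worse than the best loss seen more than `nonmono` checks ago. The function name, argument names, and the default value are hypothetical and only for illustration; they are not taken from fastai.

```python
def ntasgd_should_start_averaging(val_losses, new_val_loss, nonmono=5):
    """Return True when the non-monotonic criterion fires.

    val_losses   -- validation losses from earlier checks, oldest first
    new_val_loss -- validation loss from the current check
    nonmono      -- non-monotone interval (hypothetical default)
    """
    # The criterion needs more than `nonmono` earlier checks to compare against.
    if len(val_losses) <= nonmono:
        return False
    # Best loss among checks that are older than the last `nonmono` checks.
    best_before_window = min(val_losses[:-nonmono])
    # Fire when the current loss fails to beat that older best value.
    return new_val_loss > best_before_window


history = [5.0, 4.5, 4.2, 4.3, 4.25, 4.4]
fire = ntasgd_should_start_averaging(history, 4.3, nonmono=3)
```

Once the function returns `True`, the training loop would switch from plain SGD to averaged SGD for the remaining iterations.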