In the fit function in model.py, there are a few lines:
avg_loss = avg_loss * avg_mom + loss * (1-avg_mom)
debias_loss = avg_loss / (1 - avg_mom**batch_num)
Does anyone know the motivation for this transformation? It computes a debiased EMA of the loss (maybe inspired by this?) – but I don’t understand why you’d want to do that (especially by default).
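To make the question concrete, here is a minimal standalone sketch of what those two lines compute over a sequence of batch losses. The function name `debiased_ema`, the default `avg_mom=0.98`, and counting `batch_num` from 1 are my assumptions for illustration, not taken from model.py.

```python
def debiased_ema(losses, avg_mom=0.98):
    """Return the bias-corrected exponential moving average of `losses`.

    Assumptions (not from model.py): avg_mom defaults to 0.98 and
    batch_num is counted from 1.
    """
    avg_loss = 0.0
    smoothed = []
    for batch_num, loss in enumerate(losses, start=1):
        # Plain EMA update; starting from 0 biases early values toward 0.
        avg_loss = avg_loss * avg_mom + loss * (1 - avg_mom)
        # Dividing by (1 - avg_mom**batch_num) removes that startup bias.
        smoothed.append(avg_loss / (1 - avg_mom ** batch_num))
    return smoothed


constant = debiased_ema([2.0] * 5)
two_step = debiased_ema([5.0, 1.0])
print(constant)
print(two_step)
```

Without the division, the first smoothed value would be loss * (1 - avg_mom), i.e. far below the actual loss. With it, a constant loss stays constant and the first smoothed value equals the first raw loss, up to floating-point error.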
Specifically, I’m wondering whether this is appropriate to use with the lr_find function. These are lr_find plots using the EMA (as above) and the raw loss, respectively:
The raw loss is noisier (obviously), but the locations of the minima are clearly shifted: about 2.0 for the EMA vs. about 0.2 for the raw loss. I saw in the videos that @jeremy suggested finding the minimum and then going back about an order of magnitude. Is this maybe because the EMA “delays” the minimum?
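A quick way to check the “delay” idea is to smooth a synthetic, noise-free curve and compare where the two minima fall. This is only a sketch: the V-shaped curve, the `avg_mom=0.98` value, and the helper names are made up for illustration and are not real lr_find output.

```python
def debiased_ema(losses, avg_mom=0.98):
    # Same smoothing as the two lines quoted from the fit function;
    # avg_mom=0.98 and batch_num starting at 1 are assumptions.
    avg_loss, smoothed = 0.0, []
    for batch_num, loss in enumerate(losses, start=1):
        avg_loss = avg_loss * avg_mom + loss * (1 - avg_mom)
        smoothed.append(avg_loss / (1 - avg_mom ** batch_num))
    return smoothed


def argmin(values):
    # Index of the smallest value in the list.
    return min(range(len(values)), key=values.__getitem__)


# Synthetic loss: falls linearly to 0 at batch index 50, then rises again.
raw = [abs(i - 50) / 50 for i in range(100)]
smoothed = debiased_ema(raw)

raw_argmin = argmin(raw)
ema_argmin = argmin(smoothed)
print("raw minimum at batch", raw_argmin)
print("EMA minimum at batch", ema_argmin)
```

The smoothed curve keeps falling for as long as each new loss is below the current average, so its minimum lands some batches after the raw minimum. If the learning rate grows by a constant factor per batch (which I am assuming is how the schedule works), a delay of a fixed number of batches becomes a multiplicative shift in the learning rate at the minimum, which would match the direction of the shift in the plots.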