Explanation for use of `debias_loss`

(Ben Johnson) #1

Hi All

In the fit function in model.py, there are a few lines:

avg_loss = avg_loss * avg_mom + loss * (1-avg_mom)
debias_loss = avg_loss / (1 - avg_mom**batch_num)

Does anyone know the motivation for this transformation? It computes a debiased EMA of the loss (maybe inspired by this?), but I don’t understand why you’d want to do that (especially by default).
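To make the mechanics concrete, here’s a quick standalone sketch with made-up loss values (not fastai code): the EMA starts at 0, so its early values are dragged toward 0, and dividing by (1 - avg_mom**batch_num) undoes exactly that initialization bias.

avg_mom = 0.98
avg_loss = 0.0
losses = [2.5, 2.4, 2.6, 2.3, 2.2]  # made-up per-batch losses

for batch_num, loss in enumerate(losses, start=1):
    # same update as in fit(): exponential moving average of the loss, initialized at 0
    avg_loss = avg_loss * avg_mom + loss * (1 - avg_mom)
    # correct for the zero initialization
    debias_loss = avg_loss / (1 - avg_mom ** batch_num)
    print(f"batch {batch_num}: raw EMA {avg_loss:.3f}, debiased {debias_loss:.3f}")

On the first batch the raw EMA is only 0.05 even though the loss was 2.5, while the debiased value is 2.5 – so I can see what the line does, just not why it should be the default.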

Specifically, I’m wondering whether this is appropriate to use with the lr_find function. These are lr_find plots using the EMA (as above) and the raw loss, respectively:


The raw loss is noisier (obviously), but the location of the minimum is clearly shifted – about 2.0 for the EMA vs. about 0.2 for the raw loss. I saw in the videos that @jeremy suggested finding the minimum and then going back about an order of magnitude – is this maybe because the EMA “delays” the minimum?
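Here’s a toy reproduction of that shift (synthetic loss curve and values I made up, not the actual plots above). With avg_mom = 0.98 the EMA effectively averages over roughly the last 1/(1 - avg_mom) ≈ 50 batches, and since lr_find raises the learning rate every batch, that lag pushes the apparent minimum toward higher learning rates:

import numpy as np

lrs = np.logspace(-4, 1, 200)             # learning rates swept by lr_find
raw = (np.log10(lrs) + 0.7) ** 2          # toy loss curve with its minimum near lr = 0.2

avg_mom, avg_loss, smoothed = 0.98, 0.0, []
for i, loss in enumerate(raw, start=1):
    avg_loss = avg_loss * avg_mom + loss * (1 - avg_mom)
    smoothed.append(avg_loss / (1 - avg_mom ** i))  # same debiased EMA as in fit()

print("raw loss minimum at lr =", lrs[np.argmin(raw)])
print("smoothed loss minimum at lr =", lrs[np.argmin(smoothed)])

With these toy numbers the smoothed minimum lands roughly an order of magnitude above the raw one, which looks a lot like the shift in the plots above.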

(Matthew Kleinsmith) #2

I’m also interested in the meaning of debias_loss. The TensorFlow fit method you linked to mentioned the Adam paper, https://arxiv.org/abs/1412.6980. In it they discuss “bias correction” (“debias”?). I’ll look into it more when I have time.
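From a quick skim, the correction they describe looks exactly like the debias_loss line. A bare-bones sketch of Adam’s first-moment estimate (my own made-up gradients, not from the paper):

beta1, m = 0.9, 0.0
grads = [0.5, 0.4, 0.6]              # made-up per-step gradients
for t, g in enumerate(grads, start=1):
    m = beta1 * m + (1 - beta1) * g  # biased first-moment estimate (starts at 0)
    m_hat = m / (1 - beta1 ** t)     # bias-corrected estimate, as in the paper
    print(t, round(m, 4), round(m_hat, 4))

Swap m for avg_loss and beta1 for avg_mom and you get the two lines from fit, which at least explains the “debias” name.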

(Matthew Kleinsmith) #3

Here’s some discussion:

(Ben Johnson) #4

Yeah – Adam uses debiased estimates of its moving averages, but I’ve never seen that applied to the reported per-batch loss the way the fastai library does it. After thinking about it more, I don’t think there’s any reason to do this beyond a preference for a smoothed loss.