To_fp16: UserWarning: You have a 'loss_scale' factor that is too high

Hi,

I am currently getting a warning I have never encountered before when using the `to_fp16` function:

/home/me/anaconda3/envs/fastai-usr/lib/python3.7/site-packages/fastai/callbacks/fp16.py:97: UserWarning: You have a `loss_scale` factor that is too high, try to divide it by 2 (current value: 512).
  warn(f"You have a `loss_scale` factor that is too high, try to divide it by 2 (current value: {self.loss_scale}).")

I am using fastai v1.0.45. I created a language model learner with language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3) and called to_fp16 on it. Then, when I run lr_find, this warning prints constantly. Changing the loss_scale parameter of to_fp16 doesn’t seem to have any effect on the warning; I set it as low as 0.125 and still got it.
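For reference, this is roughly the code involved (a minimal sketch; data_lm stands for the TextLMDataBunch I built beforehand):

# Minimal sketch of the setup described above (fastai v1 API).
# `data_lm` is assumed to be an existing TextLMDataBunch.
from fastai.text import language_model_learner, AWD_LSTM

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3).to_fp16(loss_scale=512)
learn.lr_find()  # the `loss_scale` UserWarning is printed repeatedly during this call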

It doesn’t seem to affect the results of lr_find (I get results similar to those from comparable code run on the same data in previous versions of fastai), but I’m not sure whether it’s something I should be concerned about.


Same here.

Indeed! This is now fixed in master, and was due to the regularization (AR and TAR) being computed in half precision instead of full precision.
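For context, AR and TAR are penalties computed from the LSTM’s raw activations, and the gist of the fix is to cast those activations to full precision before computing the penalties. A rough illustration of the idea (not the actual fastai code; names and coefficients are illustrative):

import torch

def ar_tar_penalty(raw_output, alpha=2.0, beta=1.0):
    # `raw_output`: (seq_len, batch, hidden) activations, possibly in fp16.
    h = raw_output.float()                       # promote to full precision first
    ar = alpha * h.pow(2).mean()                 # AR: activation regularization
    tar = beta * (h[1:] - h[:-1]).pow(2).mean()  # TAR: temporal activation regularization
    return ar + tar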

Does this mean I should re-run any experiments that were conducted while this bug was present?

That would be safer, yes.

So to clarify, this will be fixed in the next release of fastai?

Yes, and in the meantime it’s in master.

I am seeing this same warning when running fit_one_cycle on a unet_learner; it appears after 6 of 10 epochs.

/opt/anaconda3/lib/python3.7/site-packages/fastai/callbacks/fp16.py:97: UserWarning: You have a `loss_scale` factor that is too high, try to divide it by 2 (current value: 512).
  warn(f"You have a `loss_scale` factor that is too high, try to divide it by 2 (current value: {self.loss_scale}).")

Learning rate and loss were both low before the error:

lr=7e-4

epoch  train_loss  valid_loss  dice      dice      time
0      0.105225    0.108841    0.700169  0.800594  05:27
6      0.094356    0.097013    0.704540  0.815099  05:23
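Roughly, the setup looks like this (a sketch only; resnet34 is just an example backbone and data is a pre-built segmentation DataBunch):

from fastai.vision import unet_learner, models
from fastai.metrics import dice

# `data` is assumed to be a segmentation DataBunch built beforehand;
# resnet34 is a placeholder backbone.
learn = unet_learner(data, models.resnet34, metrics=[dice]).to_fp16()
learn.fit_one_cycle(10, max_lr=7e-4)  # the warning starts appearing after epoch 6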

This is a GCP instance.

=== Software ===
python        : 3.7.1
fastai        : 1.0.47.post1
fastprogress  : 0.1.20
torch         : 1.0.1.post2
nvidia driver : 410.72
torch cuda    : 10.0.130 / is available
torch cudnn   : 7402 / is enabled

=== Hardware ===
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 15079MB | Tesla T4

=== Environment ===
platform      : Linux-4.9.0-8-amd64-x86_64-with-debian-9.8
distro        : #1 SMP Debian 4.9.130-2 (2018-10-27)
conda env     : base
python        : /opt/anaconda3/bin/python
sys.path      :
/opt/anaconda3/lib/python37.zip
/opt/anaconda3/lib/python3.7
/opt/anaconda3/lib/python3.7/lib-dynload
/opt/anaconda3/lib/python3.7/site-packages

Sat Mar  9 16:44:06 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P0    28W /  70W |   2553MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17757      C   /opt/anaconda3/bin/python                   2543MiB |
+-----------------------------------------------------------------------------+

I get the warning occasionally now, but what it often means is that the model’s loss has diverged and the learning rate needs to be lowered. It doesn’t look like that’s what’s happening in your case, though?

After I increase the batch size beyond a certain point, I get the same warning while running lr_find on a U-Net with to_fp16(). Any clues why this happens? I’m running fastai 1.0.48.

You should use dynamic=True in to_fp16 (it should be the default now), as it handles those loss-scale adjustments automatically.
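For anyone curious what dynamic loss scaling does: if the gradients overflow, the step is skipped and the scale is halved; after enough clean steps, the scale is doubled again. A rough sketch of that logic (not fastai’s actual MixedPrecision callback; names are illustrative):

import torch

def dynamic_step(model, loss, opt, state, max_noskip=1000):
    # `state` holds the current loss scale and a counter of clean steps,
    # e.g. state = {'scale': 512.0, 'noskip': 0}.
    (loss * state['scale']).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        state['scale'] /= 2           # overflow: back off and skip this update
        state['noskip'] = 0
    else:
        for g in grads:
            g.div_(state['scale'])    # unscale gradients before the optimizer step
        opt.step()
        state['noskip'] += 1
        if state['noskip'] >= max_noskip:
            state['scale'] *= 2       # stable for a while: try a larger scale
            state['noskip'] = 0
    opt.zero_grad()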
