To_fp16: UserWarning: You have a 'loss_scale' factor that is too high

Hi,

I am currently getting a warning I have never encountered before when using the `to_fp16` function:

/home/me/anaconda3/envs/fastai-usr/lib/python3.7/site-packages/fastai/callbacks/fp16.py:97: UserWarning: You have a `loss_scale` factor that is too high, try to divide it by 2 (current value: 512).
  warn(f"You have a `loss_scale` factor that is too high, try to divide it by 2 (current value: {self.loss_scale}).")

I am using fastai v1.0.45. I created a language model learner with language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3) and called to_fp16 on it. Then, when I run lr_find, this warning prints constantly. Changing the loss_scale parameter of to_fp16 doesn’t seem to have any effect on the warning; I set it as low as 0.125 and still got it.
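For reference, this is roughly the code involved (a minimal sketch; data_lm stands for the TextLMDataBunch I built beforehand):

# Minimal sketch of the setup described above (fastai v1 API).
# `data_lm` is assumed to be an existing TextLMDataBunch.
from fastai.text import language_model_learner, AWD_LSTM

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3).to_fp16(loss_scale=512)
learn.lr_find()  # the `loss_scale` UserWarning is printed repeatedly during this call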

It doesn’t seem to affect the results of lr_find (I get results similar to those from comparable code run on the same data in previous versions of fastai), but I’m not sure whether it’s something I should be concerned about.


Same here.

Indeed! This is now fixed in master, and was due to the regularization (AR and TAR) being computed in half precision instead of full precision.
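For context, AR and TAR are penalties computed from the LSTM’s raw activations, and the gist of the fix is to cast those activations to full precision before computing the penalties. A rough illustration of the idea (not the actual fastai code; names and coefficients are illustrative):

import torch

def ar_tar_penalty(raw_output, alpha=2.0, beta=1.0):
    # `raw_output`: (seq_len, batch, hidden) activations, possibly in fp16.
    h = raw_output.float()                       # promote to full precision first
    ar = alpha * h.pow(2).mean()                 # AR: activation regularization
    tar = beta * (h[1:] - h[:-1]).pow(2).mean()  # TAR: temporal activation regularization
    return ar + tar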

Does this mean I should re-run any experiments that were conducted while this bug was present?

That would be safer, yes.

So to clarify, this will be fixed in the next release of fastai?

Yes, and in the meantime it’s in master.

I am seeing this same warning when running fit_one_cycle on a unet_learner; it appears after 6 of 10 epochs.

/opt/anaconda3/lib/python3.7/site-packages/fastai/callbacks/fp16.py:97: UserWarning: You have a `loss_scale` factor that is too high, try to divide it by 2 (current value: 512).
  warn(f"You have a `loss_scale` factor that is too high, try to divide it by 2 (current value: {self.loss_scale}).")

Learning rate and loss were both low before the error:

lr=7e-4

epoch  train_loss  valid_loss  dice      dice      time
0      0.105225    0.108841    0.700169  0.800594  05:27
6      0.094356    0.097013    0.704540  0.815099  05:23
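Roughly, the setup looks like this (a sketch only; resnet34 is just an example backbone and data is a pre-built segmentation DataBunch):

from fastai.vision import unet_learner, models
from fastai.metrics import dice

# `data` is assumed to be a segmentation DataBunch built beforehand;
# resnet34 is a placeholder backbone.
learn = unet_learner(data, models.resnet34, metrics=[dice]).to_fp16()
learn.fit_one_cycle(10, max_lr=7e-4)  # the warning starts appearing after epoch 6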

This is a GCP instance.

=== Software ===
python        : 3.7.1
fastai        : 1.0.47.post1
fastprogress  : 0.1.20
torch         : 1.0.1.post2
nvidia driver : 410.72
torch cuda    : 10.0.130 / is available
torch cudnn   : 7402 / is enabled

=== Hardware ===
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 15079MB | Tesla T4

=== Environment ===
platform      : Linux-4.9.0-8-amd64-x86_64-with-debian-9.8
distro        : #1 SMP Debian 4.9.130-2 (2018-10-27)
conda env     : base
python        : /opt/anaconda3/bin/python
sys.path      :
/opt/anaconda3/lib/python37.zip
/opt/anaconda3/lib/python3.7
/opt/anaconda3/lib/python3.7/lib-dynload
/opt/anaconda3/lib/python3.7/site-packages

Sat Mar  9 16:44:06 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P0    28W /  70W |   2553MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17757      C   /opt/anaconda3/bin/python                   2543MiB |
+-----------------------------------------------------------------------------+

I get the warning occasionally now, but what it often means is that the model’s loss has diverged and the learning rate needs to be lowered. It doesn’t look like that’s what’s happening in your case, though?

After I increase the batch size beyond a certain point, I get the same warning while running lr_find on a U-Net with to_fp16(). Any clues why this happens? I’m running fastai 1.0.48.

You should use dynamic=True in to_fp16 (it should be the default now), as it handles those loss-scale adjustments automatically.
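For anyone curious what dynamic loss scaling does: if the gradients overflow, the step is skipped and the scale is halved; after enough clean steps, the scale is doubled again. A rough sketch of that logic (not fastai’s actual MixedPrecision callback; names are illustrative):

import torch

def dynamic_step(model, loss, opt, state, max_noskip=1000):
    # `state` holds the current loss scale and a counter of clean steps,
    # e.g. state = {'scale': 512.0, 'noskip': 0}.
    (loss * state['scale']).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        state['scale'] /= 2           # overflow: back off and skip this update
        state['noskip'] = 0
    else:
        for g in grads:
            g.div_(state['scale'])    # unscale gradients before the optimizer step
        opt.step()
        state['noskip'] += 1
        if state['noskip'] >= max_noskip:
            state['scale'] *= 2       # stable for a while: try a larger scale
            state['noskip'] = 0
    opt.zero_grad()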
