Using fp16 in Lesson3-Planet dataset

Hi guys,

I’ve been trying to use mixed precision training on the planet dataset. It all seems to run fine, but the loss and accuracy results I’m getting are quite different compared to not using mixed precision:

from functools import partial

# build the learner
arch = models.resnet50
# acc_02 is a partial function that calls accuracy_thresh with thresh=0.2
acc_02 = partial(accuracy_thresh, thresh=0.2)
f_score = partial(fbeta, thresh=0.2)
# learn = cnn_learner(data, arch, metrics=[acc_02, f_score]).to_fp16()  # mixed precision
learn = cnn_learner(data, arch, metrics=[acc_02, f_score])  # without mixed precision

#We use the LR Finder to pick a good learning rate
learn.lr_find()
learn.recorder.plot()
lr=0.001
learn.fit_one_cycle(1, slice(lr))

Results without mixed precision

Results with mixed precision

Has anyone come across a similar issue?
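I do understand fp16 is coarser than fp32, so some drift is expected. Here’s a tiny stdlib-only sketch (no fastai needed) of how coarse it actually is: half precision keeps only about 3 significant decimal digits and underflows to zero below roughly 6e-8, which would plausibly shift losses and metrics a little after a single cycle:

```python
import struct

def as_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision ('e' format)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(as_fp16(0.1))   # 0.0999755859375 -> only ~3 significant decimal digits survive
print(as_fp16(1e-8))  # 0.0 -> values this small (e.g. tiny gradients) underflow to zero
```

That said, small drift is one thing; the gap I’m seeing looks larger than that.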

Have you tried other notebooks as well?

I tested fp16 on the dog breeds and camvid notebooks and there was no real difference. Maybe you can try those as well?

Hi Patrick,

I’ve just tested fp16 on the camvid notebook and the fp16 results are totally different:
With fp16

Without fp16

Here is the output of show_install, hope someone can help…

=== Software === 
python        : 3.6.8
fastai        : 1.0.50.post1
fastprogress  : 0.1.20
torch         : 1.0.1.post2
nvidia driver : 418.43
torch cuda    : 10.0.130 / is available
torch cudnn   : 7402 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7949MB | GeForce RTX 2070

=== Environment === 
platform      : Linux-4.18.0-16-generic-x86_64-with-debian-buster-sid
distro        : Ubuntu 18.04 bionic
conda env     : Unknown
python        : /home/hbenitez/anaconda3/envs/fastai/bin/python
sys.path      : 
/home/hbenitez/anaconda3/envs/fastai/lib/python36.zip
/home/hbenitez/anaconda3/envs/fastai/lib/python3.6
/home/hbenitez/anaconda3/envs/fastai/lib/python3.6/lib-dynload
/home/hbenitez/anaconda3/envs/fastai/lib/python3.6/site-packages
/home/hbenitez/anaconda3/envs/fastai/lib/python3.6/site-packages/IPython/extensions

Tue Mar 26 21:53:54 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:1F:00.0 Off |                  N/A |
| 17%   36C    P8    23W / 175W |   4509MiB /  7949MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     18763      C   ...enitez/anaconda3/envs/fastai/bin/python  4499MiB |
+-----------------------------------------------------------------------------+

Here is my show_install output; there are definitely some differences:

=== Software ===
python : 3.6.4
fastai : 1.0.49
fastprogress : 0.1.20
torch : 1.0.1.post2
nvidia driver : 418.39
torch cuda : 10.0.130 / is available
torch cudnn : 7402 / is enabled

=== Hardware ===
nvidia gpus : 2
torch devices : 2

  • gpu0 : 32480MB | Tesla V100-PCIE-32GB
  • gpu1 : 32480MB | Tesla V100-PCIE-32GB

=== Environment ===
platform : Linux-4.15.0-20-generic-x86_64-with-debian-stretch-sid
distro : #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018
conda env : Unknown
python : /opt/conda/bin/python

I also re-ran both options:

With to_fp16():

epoch  train_loss  valid_loss  acc_camvid  time
0      0.981239    1.032274    0.750549    01:48
1      0.901010    0.648565    0.842442    01:34
2      0.669481    0.477927    0.870625    01:34

Without:

epoch  train_loss  valid_loss  acc_camvid  time
0      0.939555    0.714685    0.828325    02:25
1      0.811918    0.644321    0.841052    02:13
2      0.656373    0.475579    0.883946    02:14
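So on camvid, fp16 is clearly faster per epoch but lands a bit lower on accuracy. A quick sanity check of the final-epoch numbers (hand-copied from the tables above, with the mm:ss times converted to seconds):

```python
# Final-epoch numbers hand-copied from the two runs above
fp16_time, fp32_time = 94, 134          # 01:34 vs 02:14, in seconds
fp16_acc,  fp32_acc  = 0.870625, 0.883946

print(f"epoch speedup : {fp32_time / fp16_time:.2f}x")            # 1.43x
print(f"accuracy gap  : {(fp32_acc - fp16_acc) * 100:.2f} points") # 1.33
```

A ~1.4x speedup matches what I’d expect from tensor cores on an RTX card, but a 1.3-point accuracy gap after 3 epochs seems too large to be rounding noise alone.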