Mixed precision training

Were you able to compile Apex with the CUDA extension? If yes, which versions of gcc and nvcc do you have? Thanks.

It could be the absence of loss scaling. Not sure, though.

I’ve done some more testing with loss scales of 128, 1024, and the default 512, as well as dynamic loss scaling, without success. I opened an issue here. I wonder if fastai offers some way of printing out gradients during training for debugging purposes.
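
In the meantime, something like this custom callback might work for that (a minimal sketch against the fastai v1 callback API; GradientLogger is my own hypothetical name, not a fastai class):

from fastai.callback import Callback

class GradientLogger(Callback):
    "Print the gradient norm of every parameter after each backward pass."
    def __init__(self, learn): self.learn = learn
    def on_backward_end(self, **kwargs):
        for name, p in self.learn.model.named_parameters():
            if p.grad is not None:
                print(f'{name}: grad norm = {p.grad.data.norm():.4f}')

# usage: learn.callbacks.append(GradientLogger(learn)) before calling learn.fit(...)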

Like I said on the issue, I didn’t manage to reproduce it. Note that you shouldn’t pass any loss_scale; use dynamic loss scaling instead, as it works better.
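
For example (assuming the fastai v1 to_fp16 API, where these are controlled by the dynamic and loss_scale arguments):

learn = learn.to_fp16()                                 # dynamic loss scaling (recommended)
# learn = learn.to_fp16(loss_scale=512, dynamic=False)  # fixed loss scale, usually works worse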

So, dynamic loss scaling is actually implemented. Thanks.

For me, FP16 works rather well (vanilla fastai), but convergence is a bit delayed with respect to FP32, or with respect to a fastai env in which Apex is also installed. Tested on a Tesla V100 and a 1080 Ti.

Yes, it’s actually the default in the callback but not in the to_fp16 function, I just realized. Just fixed it in master, so it’s now the default everywhere.

Note that you will see a few iterations with no training because of the way dynamic loss scaling works: it starts with a really high scale that is divided by 2 as long as you overflow.
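
Roughly, the logic looks like this (an illustrative sketch of the general technique, not fastai’s actual implementation; the constants are made up):

import torch

scale, good_steps = 2.0**24, 0    # start with a very high scale

def fit_one_step(loss, model, opt, max_noskip=500):
    global scale, good_steps
    (loss * scale).backward()     # scale the loss so small fp16 grads don't underflow
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if any(torch.isinf(g).any() or torch.isnan(g).any() for g in grads):
        scale /= 2                # overflow: halve the scale and skip this step
        good_steps = 0
        opt.zero_grad()
        return
    for g in grads: g.div_(scale) # unscale the gradients before the update
    opt.step()
    opt.zero_grad()
    good_steps += 1
    if good_steps >= max_noskip:  # long stable run: try a larger scale again
        scale *= 2
        good_steps = 0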

Perfect, now it should work without any delay. I was experiencing that delay since I always used to_fp16.

Thanks!

Oh, btw: for Apex, I’m using gcc 8.2.1 and CUDA 10. They work perfectly.

So, the bug was in lr_find(), and the dev build from the master branch should now be free of it. Here’s the issue that refers to the bug: https://github.com/fastai/fastai/issues/1903

Thanks @sgugger for fixing it.

I have tested fp16 using

learn = language_model_learner(data_lm, TransformerXL).to_fp16()               # fp16, dynamic loss scaling
learn = language_model_learner(data_lm, TransformerXL).to_fp16(dynamic=False)  # fp16, fixed loss scale
learn = language_model_learner(data_lm, TransformerXL)                         # fp32 baseline

using 1,000 training rows and 100 validation rows:

fp16, dynamic=True:  time = 04:23
fp16, dynamic=False: time = 04:20
no fp16:             time = 00:51

Why is it slower in fp16 mode?
Thanks!

The benefit of FP16 is only visible when using modern GPUs like the V100. Also, you have to make sure all your tensor dimensions are multiples of 8.

Thanks @sgugger !!

BTW, do you mean the input batch tensors, each tensor in the model, or both need to be multiples of 8? If it’s the tensors in the model, what about the word embedding dimensions?

For reference, here’s some info about the GPU I tested on: it’s a P100.

Every dimension of the tensors you have (including embedding size, vocab size, hidden size…).
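
For instance, you could round every size up to the next multiple of 8 when building the model (a minimal sketch; round_up and the concrete numbers below are hypothetical, not fastai defaults):

def round_up(n, base=8):
    "Round n up to the nearest multiple of base."
    return ((n + base - 1) // base) * base

vocab_sz = round_up(len(data_lm.vocab.itos))  # pad the vocab to a multiple of 8
emb_sz   = round_up(400)                      # embedding size
bs       = round_up(60)                       # batch size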

In computer vision, should our image sizes also be multiples of 8?

Yes, if you want to see the full speed benefit.

Thanks!
If we use fastai’s mixed precision, do we still need to install the NVIDIA Apex library to support fastai’s mixed precision API?

No, you don’t need Apex to use mixed precision in fastai.

Hi, I am trying mixed precision training with fastai on a Google Cloud instance, and I am facing a big problem: it is not possible to save the trained model.
The returned error stack is:

TypeError                                 Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    327     with _open_file_like(f, 'wb') as opened_file:
--> 328         _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
    329

/opt/conda/lib/python3.7/site-packages/torch/serialization.py in _legacy_save(obj, f, pickle_module, pickle_protocol)
    395
--> 396     pickle_module.dump(MAGIC_NUMBER, f, protocol=pickle_protocol)
    397     pickle_module.dump(PROTOCOL_VERSION, f, protocol=pickle_protocol)

TypeError: file must have a 'write' attribute

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
in <module>
----> 1 learn.save()

/opt/conda/lib/python3.7/site-packages/fastai/basic_train.py in save(self, file, return_path, with_opt)
    252         if not with_opt: state = get_model(self.model).state_dict()
    253         else: state = {'model': get_model(self.model).state_dict(), 'opt': self.opt.state_dict()}
--> 254         torch.save(state, target)
    255         if return_path: return target
    256

/opt/conda/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    326
    327     with _open_file_like(f, 'wb') as opened_file:
--> 328         _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
    329
    330

/opt/conda/lib/python3.7/site-packages/torch/serialization.py in __exit__(self, *args)
    205 class _open_buffer_writer(_opener):
    206     def __exit__(self, *args):
--> 207         self.file_like.flush()
    208
    209

AttributeError: 'NoneType' object has no attribute 'flush'

Training works correctly, and it is possible to export the model, but it is not possible to just call learn.save().

The output of show_install(0) is:

=== Software === 
python        : 3.7.6
fastai        : 1.0.61
fastprogress  : 0.2.2
torch         : 1.4.0
nvidia driver : 418.87
torch cuda    : 10.1 / is available
torch cudnn   : 7603 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 15079MB | Tesla T4

=== Environment === 
platform      : Linux-4.9.0-12-amd64-x86_64-with-debian-9.12
distro        : #1 SMP Debian 4.9.210-1 (2020-01-20)
conda env     : base
python        : /opt/conda/bin/python
sys.path      : /home/jupyter/cowc
/opt/conda/lib/python37.zip
/opt/conda/lib/python3.7
/opt/conda/lib/python3.7/lib-dynload

/opt/conda/lib/python3.7/site-packages
/opt/conda/lib/python3.7/site-packages/IPython/extensions
/home/jupyter/.ipython

Fri May 15 15:44:10 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P0    27W /  70W |   7565MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11706      C   /opt/conda/bin/python                       7549MiB |
+-----------------------------------------------------------------------------+

Does anybody know what is happening?

Can you try setting it back to full precision before saving it? I.e. .to_fp32()
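
For example (a minimal sketch; 'stage-1' is just a placeholder file name):

learn = learn.to_fp32()  # convert the model back to full precision
learn.save('stage-1')    # then save under an explicit name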

I tried, same error.

Looks like the problem I am facing is related to GCP, not to mixed precision training, since it happens with the lesson one (pets dataset) notebook too, without using mixed precision training. There is something wrong with my configuration, even if it is the default one.