Mixed precision training

The benefit of FP16 is only visible when using modern GPUs like V100s. Also, you have to make sure all your tensor dimensions are multiples of 8.
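As a rough way to check whether a given card is "modern" in this sense, here is a minimal sketch. It assumes that the speedup comes from Tensor Cores, which NVIDIA introduced with compute capability 7.0 (Volta); the helper names `has_tensor_cores` and `current_gpu_capability` are made up for illustration and are not part of fastai.

```python
def has_tensor_cores(capability):
    """Return True if a (major, minor) compute capability is 7.0 or newer.

    Assumption: FP16 speedups come from Tensor Cores, first available on Volta.
    """
    major, _minor = capability
    return major >= 7


def current_gpu_capability():
    """Return the compute capability of GPU 0, or None if it cannot be queried."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


# Compute capabilities of the cards mentioned in this thread.
KNOWN_CARDS = {"P100": (6, 0), "V100": (7, 0), "T4": (7, 5)}

for name, capability in KNOWN_CARDS.items():
    print(name, has_tensor_cores(capability))

capability = current_gpu_capability()
if capability is not None:
    print("this machine:", capability, has_tensor_cores(capability))
```

By this check a P100 would not be expected to show much benefit from FP16, while a V100 or T4 would.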

Thanks @sgugger !!

By the way, do you mean that the input batch tensors, each tensor in the model, or both need to be multiples of 8? If it applies to the tensors in the model, what about the word embedding dimensions?

For reference, the GPU I tested on is a P100.

Every dimension of the tensors you have (including embedding size, vocab size, hidden size…).
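A minimal sketch of how the sizes could be rounded up before building the model. The helper `round_up_to_multiple` and the example sizes are hypothetical, chosen only to illustrate the arithmetic; they do not come from fastai.

```python
def round_up_to_multiple(value, multiple=8):
    """Round a positive size up to the nearest multiple of `multiple`."""
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    return ((value + multiple - 1) // multiple) * multiple


# Hypothetical model sizes, not taken from any particular model.
sizes = {"vocab_size": 60003, "embedding_size": 300, "hidden_size": 1150}
padded = {name: round_up_to_multiple(size) for name, size in sizes.items()}

for name in sizes:
    print(name, sizes[name], "->", padded[name])
```

For a vocabulary, the extra entries would simply be unused padding rows in the embedding matrix.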

In computer vision, should the image sizes also be multiples of 8?

Yes, if you want to see the full speed benefit.

Thanks.
If we use fastai's mixed precision, do we still need to install NVIDIA Apex to support fastai's mixed precision API?

No, you don’t need APEX to use mixed precision in fastai.
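For illustration, a minimal sketch of switching a learner to mixed precision in fastai v1 without Apex. It assumes a vision task and an existing `DataBunch` passed in as `data`; the wrapper `make_fp16_learner` and the choice of `resnet34` are only for the example.

```python
def make_fp16_learner(data):
    """Build a fastai v1 vision learner and switch it to mixed precision.

    `data` is assumed to be an existing fastai DataBunch. The architecture
    (resnet34) is an arbitrary choice for this sketch.
    """
    # Imported lazily so the sketch can be read and loaded without fastai installed.
    from fastai.vision import cnn_learner, models, accuracy

    learn = cnn_learner(data, models.resnet34, metrics=accuracy)
    # to_fp16() is provided by fastai itself, so Apex is not required.
    return learn.to_fp16()
```

Calling .to_fp32() on the learner switches it back to full precision.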

Hi, I am trying mixed precision training with fastai on a Google Cloud instance, and I am facing a big problem: I cannot save the trained model.
The error traceback is:

TypeError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
327 with _open_file_like(f, 'wb') as opened_file:
--> 328 _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
329
/opt/conda/lib/python3.7/site-packages/torch/serialization.py in _legacy_save(obj, f, pickle_module, pickle_protocol)
395
--> 396 pickle_module.dump(MAGIC_NUMBER, f, protocol=pickle_protocol)
397 pickle_module.dump(PROTOCOL_VERSION, f, protocol=pickle_protocol)
TypeError: file must have a 'write' attribute
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
in
----> 1 learn.save()
/opt/conda/lib/python3.7/site-packages/fastai/basic_train.py in save(self, file, return_path, with_opt)
252 if not with_opt: state = get_model(self.model).state_dict()
253 else: state = {'model': get_model(self.model).state_dict(), 'opt':self.opt.state_dict()}
--> 254 torch.save(state, target)
255 if return_path: return target
256
/opt/conda/lib/python3.7/site-packages/torch/serialization.py in __exit__(self, *args)
205 class _open_buffer_writer(_opener):
206 def __exit__(self, *args):
--> 207 self.file_like.flush()
208
209
AttributeError: 'NoneType' object has no attribute 'flush'

Training works correctly, and it is possible to export the model, but calling learn.save() fails.
The output of show_install(0) is:

=== Software === 
python        : 3.7.6
fastai        : 1.0.61
fastprogress  : 0.2.2
torch         : 1.4.0
nvidia driver : 418.87
torch cuda    : 10.1 / is available
torch cudnn   : 7603 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 15079MB | Tesla T4

=== Environment === 
platform      : Linux-4.9.0-12-amd64-x86_64-with-debian-9.12
distro        : #1 SMP Debian 4.9.210-1 (2020-01-20)
conda env     : base
python        : /opt/conda/bin/python
sys.path      : /home/jupyter/cowc
/opt/conda/lib/python37.zip
/opt/conda/lib/python3.7
/opt/conda/lib/python3.7/lib-dynload

/opt/conda/lib/python3.7/site-packages
/opt/conda/lib/python3.7/site-packages/IPython/extensions
/home/jupyter/.ipython

Fri May 15 15:44:10 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P0    27W /  70W |   7565MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11706      C   /opt/conda/bin/python                       7549MiB |
+-----------------------------------------------------------------------------+

Does anybody know what is happening?

Can you try setting it back to full precision before saving it? I.e. .to_fp32()

I tried, same error.

It looks like the problem I am facing is related to GCP rather than to mixed precision training, since it also happens with the lesson 1 (pets dataset) notebook, without using mixed precision training. There is something wrong with my configuration, even though it is the default one.
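One more thing that may be worth checking, going only by the traceback above: it shows learn.save() being called with no file name, and the final AttributeError says the object being written to is None. The sketch below uses only the standard library pickle module (not torch or fastai) to show that pickling to None raises the same kind of TypeError about a missing 'write' attribute. The helper `dump_error` is hypothetical, and this is an observation under those assumptions rather than a confirmed diagnosis.

```python
import pickle


def dump_error(target):
    """Try to pickle a small object to `target`; return the error text, or None on success."""
    try:
        pickle.dump({"weights": [1, 2, 3]}, target)
    except TypeError as exc:
        return str(exc)
    return None


# Passing None stands in for a save call that received no file name.
message = dump_error(None)
print(message)
```

If that is what happened here, passing an explicit name, for example learn.save('stage-1') (the name is only illustrative), might be enough to avoid the error.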