Mixed precision training

sgugger · April 18, 2019, 12:54pm

The benefit of FP16 is only visible on modern GPUs like the V100. You also have to make sure every dimension of your tensors is a multiple of 8.

AlexanderChu · April 23, 2019, 3:46am

Thanks @sgugger !!

BTW, do you mean that the input batch tensors, each tensor in the model, or both need to be multiples of 8? If it applies to the tensors in the model, what about the word embedding dimensions?

For reference, the GPU I tested on is a P100.

sgugger · April 23, 2019, 12:43pm

Every dimension of the tensors you have (including embedding size, vocab size, hidden size…).
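
As a minimal sketch of the advice above, the snippet below rounds a few model sizes up to the next multiple of 8 and lists the ones that were not aligned. The helper name round_up and the example sizes are assumptions chosen for illustration; they do not come from this thread or from fastai.

```python
def round_up(n, multiple=8):
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

# Hypothetical model sizes, chosen only for illustration.
sizes = {"vocab_size": 60003, "emb_size": 400, "hidden_size": 1150}

# Sizes rounded up to the next multiple of 8.
padded = {name: round_up(n) for name, n in sizes.items()}

# Names of the sizes that were not already multiples of 8.
misaligned = [name for name, n in sizes.items() if n % 8 != 0]

print(padded)
print(misaligned)
```

Rounding the vocabulary size up simply means a few embedding rows are never used, which is usually an acceptable cost.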

champs.jaideep · August 18, 2019, 9:21am

In computer vision, should image sizes also be multiples of 8?

sgugger · August 19, 2019, 7:32am

Yes, if you want to see the full speed benefit.
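
For image sizes, one way to follow this is to pad (or resize) each image so that height and width land on a multiple of 8. The sketch below only computes the target size and the padding needed; pad_to_multiple is a hypothetical helper, not a fastai function, and the example dimensions are made up.

```python
def pad_to_multiple(height, width, multiple=8):
    """Return (new_height, new_width, pad_bottom, pad_right) so that both
    new dimensions are the next multiple of `multiple` at or above the input."""
    pad_h = (-height) % multiple  # rows to add at the bottom
    pad_w = (-width) % multiple   # columns to add on the right
    return height + pad_h, width + pad_w, pad_h, pad_w

# Hypothetical image sizes, for illustration only.
print(pad_to_multiple(225, 300))
print(pad_to_multiple(224, 224))
```

Choosing a training size that is already aligned, such as 224 or 256, avoids the padding altogether.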

champs.jaideep · August 20, 2019, 4:45am

Thanks …
If we use fastai's mixed precision, do we still need to install NVIDIA Apex to support fastai's mixed precision API?

sgugger · August 20, 2019, 3:02pm

No, you don’t need APEX to use mixed precision in fastai.

django1 · May 15, 2020, 3:48pm

Hi, I am trying mixed precision training with fastai on a Google Cloud instance, and I am facing a big problem: I cannot save the trained model.
The error traceback is:

TypeError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
327 with _open_file_like(f, 'wb') as opened_file:
--> 328 _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
329
/opt/conda/lib/python3.7/site-packages/torch/serialization.py in _legacy_save(obj, f, pickle_module, pickle_protocol)
395
--> 396 pickle_module.dump(MAGIC_NUMBER, f, protocol=pickle_protocol)
397 pickle_module.dump(PROTOCOL_VERSION, f, protocol=pickle_protocol)
TypeError: file must have a 'write' attribute

During handling of the above exception, another exception occurred:

AttributeError Traceback (most recent call last)
in
----> 1 learn.save()
/opt/conda/lib/python3.7/site-packages/fastai/basic_train.py in save(self, file, return_path, with_opt)
252 if not with_opt: state = get_model(self.model).state_dict()
253 else: state = {'model': get_model(self.model).state_dict(), 'opt':self.opt.state_dict()}
--> 254 torch.save(state, target)
255 if return_path: return target
256
/opt/conda/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
326
327 with _open_file_like(f, 'wb') as opened_file:
--> 328 _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
329
330
/opt/conda/lib/python3.7/site-packages/torch/serialization.py in __exit__(self, *args)
205 class _open_buffer_writer(_opener):
206 def __exit__(self, *args):
--> 207 self.file_like.flush()
208
209
AttributeError: 'NoneType' object has no attribute 'flush'

Training works correctly, and it is possible to export the model, but it is not possible to just call learn.save().
The output of show_install(0) is:

=== Software === 
python        : 3.7.6
fastai        : 1.0.61
fastprogress  : 0.2.2
torch         : 1.4.0
nvidia driver : 418.87
torch cuda    : 10.1 / is available
torch cudnn   : 7603 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 15079MB | Tesla T4

=== Environment === 
platform      : Linux-4.9.0-12-amd64-x86_64-with-debian-9.12
distro        : #1 SMP Debian 4.9.210-1 (2020-01-20)
conda env     : base
python        : /opt/conda/bin/python
sys.path      : /home/jupyter/cowc
/opt/conda/lib/python37.zip
/opt/conda/lib/python3.7
/opt/conda/lib/python3.7/lib-dynload

/opt/conda/lib/python3.7/site-packages
/opt/conda/lib/python3.7/site-packages/IPython/extensions
/home/jupyter/.ipython

Fri May 15 15:44:10 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P0    27W /  70W |   7565MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11706      C   /opt/conda/bin/python                       7549MiB |
+-----------------------------------------------------------------------------+

Does anybody know what is happening?
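
One possible reading of the traceback, offered as an assumption rather than a diagnosis confirmed anywhere in this thread: the failing call is learn.save() with no file name, and both error messages are what you would expect if the target handed to torch.save were None. The sketch below reproduces only the first message, using the standard-library pickle module instead of torch or fastai; try_dump is a hypothetical helper written for this illustration.

```python
import io
import pickle

def try_dump(obj, target):
    """Pickle `obj` into `target`; return the TypeError raised, or None on success."""
    try:
        pickle.dump(obj, target)
    except TypeError as exc:
        return exc
    return None

# Stand-in for a model state dict, for illustration only.
state = {"weights": [0.1, 0.2]}

err_none = try_dump(state, None)            # no file object at all
err_buffer = try_dump(state, io.BytesIO())  # a real writable buffer

print(type(err_none).__name__, err_none)
print(err_buffer)
```

If that is what happened here, passing a name, as in learn.save('stage-1'), may be worth trying, though the thread does not settle the cause.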

muellerzr · May 15, 2020, 3:53pm

Can you try setting it back to full precision before saving it? I.e., .to_fp32()

django1 · May 15, 2020, 4:00pm

I tried, same error.

django1 · May 18, 2020, 7:23am

It looks like the problem I am facing is related to GCP, not to mixed precision training, since it also happens with the lesson 1 (pets dataset) notebook, which does not use mixed precision training. Something is wrong with my configuration, even though it is the default one.