Yes, multi-GPU doesn’t work (yet) with RNNs. This is a standing issue we will address after the course is finished.
thanks for your answer and your time sgungger.
I found an alternate way to remove DataParallel
from a saved learner. I have no idea if this a legitimate way to go about this, but none of the methods in this thread, e.g., modifying the state_dict
, were working for me.
Also, somehow I had managed to save the learner.pth
file with DataParallel
wrapped around it twice, so to load it I needed to call
learn.model = torch.nn.DataParallel(learn.model)
twice, though this works without the double wrap.
Calling
learn.model
showed that DataParallel
was indeed wrapping the model, but I could access the non-parallel part with
learn.model.module.module
(you may omit the second .module if its not there twice), so I tried:
learn.model = learn.model.module.module
and got back the unwrapped model which then functioned as normal.
learn.get_preds()
worked like it should on one gpu.
I do not know if this has any side-effects, hopefully someone who knows something can comment.
Thanks for the amazing resources!!
When training CNN on multiple GPU via
learn = torch.nn.DataParallel(learn, device_ids=[0, 1])
After savingthe model and then loading the model for prediction I am getting this error:
learn.load(Name)
AttributeError: ‘DataParallel’ object has no attribute ‘load’
How should I solve this issue?
Your first line is incorrect, it should be:
learn.model = torch.nn.DataParallel(learn.model, device_ids=[0, 1])
Thanks for your reply. this works now but I cannot save the model that is being trained via DataParallel. what am I doing wrong here? :
mName_S = str('bestModel_+ aName)
learnS = create_cnn(data, arch,
metrics=[accuracy, error_rate],
callback_fns=[partial(CSVLogger, filename =str(‘stat_’ +str(tr)+‘S’+ aName)), ShowGraph,
partial(SaveModelCallback, monitor =‘val_loss’, mode =‘auto’, name = mName_S )])
learnS.model=torch.nn.DataParallel(learnS.model, device_ids=[0,1])
learnS.load(mName_S)
log_preds, y_true = learnS.TTA()
y_true = y_true.numpy()
y_preds = np.argmax(np.exp(log_preds), axis=1)
What I ideally want to do is to train a model on multiple GPU, then save it, and later on be able to load it and predict some data via TTA.
We can’t help without seeing the full error message and your version of fastai.
hi @sgugger
this is the error it generates:
Traceback (most recent call last):
File “GpyCode.py”, line 85, in
learnS.load(m)
File “/homes/…/python3.6/site-packages/fastai/basic_train.py”, line 217, in load
state = torch.load(self.path/self.model_dir/f’{name}.pth’, map_location=device)
File “/homes/…/python3.6/site-packages/torch/serialization.py”, line 365, in load
f = open(f, ‘rb’)
FileNotFoundError: [Errno 2] No such file or directory: ‘data/LC_B_5/models/bestModel_5_S__resnet101.pth’
it says it cannot load the trained model, however this line of code :
partial(SaveModelCallback, monitor =‘val_loss’, mode =‘auto’, name = mName_S )])
which is responsible for saving the trained model is not saving the model neither generates any error message while in single GPU and multi CPU mode it saves the trained model perfectly and therefore load it without problem.
Thanks
I’m looking into training a language model on a fairly large dataset and I’m hoping to use multiple GPUs. Is it still the case that RNNs don’t work with multi-GPU?
No it works properly. It’s the SaveModelCallback that might have some problem.
Hi
I am training ULMFiT on multiple GPUs. I am facing this issue that my GPUs usage is heavily imbalanced.
System configurations:
Used fastai and torch version:
1.0.51.dev0 1.0.1.post2
Here is the code:
save_model = partial(SaveModelCallback,
monitor=‘accuracy’,
every=‘improvement’,
name=‘best_lm’)
early_stop = partial(EarlyStoppingCallback,
monitor=‘val_loss’,
min_delta=0.01,
patience=2)
lm_learner = language_model_learner(data_lm, ARCHITECTURE,
drop_mult = DROP_OUT_MULTIPLIER,
callback_fns = [early_stop, save_model])
lm_learner= lm_learner.to_parallel()
lm_learner.freeze_to(-1)
lm_learner.fit_one_cycle(cyc_len = 1,
max_lr = 0.04,
moms = (0.8, 0.7))
I have the same problem as @soorajviraat i.e. when training language models only one gpu memory is fully utilized (in the case of CNNs everything is working correctly). I’ve run some benchmarks. Here are the results:
CNN
ARCH, GPU type, Dataset
note: jeremy suggested that limiting factor might be loaders and to use torchvision to fix it. It’s an old post. I don’t know if it the problem was already addressed so I’m planning to check it and add the results to this post
Resnet34, 4xK80 GPU, FastAI PETS
- 16 : single_GPU/parallel, 0:58/1:20 min each, 4:00 min total, valid loss single/par 0.21/0.27
- 64 : single_GPU/parallel, 0:46/0:32 min each, 2:30 min total, valid loss single/par 0.21/0.21
- 256 : single_GPU/parallel/fp+p, 0:50/0:27/0:26 min each, 2:00 min total, valid loss single/par/p+fp 0.23/0.24/0.25
- 1024 does not fit to single GPU; par/par+par+fp 0:40/0:39 valid loss 0.359/0.36
Language Model (RNN)
GPU type, Dataset
4xK80 GPU, FastAI idbm
AWD_LSTM
- bs 48 singleGPU/parallel/fp/p+fp 1:24:00/1:23:00/1:54:20/1:00:00
- bs 96 singleGPU/fp/p+fp n.a./1:03:00/1:03:00
- bs 136 par 0:58:00
Transformer
- bs 36 singleGPU/par 3:12:00/3:15:00
- bs 48 singleGPU/fp/par 3:15:00/6:40:00/2:40:00
- bs 96 singleGPU/parallel/p+fp n.a./1:40:00/2:20:00
TransformerXL
- bs 96 singleGPU/parallel/p+fp n.a./1:13:00/
Classifier
AWD_LSTM
- bs 100 singleGPU/parallel 09:15/5:52
- bs 136 singleGPU/parallel/fp/p+fp total n.a./5:17/12:39/5:17
Code I used for parralelization is following
learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.model = torch.nn.DataParallel(learn.model)
I’m not sure what’s the cause. Maybe it is working correctly (we can see a speed up in language models even though memory is limiting the batch size and thus parallelization speed up). It would be great if someone experienced could interpret the results we have here
I tried this but, in the following line:
learn.lr_find()
I got the following error:
Expected tensor for argument #1 ‘input’ to have the same device as tensor for argument #2 ‘weight’; but device 2 does not equal 0 (while checking arguments for cudnn_batch_norm)
Thank you @shaun1 this is very helpful. I am struggling with point (5), loading the state_dict into the model. When I do
model.load_state_dict(new_state_dict)
I get the error message:
‘LanguageLearner’ object has no attribute ‘load_state_dict’
Any help would be great!
Does multi gpu works with unet learner?
Like anything else, as long as you use distributed
training. DataParallel
does not work with the unet_learner
.
Hi @soorajviraat,
I have the same issue as yours with following code:
learn = language_model_learner(data_lm, AWD_LSTM, config=config, model_dir=self.model_dir.relative_to(data_lm.path), pretrained=False, **trn_args)
learn = learn.to_parallel()
Did you find any solution or reason? Is it maybe because of the famous python’s Global Interpreter Lock (GIL) that allows only one thread to run in multi threading application? I used 4 GPUs, so I though that the performance could increase at least more than 3x, but instead of 50s/epoch (with 1 GPU), I got only 37s/epoch (with 4GPUs). Here is my screenshot:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.126.02 Driver Version: 418.126.02 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:06:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 11MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:07:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 11MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:0A:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 11MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:0B:00.0 Off | 0 |
| N/A 27C P0 41W / 300W | 11MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:85:00.0 Off | 0 |
| N/A 43C P0 78W / 300W | 4220MiB / 32480MiB | 56% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:86:00.0 Off | 0 |
| N/A 43C P0 68W / 300W | 4220MiB / 32480MiB | 64% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:89:00.0 Off | 0 |
| N/A 45C P0 72W / 300W | 4220MiB / 32480MiB | 53% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:8A:00.0 Off | 0 |
| N/A 43C P0 141W / 300W | 26998MiB / 32480MiB | 37% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 4 21447 C python 4209MiB |
| 5 21447 C python 4209MiB |
| 6 21447 C python 4209MiB |
| 7 21447 C python 26987MiB |
+-----------------------------------------------------------------------------+
There was similar issue in https://github.com/pytorch/pytorch/issues/3917. The user mentioned the imbalanced memory usage might be caused by a memory leak in the loop of the main thread. Maybe fastai has also a memory leak in the main loop? Btw I use fastai 1.0.61 and pytorch 1.3.1
After some more searches, I found out that such imbalance GPU usage using DataParallel is an expected behaviour. It occurs because the result of all parallel computation are gathered in the main GPU as explained in the following article by Thomas Wolf from huggingface: https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255. The Pythons GIL could further slow down the multi threading used by DataParallel. A better approach is using Distributed training with torch.distributed.launch. It uses multi processing instead of multi threading and provides balanced GPU usage. It is also mentioned in fastai : https://docs.fast.ai/distributed.html
Did anybody out there figure out how to use multiple GPUs with language_model_learner, using fast.ai version 2? Links to code examples would really help.
I am trying to use a Jupyter notebook in AzureML with 4 GPUs, but only 1 GPU gets utilized. I am trying to adjust code from https://github.com/fastai/fastbook/blob/master/10_nlp.ipynb to work with multiple GPUs.
For example, my changes look like this:
learn.model.cuda()
dls_lm.cuda()
learn.to_parallel(device_ids=[0,1,2,3])
I encounter “RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:1 and hidden tensor at cuda:0”
I understand that CUDA wants me to send the model and the input to all GPUs, but after trying a few different things, I still see only 1 GPU being used:
Help with fast.ai would be really appreciated.