Hi @orangelmx, just out of curiosity: with all other config being equal, how long does it take to train on only one of these GPUs? Or do you see improvements in the single-GPU numbers if you increase the batch size even more (since it’s only using 5/24 GB in the 2-GPU config)? I’m guessing it would probably be able to use bs=512 with some (marked?) improvement in time.
Hi @mike.moloch,
I was interested in how to use fastai with multiple GPUs because, with large images, you need more than just 16 GB of GPU memory. For some of the competitions on Kaggle or other sites you need more than 24 GB. At least that has been my understanding from the forums: the people who win the competitions, the ones in the first places, can train with more GPU RAM and therefore with better image resolution.
Like in my last competition, on the Great Coral Reef, the people in first place were using GPUs with 48 GB of memory; the image size was 3400x3400.
For me, fastai is one of the best frameworks out there. The other is PyTorch Lightning, but from my point of view it is missing some things that fastai gives you; it has a couple of things like version saving, like yolov5, but not much more, so I prefer to use fastai.
Now, for the sample code I have, it is not a big deal, as every image is only 96x96 pixels… but for bigger images, you need lots of memory…
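To put a rough number on the gap between the two image sizes mentioned in this post, here is a back-of-the-envelope sketch. It assumes that activation memory per image grows roughly in proportion to the number of input pixels, which is a simplification (real usage depends on the architecture, precision, and framework overhead), and the function name is only illustrative.

```python
def pixel_ratio(large_side: int, small_side: int) -> float:
    """Ratio of pixel counts between two square images.

    Assumption: activation memory per image scales roughly linearly with
    the number of input pixels. This is an estimate, not a measurement.
    """
    return (large_side * large_side) / (small_side * small_side)


# Image sizes taken from the post: 3400x3400 versus 96x96.
ratio = pixel_ratio(3400, 96)
print(f"A 3400x3400 image has about {ratio:.0f}x the pixels of a 96x96 image")
```

Under that assumption, even a batch size of 1 at full resolution costs far more memory than a large batch of 96x96 images.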
That is because with a minimal batch size, like 2, 4 or 8, you can’t estimate the gradients very well. I have found that a batch size of 32 works best; in some of my experiments I get better results with 16 or 32 than with 64, 128, or 256. If you look on the web, you will find plenty of advice against training with a big batch size.
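The point about small batches giving poor gradient estimates can be illustrated with a toy simulation. This is only a sketch: per-sample "gradients" are simulated as draws from a unit Gaussian rather than taken from a real model, and the function name and trial count are arbitrary choices. It shows the noise side of the trade-off only; it says nothing about which batch size generalizes best.

```python
import random
import statistics


def minibatch_mean_spread(batch_size: int, trials: int = 2000, seed: int = 0) -> float:
    """Standard deviation of the mini-batch mean across many random batches.

    Each per-sample "gradient" is a draw from a unit Gaussian, a toy
    stand-in for real per-sample gradients.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.fmean(batch))
    return statistics.stdev(means)


# The spread of the batch mean shrinks roughly like 1 / sqrt(batch_size).
for bs in (2, 8, 32, 128):
    print(bs, round(minibatch_mean_spread(bs), 3))
```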
So, in conclusion, the 49G of memory is for image size, not for increasing the batch size. I hope this helps. If you have any other questions, please ask.
Nice work – thank you for sharing!
Thanks for the detailed answer @orangelmx, I didn’t know that increasing the batch size could have a negative impact on training outcomes.
Batch size also works as a hyperparameter, so increasing or decreasing the BS while keeping the other hyperparameters constant can affect training positively or negatively.
While tweaking batch_size we may also have to choose a suitable optimizer, learning rate, and scheduler.
You can check this paper, ResNet Strikes Back, where the authors used a crazy BS of 2048 and achieved SOTA results.
I have also written a summary here.
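On the point above about retuning the learning rate when batch_size changes: one common starting heuristic is to scale the learning rate linearly with the batch size. The sketch below assumes that heuristic; it is not something this thread or the linked paper prescribes, and the baseline values (lr=1e-3 tuned at bs=32) are hypothetical.

```python
def scaled_lr(base_lr: float, base_bs: int, new_bs: int) -> float:
    """Linear scaling heuristic: scale the learning rate with the batch size.

    This is only a starting point for a search, not a guarantee that
    training behaves the same at the new batch size.
    """
    if base_bs <= 0 or new_bs <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_bs / base_bs


# Hypothetical baseline: lr=1e-3 was tuned at bs=32.
for bs in (32, 64, 256, 2048):
    print(bs, scaled_lr(1e-3, 32, bs))
```

In practice the scaled value is usually checked again with a learning-rate finder or a short sweep, often together with a warmup schedule.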
That’s great that nn.DataParallel worked well for you. It normally doesn’t improve training time much at all, and sometimes makes it worse. That’s why I suggested using nn.parallel.DistributedDataParallel, which generally gives close to linear speedup.
Hi, this is very frustrating (I am running on a Mac).
I can’t even install fastai.
I have been using Colab and trained my model. Then I opened a new notebook in Colab to do this part: fast.ai Live - Lesson 2 - YouTube
When I add from fastai.vision.all import *
I get:
ModuleNotFoundError Traceback (most recent call last)
in ()
----> 1 from fastai.vision.all import *

/usr/local/lib/python3.7/dist-packages/fastai/vision/all.py in ()
      1 from . import models
      2 from ..basics import *
----> 3 from ..callback.all import *
      4 from .augment import *
      5 from .core import *

ModuleNotFoundError: No module named 'fastai.callback.all'; 'fastai.callback' is not a package
I have activated the extra GPU too, and it is still the same issue.
Why would this happen randomly?
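The message that 'fastai.callback' is not a package means Python resolved fastai.callback to a plain module instead of a directory package. One possible cause, which this thread does not confirm, is a stale or mixed install of two fastai versions. A diagnostic sketch for seeing what the interpreter actually resolves is below; the helper name is illustrative, and standard-library names are used so that it runs anywhere.

```python
import importlib.util


def describe_module(name: str) -> str:
    """Report whether `name` resolves to a package, a plain module, or nothing."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when a parent of a dotted name is missing or is not a package.
        return f"{name}: parent is missing or not a package"
    if spec is None:
        return f"{name}: not found"
    kind = "package" if spec.submodule_search_locations is not None else "module"
    return f"{name}: {kind} at {spec.origin}"


# In the failing notebook you would pass "fastai" and "fastai.callback";
# stdlib names are used here so the snippet runs without fastai installed.
for mod in ("json", "os", "json.decoder"):
    print(describe_module(mod))
```

If fastai.callback shows up as a module rather than a package, reinstalling with pip install -U fastai and restarting the runtime is a reasonable thing to try.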
I am not a pro in Colab, but I would recommend checking here: Using Colab | Practical Deep Learning for Coders
I do it on a daily basis, and so can you by following these steps:
PS: it’s still a pain, that’s why I’m working on it
I personally think it’s a combination of both:
model on the server:
- bigger model
- more compute
model on device:
- works offline
- no latency
- privacy
This is what I would do:
SeeMe.ai Deployment. | Practical Deep Learning for Coders
Have you tried using fastai’s support for DataParallel?
Great summary
Thanks for that link btw. I was just trying to see what the throughput on the V100 would be if we were to near-saturate the memory (get as close as 95-99%) … increasing bs was just to hit the memory hard. Truth be told, I didn’t really think much about the actual training process, but I’ve learned something about the impact of batch size on architecture performance. This is the power of such an amazing community.
Thanks for writing this up.
Hi Pomo,
In general, the way to rediscover this yourself is to search the video’s autogenerated Transcript. Click the [three dots] in the bottom right and [Show Transcript].
Then hit <CTRL-F> to find the text you need.
That’s how I found this mamba ref.
Hi @ilovescience,
thanks, but are you referring to this?
learn.model = torch.nn.DataParallel(learn.model, device_ids=[0, 1])
or this
learn.to_parallel(device_ids=[0, 1])
I hit this problem going through the Binder example from the 2020 course. Sorry to say I didn’t solve it. Since the 2022 course changed to HuggingFace, so did I (but I only got halfway through it before sleep overcame me).
P.S. A sometimes useful principle… If you can’t solve the problem, change the problem!