I am trying to run a fastai trainer on multiple (x4) GPUs on the same machine. For this, I use the distrib_ctx context manager.
I’m launching the script using:
accelerate launch train.py
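For context, the documented script-mode pattern looks roughly like this. This is a minimal sketch, not the poster's actual train.py: the dataset path, folder names, architecture, and epoch count are placeholders, and it assumes an ImageNet-style folder layout.

```python
# train.py -- run with: accelerate launch train.py
from fastai.vision.all import *
from fastai.distributed import *  # patches Learner with distrib_ctx

path = Path('data/')  # placeholder dataset path
# Placeholder folder names; adjust to your own train/valid split
dls = ImageDataLoaders.from_folder(path, train='training', valid='validation', bs=64)
learn = vision_learner(dls, resnet34, metrics=error_rate)

# distrib_ctx wraps training in DistributedDataParallel across the
# GPUs that `accelerate launch` spawned processes for
with learn.distrib_ctx():
    learn.fine_tune(2)
```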
Also, I’m using the following
Unfortunately, the script doesn’t work; it fails with:
AttributeError: 'Sequential' object has no attribute 'distrib_ctx'
I’ve created the vision_learner and it works fine on a single GPU. However, as soon as I try to run it on multiple GPUs, it fails.
Could you please give me some tips about multi-GPU training with the fastai library? It would be great if I could keep training in fastai rather than switching to another trainer to scale up.
What is your experience? How do you train a Learner on multiple GPUs? Also, the version I use is
I’m going to need a whole heck of a lot more information. What are your imports like? What’s your learner like? Are you doing from fastai.distributed import *? Please try to include as much information as possible, as I currently can’t provide anything truly helpful.
Hi @muellerzr, I was trying to run my code on multiple GPUs on my local machine. The plain Python file (without the notebook_launcher part) runs fine, but there is an issue when I run the code in a Jupyter notebook. I have added the code and the error for reference. Also, I am using fastai 2.7.12.
from fastai.vision.all import *
from accelerate import notebook_launcher
from fastai.distributed import *
from accelerate.utils import write_basic_config

base_path = Path('data/')
all_images = get_image_files(base_path)

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=GrandparentSplitter(train_name='training', valid_name='validation'),
    get_y=parent_label,
).dataloaders(base_path, bs=128)

learn = vision_learner(dls, 'efficientnet_b2', metrics=error_rate).to_fp16()
And this is the error I am getting:
ValueError                                Traceback (most recent call last)
Cell In, line 4
      2 with learn.distrib_ctx():
----> 4 notebook_launcher(train, num_processes=2)

File ~/fastai_projects/fastai_env/lib/python3.10/site-packages/accelerate/launchers.py:123, in notebook_launcher(function, args, num_processes, mixed_precision, use_port)
    122 if torch.cuda.is_initialized():
--> 123     raise ValueError(
    124         "To launch a multi-GPU training from your notebook, you need to avoid running any instruction "
    125         "using `torch.cuda` in any cell. Restart your notebook and make sure no cells use any CUDA "

ValueError: To launch a multi-GPU training from your notebook, you need to avoid running any instruction using `torch.cuda` in any cell. Restart your notebook and make sure no cells use any CUDA function.
Can you help me with this?
The error is quite clear: you can’t run any code that initializes CUDA before the launcher. Please see the docs for examples of how to do this; you’ll notice that we don’t create the learner or dataloaders outside the training function, because doing so initializes CUDA: fastai - Notebook Launcher examples
Thanks, that worked like a charm. So everything that could possibly trigger .to('cuda') has to be moved into a separate function, which is then run via notebook_launcher.
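For anyone landing here later, the working notebook pattern looks roughly like this. It reuses the dataset path, split names, and architecture from the snippet above; the in_notebook and sync_bn arguments and the single fine_tune epoch follow the fastai Notebook Launcher docs, and this is a sketch rather than the exact code the poster ran.

```python
from fastai.vision.all import *
from fastai.distributed import *
from accelerate import notebook_launcher

def train():
    # Everything that can touch CUDA (DataLoaders, Learner creation,
    # fitting) must live inside this function, so that CUDA is only
    # initialized inside the spawned worker processes.
    base_path = Path('data/')
    dls = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_image_files,
        splitter=GrandparentSplitter(train_name='training', valid_name='validation'),
        get_y=parent_label,
    ).dataloaders(base_path, bs=128)
    learn = vision_learner(dls, 'efficientnet_b2', metrics=error_rate).to_fp16()
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fine_tune(1)

# One process per GPU; no earlier cell may have used CUDA
notebook_launcher(train, num_processes=2)
```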