Learner.distrib_ctx doesn't work on v2.7.11

Hi,

I am trying to run a fastai trainer on multiple (4x) GPUs on the same machine. For this, I use the distrib_ctx context manager.

with trainer.distrib_ctx():
    trainer.fine_tune(
        args.epochs, 
        base_lr=args.base_lr, 
        freeze_epochs=args.freeze_epochs
    )

I’m launching the script using accelerate.

accelerate launch train.py

Also, I’m using the following accelerate configuration.

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false

Unfortunately, the script fails with the following error.

AttributeError: 'Sequential' object has no attribute 'distrib_ctx'

I’ve created the learner with vision_learner and it works fine on a single GPU. However, as soon as I try to run it on multiple GPUs, it fails.

Could you please give me some tips on multi-GPU training with the fastai library? It would be great if I could keep training in fastai rather than switch to another trainer to scale up.

What is your experience? How do you train a Learner on multiple GPUs? For reference, the version I use is 2.7.11.

I’m going to need a whole lot more information 🙂 What are your imports like? What’s your learner like? Are you doing from fastai.distributed import *? Please try to include as much information as possible, as I currently can’t provide anything truly helpful.
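For context, here is a toy sketch (plain Python, not fastai’s actual code) of why that import matters: fastai.distributed adds distrib_ctx to Learner at import time via monkey-patching, so calling it without that import, or calling it on the underlying nn.Sequential model instead of the Learner, raises exactly this kind of AttributeError.

```python
from contextlib import contextmanager

class Learner:
    """Stand-in for fastai's Learner (a toy, not the real class)."""
    pass

# Before the "import", the attribute simply does not exist -- the same
# situation that produces an AttributeError at call time.
assert not hasattr(Learner, "distrib_ctx")

# What `from fastai.distributed import *` effectively does: it patches
# distrib_ctx onto Learner (fastai uses its @patch decorator for this).
@contextmanager
def distrib_ctx(self):
    # the real fastai version wraps the model for distributed training here
    yield self

Learner.distrib_ctx = distrib_ctx

learn = Learner()
with learn.distrib_ctx() as l:
    assert l is learn  # training would go here
```

The same mechanism explains the 'Sequential' in the error message: the raw PyTorch model never receives the patched method, only the Learner does.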


Hi @muellerzr, I was trying to run my code on multiple GPUs on my local machine. The Python script (without the notebook_launcher part) runs fine, but there is an issue when I run the code in a Jupyter notebook. I have added the code and the error for reference. I am using fastai 2.7.12.

from fastai.vision.all import *
from accelerate import notebook_launcher
from fastai.distributed import *
from accelerate.utils import write_basic_config
write_basic_config()
set_seed(99, True)

base_path = Path('data/')
all_images = get_image_files(base_path)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=GrandparentSplitter(train_name='training', valid_name='validation'),
    get_y=parent_label,
    item_tfms=[Resize(224, method='squish')],
).dataloaders(base_path, bs=128)
learn = vision_learner(dls, 'efficientnet_b2', metrics=error_rate).to_fp16()
def train():
    with learn.distrib_ctx(): 
        learn.fine_tune(4)
notebook_launcher(train, num_processes=2)

And this is the error I am getting:

ValueError                                Traceback (most recent call last)
Cell In[7], line 4
      2     with learn.distrib_ctx(): 
      3         learn.fine_tune(4)
----> 4 notebook_launcher(train, num_processes=2)

File ~/fastai_projects/fastai_env/lib/python3.10/site-packages/accelerate/launchers.py:123, in notebook_launcher(function, args, num_processes, mixed_precision, use_port)
    116     raise ValueError(
    117         "To launch a multi-GPU training from your notebook, the `Accelerator` should only be initialized "
    118         "inside your training function. Restart your notebook and make sure no cells initializes an "
    119         "`Accelerator`."
    120     )
    122 if torch.cuda.is_initialized():
--> 123     raise ValueError(
    124         "To launch a multi-GPU training from your notebook, you need to avoid running any instruction "
    125         "using `torch.cuda` in any cell. Restart your notebook and make sure no cells use any CUDA "
    126         "function."
    127     )
    129 # torch.distributed will expect a few environment variable to be here. We set the ones common to each
    130 # process here (the other ones will be set be the launcher).
    131 with patch_environment(
    132     world_size=num_processes, master_addr="127.0.01", master_port=use_port, mixed_precision=mixed_precision
    133 ):

ValueError: To launch a multi-GPU training from your notebook, you need to avoid running any instruction using `torch.cuda` in any cell. Restart your notebook and make sure no cells use any CUDA function.

Can you help me with this?

The error is quite clear: you can’t run any code that initializes CUDA before the launch. Please see the docs for examples of how to structure this; you’ll notice that we don’t create the learner or dataloaders outside the training function, because doing so initializes CUDA: fastai - Notebook Launcher examples
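A toy sketch of the constraint (plain Python; the notebook_launcher and touch_gpu names here are stand-ins, not accelerate’s real implementation): anything that would initialize CUDA must run inside the launched function, never at module level, because the launcher refuses to start once CUDA has been touched in the parent process.

```python
# Simulated process state: stands in for torch.cuda.is_initialized().
state = {"cuda_initialized": False}

def touch_gpu():
    # Stands in for building dataloaders, creating a learner,
    # calling .to('cuda'), .to_fp16(), etc.
    state["cuda_initialized"] = True

def notebook_launcher(fn):
    # Mimics accelerate's pre-launch check seen in the traceback above.
    if state["cuda_initialized"]:
        raise ValueError("avoid running any instruction using torch.cuda before launch")
    return fn()

def train():
    touch_gpu()  # GPU work happens here, inside the launched function
    return "ok"

result = notebook_launcher(train)  # passes: CUDA untouched before launch
```

Applied to the notebook in question, this means moving the DataBlock/dataloaders construction, vision_learner, and to_fp16 into the train function itself before handing it to notebook_launcher.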


Thanks, that worked like a charm! So everything that could possibly trigger .to('cuda') has to be moved into a completely new function, which is then run via notebook_launcher.

Yep. Read more about it here: Launching Multi-GPU Training from a Jupyter Environment
