Distributed fastai doesn't work on JarvisLabs

Hi everyone,

I tried to run the fastai library on JarvisLabs with 2x GPUs, using its distributed functionality as follows.

# train.py

... # some setup code

learn = get_learner(get_dls(10))
learn.unfreeze()
with learn.distrib_ctx():
    learn.fit_one_cycle(1, slice(0.001, 0.01))

I launched it as:

accelerate launch train.py

However, the script failed with an error. I don’t have the stack trace at hand (I’ll try to post it a bit later, as I shut down the instance and cannot rerun it right now), but it was something about socket connection issues.

I think the problem could be unrelated to the fastai library itself, and is probably something with the instance configuration. Still, I decided to ask here in case someone has encountered any accelerate/fastai related issues on that platform.

Has anyone tried running a distributed context there, or on some other cloud provider? And in general, from your point of view, what would be the best platform for training on 2–4 GPUs?

Thank you!

Not much I can do off of that, but it sounds like a config error indeed.

Is it two GPUs on one system? Or do you know if it's two GPUs on different machines? :slight_smile:

I can help much more if we have a trace though!


Hi @devforfu

I was able to run a fastai program using multiple GPUs without any issues.
The steps I followed are:

pip uninstall fastai
pip install fastai
pip install accelerate
accelerate config
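
For a single machine with two GPUs, the config produced by accelerate config (typically written to ~/.cache/huggingface/accelerate/default_config.yaml) ends up looking roughly like the sketch below; the exact fields depend on the accelerate version and the answers you give:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false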

I made a copy of the fastai examples from here

And launched it using

accelerate launch fa_distrip.py

It ran smoothly. Can you try the above steps and let me know if you run into issues?


@muellerzr @VishnuSubramanian Thank you for such a prompt response!

Actually, I tried re-running my script, and this time I got a slightly different error:

RuntimeError: Expected to have finished reduction in the prior iteration before 
starting a new one. This error indicates that your module has parameters that were 
not used in producing loss. You can enable unused parameter detection by passing 
the keyword argument `find_unused_parameters=True` to 
`torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` 
function outputs participate in calculating loss. 

If you already have done the above, then the distributed data parallel module wasn't 
able to locate the output tensors in the return value of your module's `forward` function. 
Please include the loss function and the structure of the return value of `forward` of your
module when reporting this issue (e.g. list, dict, iterable).

Parameter indices which did not receive grad for rank 0: 501 502 503. In addition, 
you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either 
INFO or DETAIL to print out information about which particular parameters did not 
receive gradient on this rank as part of this error.
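
As the message suggests, re-launching with the debug variable set should list exactly which parameters did not receive gradients (same train.py entry point as before):

TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch train.py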

The snippet I use works on a single GPU, but fails when I launch it via accelerate. It looks as follows (with some parts omitted):

def get_datablock(valid_pct=0.1, subset_pct=0.1):
    # configure and create the data block
    return DataBlock(...)

def get_dls(batch_size, num_workers=4, prefetch_factor=4, **kwargs):
    # build train/valid DataLoaders from the data block
    return get_datablock(**kwargs).dataloaders(
        DATA_DIR, bs=batch_size, num_workers=num_workers,
        prefetch_factor=prefetch_factor
    )

def get_learner(dataloaders):
    # create the model and wrap it into a mixed-precision Learner
    model = create_model(...)
    learn = Learner(dataloaders, model, loss_func=bce_loss).to_fp16()
    learn.freeze()
    return learn

learn = get_learner(get_dls(10))
learn.unfreeze()
with learn.distrib_ctx():
    learn.fit_one_cycle(1, slice(0.001, 0.01))

The same code works in Jupyter, but produces the error above when running on two GPUs. I also tried the steps that @VishnuSubramanian recommended: the fastai example runs fine, but my snippet still fails. So it seems the problem is with my setup rather than with accelerate or the cloud platform. I’ll try to debug it more carefully.


Try enabling find_unused_parameters (it's an arg to learn.distrib_ctx).


UPDATED SOLUTION (courtesy of Zach):

from accelerate.utils import DistributedDataParallelKwargs

kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)

with learn.distrib_ctx(kwarg_handelers=[kwargs]):
    ....


Really appreciate you posting this.
Minor correction: It’s spelled kwargs_handlers.

from accelerate.utils import DistributedDataParallelKwargs

kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)

with learn.distrib_ctx(kwargs_handlers=[kwargs]):
    ....
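
For completeness, plugging the corrected argument back into the training script from the top of the thread looks roughly like this (a sketch that reuses the get_dls/get_learner helpers defined earlier), launched as before with accelerate launch train.py:

from accelerate.utils import DistributedDataParallelKwargs

# let DDP tolerate parameters that do not contribute to the loss
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)

learn = get_learner(get_dls(10))
learn.unfreeze()
with learn.distrib_ctx(kwargs_handlers=[ddp_kwargs]):
    learn.fit_one_cycle(1, slice(0.001, 0.01))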