Weird behaviour with the distributed training tutorial

Hello, I’m currently experimenting with the fastai distributed training implementation and have run into some unexpected behavior. I ran write_basic_config() before launching the script.

I couldn’t find anything similar online, but if someone could help me or point me to a relevant resource I would be more than thankful.

Problems:

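  1. When running the code from the tutorial, I always get a KeyError: 'NO_PROXY' after training has finished. The traceback comes from accelerate launch, not from the training script itself.
    (See "Original Tutorial Code Log" below)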

  2. I also noticed that rank0_first() does not seem to behave as I expected. When I wrap untar_data in a function that also prints a message, the message is printed twice instead of once, so the wrapped function (and presumably untar_data as well) gets called twice.
    (See "Modified untar function Log" below)

Setup:

  • AWS g3.8xlarge Instance with CUDA 11.6 and two Tesla M60 GPUs
  • PyTorch version: 1.13.0+cu116
  • fastai version: 2.7.10
  • accelerate version: 0.14.0

Original Tutorial Code:

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = rank0_first(untar_data, URLs.IMAGEWOOF_320)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy,top_k_accuracy]).to_fp16()
with learn.distrib_ctx():
    learn.fit_flat_cos(2, 1e-3, cbs=MixUp(0.1))

Original Tutorial Code Log:

root@d10ec97eb0d9:~/mAgIcAoI/Api# accelerate launch fastai-tutorial.py 
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
0         2.194511    2.145407    0.247583  0.752672        00:44                                                                                
1         2.037519    1.823475    0.364122  0.847328        00:43                                                                                
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1063, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 642, in multi_gpu_launcher
    console.print_exception(suppress=[__file__], show_locals=False)
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/other.py", line 102, in patch_environment
    del os.environ[key.upper()]
  File "/usr/lib/python3.8/os.py", line 691, in __delitem__
    raise KeyError(key) from None
KeyError: 'NO_PROXY'
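
A possible reading of this traceback, offered as a guess and not a confirmed diagnosis: training completes, and the error only appears while the launcher cleans up environment variables on exit. The sketch below is a simplified stand-in for that cleanup, built around the `del os.environ[key.upper()]` statement shown in the traceback. It is not accelerate's actual code. It assumes the patched environment contains the same variable in both lowercase and uppercase form; the names `sketch_no_proxy` and `SKETCH_NO_PROXY` are hypothetical, chosen so the example does not touch real proxy settings.

```python
import os
from contextlib import contextmanager


@contextmanager
def patch_environment_sketch(**kwargs):
    """Simplified stand-in for an env-patching context manager (hypothetical)."""
    for key, value in kwargs.items():
        os.environ[key.upper()] = str(value)
    try:
        yield
    finally:
        for key in kwargs:
            # Same statement as in the traceback: fails if the key is already gone.
            del os.environ[key.upper()]


@contextmanager
def tolerant_patch_environment_sketch(**kwargs):
    """Variant whose cleanup ignores keys that were already removed."""
    for key, value in kwargs.items():
        os.environ[key.upper()] = str(value)
    try:
        yield
    finally:
        for key in kwargs:
            os.environ.pop(key.upper(), None)


def exit_error(patcher):
    """Return the key named by the KeyError raised on exit, or None if exit is clean."""
    # Assumption: both spellings are present, so both map to one uppercase key.
    env = {"sketch_no_proxy": "localhost", "SKETCH_NO_PROXY": "localhost"}
    try:
        with patcher(**env):
            pass
    except KeyError as err:
        return err.args[0]
    return None


print(exit_error(patch_environment_sketch))
print(exit_error(tolerant_patch_environment_sketch))
```

If this is what happens here, the second `del` for the same uppercase key is what raises, and checking whether both `no_proxy` and `NO_PROXY` are set in the container would be a quick way to confirm or rule it out.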

Modified untar function:

def test():
    print("downloading")
    return untar_data(URLs.IMAGEWOOF_320)

path = rank0_first(test)

Modified untar function Log:

root@d10ec97eb0d9:~/mAgIcAoI/Api# accelerate launch fastai-tutorial.py 
downloading
downloading
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
0         2.205160    2.120142    0.249873  0.773282        00:44                                                                                
1         2.034135    1.860407    0.331552  0.832316        00:43                                                                                
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1063, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 642, in multi_gpu_launcher
    console.print_exception(suppress=[__file__], show_locals=False)
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/other.py", line 102, in patch_environment
    del os.environ[key.upper()]
  File "/usr/lib/python3.8/os.py", line 691, in __delitem__
    raise KeyError(key) from None
KeyError: 'NO_PROXY'
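
On the two "downloading" lines: as far as I understand the fastai docs, rank0_first runs the function in the rank 0 process first and then in the remaining ranks, so one call per process may be the intended behavior with two GPUs. The sketch below only illustrates that pattern in plain Python, with threads standing in for processes. `rank0_first_sketch`, `simulate`, and `fake_download` are hypothetical names, not fastai's implementation.

```python
import threading


def rank0_first_sketch(func, rank, barrier):
    """Call func on rank 0 first, then on every other rank (illustrative only)."""
    if rank == 0:
        result = func()
    barrier.wait()  # all ranks wait here until rank 0 has finished
    if rank != 0:
        result = func()
    return result


def simulate(world_size=2):
    """Run one thread per simulated rank and record which ranks called func."""
    calls = []
    lock = threading.Lock()
    barrier = threading.Barrier(world_size)

    def worker(rank):
        def fake_download():
            # Stands in for the wrapped untar_data call plus its print statement.
            with lock:
                calls.append(rank)
            return "path"

        rank0_first_sketch(fake_download, rank, barrier)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return calls


print(simulate(2))
```

Under this pattern the function runs once per rank, so the expensive download happens on rank 0 while the other rank finds the data already in place and returns quickly.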

Could you open an issue for this on the accelerate repo? It’s the first time I’ve seen a problem like that.