Hello, I’m currently experimenting with the fastai distributed training implementation and have run into some unexpected behavior. I ran write_basic_config() before executing the script.
I couldn’t find anything similar online, but if someone could help me or point me to a relevant resource I would be more than thankful.
Problems:
- When running the code from the tutorial, I always get a KeyError after training has finished. (See 1. Log)
- I also noticed that rank0_first() does not seem to behave as I expected. When I wrap untar_data in a function that contains a print statement, the print statement (and presumably untar_data as well) is called twice instead of once. (See 2. Log)
Setup:
- AWS g3.8xlarge Instance with CUDA 11.6 and two Tesla M60 GPUs
- PyTorch version: 1.13.0+cu116
- fastai version: 2.7.10
- accelerate version: 0.14.0
Original Tutorial Code:
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *
path = rank0_first(untar_data, URLs.IMAGEWOOF_320)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)
learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy,top_k_accuracy]).to_fp16()
with learn.distrib_ctx():
    learn.fit_flat_cos(2, 1e-3, cbs=MixUp(0.1))
Original Tutorial Code Log:
root@d10ec97eb0d9:~/mAgIcAoI/Api# accelerate launch fastai-tutorial.py
epoch train_loss valid_loss accuracy top_k_accuracy time
0 2.194511 2.145407 0.247583 0.752672 00:44
1 2.037519 1.823475 0.364122 0.847328 00:43
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1063, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 642, in multi_gpu_launcher
console.print_exception(suppress=[__file__], show_locals=False)
File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
next(self.gen)
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/other.py", line 102, in patch_environment
del os.environ[key.upper()]
File "/usr/lib/python3.8/os.py", line 691, in __delitem__
raise KeyError(key) from None
KeyError: 'NO_PROXY'
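To make the traceback easier to discuss, here is a minimal sketch of the pattern it points at: a context manager that sets upper-cased environment variables and removes them with del on exit. This is not accelerate's actual code. The names strict_patch, tolerant_patch and cleanup_fails and the variable DEMO_NO_PROXY are hypothetical. It assumes (unconfirmed) that two keys differing only by case, such as no_proxy and NO_PROXY, are passed in together, so the second del finds the variable already gone.

```python
import os
from contextlib import contextmanager


@contextmanager
def strict_patch(**kwargs):
    # Set upper-cased variables for the duration of the block.
    for key, value in kwargs.items():
        os.environ[key.upper()] = str(value)
    try:
        yield
    finally:
        for key in kwargs:
            # Raises KeyError if the variable was already removed.
            del os.environ[key.upper()]


@contextmanager
def tolerant_patch(**kwargs):
    # Same as above, but cleanup ignores variables that are already gone.
    for key, value in kwargs.items():
        os.environ[key.upper()] = str(value)
    try:
        yield
    finally:
        for key in kwargs:
            os.environ.pop(key.upper(), None)


def cleanup_fails(patch):
    # Two keys that differ only by case collapse onto one variable.
    colliding = {"demo_no_proxy": "localhost", "DEMO_NO_PROXY": "localhost"}
    try:
        with patch(**colliding):
            pass
    except KeyError:
        return True
    finally:
        # Leave the real environment untouched after the demo.
        os.environ.pop("DEMO_NO_PROXY", None)
    return False


if __name__ == "__main__":
    print("strict cleanup fails:", cleanup_fails(strict_patch))
    print("tolerant cleanup fails:", cleanup_fails(tolerant_patch))
```

If this is what happens here, it would fit the error appearing only after training has finished, since the cleanup runs when the launcher exits. Checking whether both no_proxy and NO_PROXY are set inside the container would be a quick way to test the assumption.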
Modified untar function:
def test():
    print("downloading")
    return untar_data(URLs.IMAGEWOOF_320)
path = rank0_first(test)
Modified untar function Log:
root@d10ec97eb0d9:~/mAgIcAoI/Api# accelerate launch fastai-tutorial.py
downloading
downloading
epoch train_loss valid_loss accuracy top_k_accuracy time
0 2.205160 2.120142 0.249873 0.773282 00:44
1 2.034135 1.860407 0.331552 0.832316 00:43
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1063, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 642, in multi_gpu_launcher
console.print_exception(suppress=[__file__], show_locals=False)
File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
next(self.gen)
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/other.py", line 102, in patch_environment
del os.environ[key.upper()]
File "/usr/lib/python3.8/os.py", line 691, in __delitem__
raise KeyError(key) from None
KeyError: 'NO_PROXY'
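On the second problem: the two "downloading" lines would be expected if rank0_first runs the function on rank 0 first and then on every other rank, rather than on rank 0 only. I have not confirmed this against the fastai source, so treat it as an assumption. The sketch below only illustrates that ordering, with threads standing in for the two processes. run_rank0_first and its parameters are hypothetical names, not fastai API.

```python
import threading


def run_rank0_first(func, world_size=2):
    """Call func on simulated rank 0 first, then on all other ranks."""
    barrier = threading.Barrier(world_size)
    order = []  # ranks in the order they finished calling func
    lock = threading.Lock()

    def worker(rank):
        if rank == 0:
            func()
            with lock:
                order.append(rank)
        # Other ranks wait here until rank 0 has finished its call.
        barrier.wait()
        if rank != 0:
            func()
            with lock:
                order.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    def fake_download():
        print("downloading")

    print(run_rank0_first(fake_download))
```

With two simulated ranks, the function is called twice in total, once per rank, with rank 0 always going first. Under this reading, the second call to untar_data would find the data already downloaded and return the cached path.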