Unable to reproduce DAWNBench ImageNet Results (April)


I was hoping to reproduce fast.ai’s DAWNBench ImageNet results. I first wanted to try reproducing the results from April (https://www.fast.ai/2018/04/30/dawnbench-fastai/), because that entry uses a single 8-GPU instance. Unfortunately, I was not able to do so.

I started from scratch on a box running Ubuntu Linux. It has 8 V100 GPUs, plenty of memory, CPUs, and disk space, and nothing else running on it. After installing CUDA+cuDNN and Anaconda, I built PyTorch and Torchvision from source at the commit hashes given in https://github.com/stanford-futuredata/dawn-bench-entries/blob/master/ImageNet/train/fastai_pytorch.json. I then cloned the imagenet-fast repository at the pinned commit: https://github.com/fastai/imagenet-fast/blob/c4b225555e333a1a2702d2b291b5082bfa6d6a0a/imagenet_nv/main.py. The submission does not specify whether to run train_imagenet_nv.py or main.py, so I tried both.

When I ran python -m multiproc main.py -a resnet50 --lr 0.40 --epochs 45 --small [this is the set of arguments specified in the above JSON file], I got the error "ValueError: Error initializing torch.distributed using file:// rendezvous: path missing". I tried passing a different value to --dist-url (such as file:///sync.file), but then I got "ValueError: Error initializing torch.distributed using file:// rendezvous: rank parameter missing". The error is raised in torch/distributed/rendezvous.py; I can provide the full stack trace if needed.
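For what it's worth, both error messages match what torch.distributed's file:// rendezvous requires: an absolute filesystem path plus an explicit rank and world_size. A minimal sketch of a working initialization (single process with the gloo backend just to illustrate; the real run would use nccl with world_size=8, and the sync file path here is an arbitrary example):

```python
import os
import tempfile
import torch.distributed as dist

# file:// rendezvous needs an absolute path on a filesystem visible to all
# processes, plus an explicit rank/world_size -- omitting the path triggers
# "path missing", and omitting the rank triggers "rank parameter missing".
sync_file = os.path.join(tempfile.mkdtemp(), "sync.file")
dist.init_process_group(
    backend="gloo",                     # would be "nccl" for the 8-GPU run
    init_method="file://" + sync_file,  # must be an absolute path
    rank=0,                             # this process's rank
    world_size=1,                       # total number of processes
)
rank = dist.get_rank()
print(rank)  # -> 0
dist.destroy_process_group()
```

So my guess is the launcher is supposed to supply a rank per process (the multiproc wrapper presumably does this somehow), and the default --dist-url alone is not enough.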

Running train_imagenet_nv.py required the fastai library, so I checked out the fastai repository's master branch as of April 21, 2018 (the submission does not specify which version of fastai was used, so I picked this commit because it is from the same day as the timestamp of the commit submitted to DAWNBench) and installed that. I then tried the suggested command from the imagenet_nv/README, which is python fastai_imagenet.py $IMAGENET_DIR -a resnet18 --save-dir $SAVE_DIR [of course, replacing the $DIR placeholders with the appropriate directories]. Again I got the same errors from torch.distributed.

Since I built PyTorch from source at the specified commit, this should not be a versioning issue. There are no instructions for what to pass to --dist-url, so I assumed the default would be correct; perhaps passing something else to this argument would just work, but I am not sure what value is expected.
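In case it helps anyone diagnose this: assuming main.py passes --dist-url straight into init_process_group (as the stock PyTorch ImageNet example does; I have not verified this), a tcp:// URL would sidestep the file:// path/rank parsing entirely. A single-process gloo sketch of what that init looks like (the address and port are example values, not anything from the repo):

```python
import torch.distributed as dist

# Assumption: --dist-url is forwarded unchanged to init_process_group.
# A tcp:// rendezvous needs a master address:port reachable by every
# process, plus an explicit rank/world_size, but no shared filesystem.
dist.init_process_group(
    backend="gloo",                       # would be "nccl" for GPU training
    init_method="tcp://127.0.0.1:29500",  # master address:port (example)
    rank=0,                               # this process's rank
    world_size=1,                         # total number of processes
)
rank = dist.get_rank()
print(rank)  # -> 0
dist.destroy_process_group()
```

But even if that works, I would still need to know what the original submission actually passed.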

I am not particularly wedded to reproducing this specific result - I would be open to a newer version of fast ImageNet code. However, I want to use only one machine with 8 V100 GPUs (so, not the more recent DAWNBench submission, which uses 16 of them), and I would prefer that most of the main modifications be of the efficiency variety (as most of the ones from the April result are) rather than things like data augmentation, changing the model, or cycling learning rates (I certainly have no issues with that stuff, but I have specific reasons for not wanting to use it right now).

This post from @jeremy (New coordinate transforms pipeline) suggests that ImageNet can be trained in 2.5 hours or less, presumably still on a single box, with some improvements over the April DAWNBench result. I would love to know more about this result, or just how to reproduce the DAWNBench result itself (or anything similar with working code that can get me to train ImageNet in < 3 hours on a single p3.16xlarge).

Thank you!