Imagenet training project discussion

We’ve got a bit of a project going on with the SF study group, which I figured I’d post about here so that anyone else who is interested can see what’s happening and get involved.

The goal of the project is to train a model on Imagenet to 93% accuracy as quickly as possible. If you don’t have a bunch of fast GPUs or AWS credits lying around, you can still participate by trying to reach 84% accuracy on Imagenet with 128x128 images as quickly as possible; insights there are likely to be transferable.

Here are some ideas we’re working on:

  • Use half precision float (only helps on the Volta architecture, e.g. AWS P3)
  • Multi-GPU training with NVIDIA’s NCCL library or with PyTorch’s nn.DataParallel
  • Use Smith’s new 1cycle and cyclical momentum (in fastai now as use_clr_beta)
  • Better data augmentation (see separate GoogleNet augmentation project thread)
  • TTA every n epochs
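The 1cycle idea in the list above can be sketched as a per-iteration schedule for learning rate and momentum. This is a simplified version under assumed defaults (lr starting at lr_max/10, a symmetric linear up/down ramp, momentum moving between 0.95 and 0.85) and it omits the final annihilation phase; it is not fastai’s use_clr_beta implementation, and one_cycle is a hypothetical name.

```python
def one_cycle(n_iter, lr_max, div=10.0, pct_up=0.5, mom_max=0.95, mom_min=0.85):
    """Return (lrs, moms) lists for a simplified 1cycle schedule.

    Assumed shape: lr rises linearly from lr_max/div to lr_max over the first
    pct_up of training, then falls linearly back towards lr_max/div.
    Momentum moves the opposite way (high when lr is low, and vice versa).
    """
    n_up = max(1, int(n_iter * pct_up))
    n_down = max(1, n_iter - n_up)
    lr_min = lr_max / div
    lrs, moms = [], []
    for i in range(n_iter):
        if i < n_up:
            t = i / n_up                   # 0 -> 1 while lr ramps up
        else:
            t = 1 - (i - n_up) / n_down    # 1 -> ~0 while lr ramps down
        lrs.append(lr_min + t * (lr_max - lr_min))
        moms.append(mom_max - t * (mom_max - mom_min))
    return lrs, moms


lrs, moms = one_cycle(100, lr_max=1.0)
print(lrs[0], max(lrs), moms[0], min(moms))
```

Cycling momentum inversely to the learning rate is the “cyclical momentum” part: when the learning rate is large, a lower momentum keeps the effective step size in check.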

And some experiments we plan to run:

  • concat pooling
  • Larger bs / lr
  • Other architectures: dual path net, xception, inception4, inception resnet, yolov3 backbone
  • sz 128->224->288
  • stochastic weight averaging
  • adam (mom,0.9/0.99) / use_wd_sched
  • snapshot ensembling
  • turn off wd / aug for last few epochs
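Of the experiments listed above, stochastic weight averaging is the easiest to sketch: keep an equal-weight running mean of the weights captured at the end of each cycle, w_avg ← (w_avg·n + w)/(n + 1). A minimal pure-Python sketch, with the weights flattened into plain lists and swa_update as a hypothetical helper:

```python
def swa_update(swa_weights, new_weights, n_averaged):
    """Fold one more weight snapshot into an equal-weight running average.

    swa_weights: current average (list of floats), or None before the first snapshot
    new_weights: weights captured at the end of a cycle/epoch
    n_averaged:  number of snapshots already folded into swa_weights
    """
    if swa_weights is None:
        return list(new_weights), 1
    avg = [(w_avg * n_averaged + w) / (n_averaged + 1)
           for w_avg, w in zip(swa_weights, new_weights)]
    return avg, n_averaged + 1


avg, n = None, 0
for snapshot in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    avg, n = swa_update(avg, snapshot, n)
print(avg, n)  # [3.0, 4.0] 3
```

With a real network the averaged weights need their batchnorm statistics recomputed with a pass over the training data before evaluation; recent PyTorch versions include torch.optim.swa_utils (AveragedModel, update_bn) for this.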

Let me know if anyone’s interested in more info about any of these, or has any ideas or wants to try anything themselves.


I have a spare 1080 Ti in my practicum server. There are two in total; I am using one for our own experiments, but on the other I can definitely leave a script running for ImageNet!

Are we going to build the ideas mentioned above on top of a readily available script, or from scratch?

Sounds very exciting, thanks for sharing!


The idea is to build anything that doesn’t already exist in fastai into the library, and then have a script for training flexibly.

There seem to be multiple releases of the ImageNet dataset.
Is the “ImageNet Fall 2011 release” the right dataset for the object classification benchmark?

Does anyone know the ballpark training time for ImageNet on a single 1080 Ti?

I also have an idea I want to try.
During the Kaggle CDiscount challenge (which also had a large dataset: 15 million images at 180x180 resolution), I tried the following method to speed up training:

1. Train normally for a few epochs.
2. Freeze the first one (or two) layer groups and train.
       - In my case, I could double the batch size, and each epoch took half as long.
3. Unfreeze all layers from time to time.
        (Or might it be better to accumulate the gradients flowing back from the unfrozen layers and apply them to the frozen layers every so often?)
4. Repeat steps 2-3.

The idea is that the first and second layer groups take up a lot of memory and computation, but their weights change very little after a few epochs (especially when starting from pretrained weights), so they don’t need to be updated on every batch.
Since I kept changing training parameters during the competition, I don’t know whether this actually helped in terms of training time or accuracy.
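The train/freeze/unfreeze loop above amounts to a per-epoch plan. A sketch under an assumed policy (full training for a couple of warm-up epochs, then the early layer groups frozen except for a periodic full update); freeze_schedule and its parameters are hypothetical, and the gradient-accumulation variant is not shown:

```python
def freeze_schedule(n_epochs, warmup_epochs=2, unfreeze_every=3):
    """Return 'frozen' / 'unfrozen' for each epoch.

    Assumed policy: train everything for warmup_epochs (step 1), then keep the
    early layer groups frozen (step 2), unfreezing everything on every
    unfreeze_every-th epoch after warm-up (step 3).
    """
    plan = []
    for epoch in range(n_epochs):
        if epoch < warmup_epochs:
            plan.append("unfrozen")   # step 1: normal training
        elif (epoch - warmup_epochs + 1) % unfreeze_every == 0:
            plan.append("unfrozen")   # step 3: periodic full update
        else:
            plan.append("frozen")     # step 2: early groups frozen
    return plan


# e.g. in fastai v1 one could map 'frozen' -> learn.freeze_to(1) (or 2)
# and 'unfrozen' -> learn.unfreeze() before fitting each epoch
print(freeze_schedule(8))
```

Batch size changes between frozen and unfrozen epochs would need a separate DataBunch per phase; that part is left out here.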

I would like to test this approach together with the other features as time allows.
Actually, I don’t have much background in ML/DL, so let me know if this approach doesn’t make much sense!

BTW, the “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour” paper from FB mentions a ‘gradual warmup’ of the learning rate. Interestingly, it looks like CLR was already doing something similar.
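For reference, that paper pairs gradual warmup with a linear scaling rule: the target learning rate is 0.1 x batch_size / 256, reached by ramping up linearly over the first 5 epochs. A sketch of just the warmup piece; the starting value (start_lr=0.1 here) is an assumption, the later step decays are left out, and warmup_lr is a hypothetical name:

```python
def warmup_lr(iteration, iters_per_epoch, batch_size,
              base_lr=0.1, base_batch=256, warmup_epochs=5, start_lr=0.1):
    """Learning rate under the linear scaling rule with linear warmup.

    start_lr is an assumed starting value; the ramp goes linearly from
    start_lr to the scaled target over warmup_epochs.
    """
    target_lr = base_lr * batch_size / base_batch   # linear scaling rule
    warmup_iters = warmup_epochs * iters_per_epoch
    if iteration >= warmup_iters:
        return target_lr                            # post-warmup decay not shown
    frac = iteration / warmup_iters                 # 0 -> 1 over the warmup
    return start_lr + frac * (target_lr - start_lr)


# batch size 8192 -> target lr 3.2, reached after 5 epochs of 100 iterations
for it in (0, 250, 500):
    print(it, warmup_lr(it, iters_per_epoch=100, batch_size=8192))
```

The resemblance to CLR is the increasing-lr phase at the start; the difference is that warmup ramps once and then hands over to a decay schedule rather than cycling.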

(Edit: I used Xception in the competition since it took almost half as long to train as inception4 or inception resnet, though its accuracy was slightly lower. It may just be because I used a bad implementation, but DPN took forever to train.)

In terms of architecture experiments, what are your thoughts on adding an object segmentation feature to fastai? It would involve the Mask R-CNN model. That wouldn’t actually be trained on ImageNet data (which is not the theme of this thread, sorry to deviate), but on COCO.

I guess fastai only includes localization using bounding boxes for now, right? Btw, it was Kerem’s idea; we were discussing this yesterday.

It would be great. If you do go down this path, please do create a new thread for it and at-mention me.

Sure, thanks!

Was that with PyTorch? I’m finding DPN slow. Inception-resnet-2 is looking hopeful. Planning to try Xception next. I’m planning to create a preact-resnet and a yolov3 backbone for PyTorch to try; I’ll create a new thread for that soon in case someone else wants to have a go at it.

I’m getting about 12 hours for rn50 on 8 Volta GPUs with half precision. I believe Pascal is 3x slower, so that sounds like maybe 12 days on a single 1080 Ti. With 1cycle learning you might be able to make it 5x faster.
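The back-of-the-envelope estimate above works out like this, assuming training time scales perfectly linearly with GPU count (multi-GPU scaling is rarely perfect in practice, so treat it as a rough estimate); estimate_days is a hypothetical helper:

```python
def estimate_days(hours_on_reference, n_gpus, slowdown, speedup=1.0):
    """Rough single-GPU training time in days, assuming perfect linear scaling."""
    gpu_hours = hours_on_reference * n_gpus             # 12 h x 8 GPUs = 96 GPU-hours
    single_gpu_hours = gpu_hours * slowdown / speedup   # slower GPU, optional faster schedule
    return single_gpu_hours / 24


print(estimate_days(12, 8, slowdown=3))             # 12.0 days on one Pascal GPU
print(estimate_days(12, 8, slowdown=3, speedup=5))  # about 2.4 days with a 5x speedup
```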

No, that’s the full set. You want the competition subset of 1000 categories, which is called ILSVRC2017 (for classification it’s been the same since 2012 AFAIK, so you can also look for ILSVRC2012).

It was Keras.

It seems to take way longer than I expected.
Anyway, I’ll give it a shot.

The actual competition seems to be classification + localization.
Does ‘top-5’ accuracy in papers usually mean ‘classification’ accuracy or ‘classification + localization’ accuracy?

And the goal of ‘93%’ accuracy in this project means top-5 classification accuracy, right?

There are different competitions. One of them is classification only.
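For the classification-only metric, a top-5 prediction counts as correct if the true class is among the five highest-scoring classes. A pure-Python sketch of top-k accuracy (top_k_accuracy is a hypothetical helper; in PyTorch the same thing is usually done with torch.topk on a batch of scores):

```python
def top_k_accuracy(scores, targets, k=5):
    """Fraction of samples whose true class is among the k highest scores.

    scores:  list of per-class score lists, one per sample
    targets: list of true class indices
    """
    hits = 0
    for row, target in zip(scores, targets):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


scores = [[0.1, 0.6, 0.3], [0.7, 0.2, 0.1]]
targets = [2, 0]
print(top_k_accuracy(scores, targets, k=1))  # 0.5
print(top_k_accuracy(scores, targets, k=2))  # 1.0
```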


Hi @jeremy,

Congratulations on the amazing results!
It’s very exciting!!!
I have been reading the source code you published on github. Thanks a lot!
I have a couple of questions at the moment:

  • Have other architectures been trained as well, such as darknet53, dpn and so on?
  • What about their accuracy and training time?
  • Do you plan to share the pre-trained models?

Thank you so much!

I hope it’s OK to bump this topic, as I want to keep discussing ImageNet training (which I am working on right now).

I’ll write down some of my notes here, maybe they can help others:

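from fastai.vision import *   # fastai v1
# assumes `path` (a str) and `val_df` are set up as in get_data() below
# have to create and label valid and train separately because the methods are different
valid = ImageList.from_df(val_df, path=path+'/val', suffix='.JPEG').split_none().label_from_df()
train = ImageList.from_folder(path+'/train').split_none().label_from_folder()
# combine: use the labeled validation items as train's validation set
train.valid = valid.train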

That said, I wouldn’t use the above for ImageNet, since from_folder doesn’t seem well suited to it.

  • This is what I have so far for get_data(), using the folder structure and CSV files provided by Kaggle:

import pandas as pd
from fastai.vision import *   # fastai v1: ImageList, ItemLists, imagenet_stats

def get_data(path, size, bs, workers):

    # (train_tfms, valid_tfms): no extra fastai transforms yet;
    # augmentation comes from the random-resized crops in presize below
    tfms = ([], [])

    # only keep the first word of the second column (the synset id) for labels
    val_df = pd.read_csv('LOC_val_solution.csv', header='infer')
    val_df.PredictionString = val_df.PredictionString.str.split().str.get(0)

    train_df = pd.read_csv('LOC_train_solution.csv', header='infer')
    train_df.PredictionString = train_df.PredictionString.str.split().str.get(0)
    # add label as folder name in training data
    train_df.ImageId = train_df.PredictionString+'/'+train_df.ImageId

    valid = ImageList.from_df(val_df, path=path+'/val', suffix='.JPEG')
    train = ImageList.from_df(train_df, path=path+'/train', suffix='.JPEG')

    # label both splits from the PredictionString column, then resize/crop to `size`
    lls = ItemLists(path, train, valid).label_from_df().transform(
            tfms, size=size).presize(size, scale=(0.25, 1.0))

    return lls.databunch(bs=bs, num_workers=workers).normalize(imagenet_stats)
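To see what the two pandas steps in get_data() do, here they are applied to a few made-up rows. The layout assumed here (an ImageId column plus a PredictionString holding a synset id followed by box coordinates, possibly repeated) is what the str.split().str.get(0) trick implies; the IDs and numbers are illustrative only:

```python
import pandas as pd

# made-up rows in the assumed LOC_*_solution.csv layout:
# PredictionString = "<synset> x_min y_min x_max y_max [<synset> ...]"
val_df = pd.DataFrame({
    "ImageId": ["ILSVRC2012_val_00000001", "ILSVRC2012_val_00000002"],
    "PredictionString": ["n02119789 10 20 30 40 n02119789 50 60 70 80",
                         "n03788195 1 2 3 4"],
})
train_df = pd.DataFrame({
    "ImageId": ["n01440764_10026"],
    "PredictionString": ["n01440764 5 6 7 8"],
})

# same steps as get_data(): keep only the first token (the synset id) as the label
for df in (val_df, train_df):
    df.PredictionString = df.PredictionString.str.split().str.get(0)

# training images sit in one folder per synset, so prefix the folder name
train_df.ImageId = train_df.PredictionString + "/" + train_df.ImageId

print(val_df.PredictionString.tolist())  # ['n02119789', 'n03788195']
print(train_df.ImageId.tolist())         # ['n01440764/n01440764_10026']
```

Validation images have no per-class folders, which is why only the training IDs get the synset prefix.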

Question: I first ran a single epoch and it took 27 minutes on 8*V100.
Then the epochs started taking 90 seconds. Is that drastic change likely due to cuDNN benchmarking, or was there something else going on with my Salamander instance?

Edit: that was with smaller 128px images; 224px images take about 3 minutes per epoch.

Edit2: My guess is this wasn’t benchmarking. When I run multiple epochs, the first one is a few seconds slower, and I think that would be benchmarking.

One weird tip: when running fastai distributed through fastai.launch, I sometimes stop whatever training I’m running in the middle of it (CTRL+Z). This then causes a “RuntimeError: Address already in use” when I try to run things again.

What works is closing my console (I use Cmder), logging back in through SSH, and trying again… If there’s a better way, let me know!

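Edit: If you press CTRL+C instead, you can interrupt a training run without the problem above. (CTRL+Z only suspends the process rather than killing it, so it keeps holding the port.)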
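Before relaunching, you can check whether the port is still held by trying to bind it. A sketch: port 29500 is assumed here because it is PyTorch’s usual default master port for distributed launches (check what your fastai.launch version actually uses), and port_is_free is a hypothetical helper:

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:   # e.g. "Address already in use"
            return False
    return True


if __name__ == "__main__":
    # 29500: assumed default master port for the distributed launcher
    print(port_is_free(29500))
```

If it prints False after a CTRL+Z, the suspended job is still around and needs to be resumed and stopped, or killed, before the next launch.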