Research collaboration opportunity with Leslie Smith

Cheers! I thought something was a bit wrong with it! I will change my code later and retest it!

I’ve read the wide resnet paper following your question. I am actually not sure why the activation is multiplied by 0.2 before the addition in BasicBlock (which is a full pre-activation res block). That might just be a number that worked well in this case, I’m not sure :smile: Maybe we can wait for the DAWNBench team to reply; I am also curious.

As I’ve played around quite extensively with this wideresnet, it seems that 0.2 is a pretty good choice :slight_smile: I tried without it and with various other constants, but none matched the performance of 0.2.


So it acts as a weighted sum. I wonder what happens if we make it a learnable parameter for general resnets :thinking:


Partial sampling in training for ImageNet amongst other cool stuff.

Thanks to @jeremy for tweeting that. Are they also using an idea similar to Jeremy’s: making disjoint class groups?

Could you write me a short sample of how it could be done in pytorch, just a line or 2 of code? I could test it right away :slight_smile:
I am still learning the pytorch way of deep learning…

I am not sure, but this might work; I haven’t tested it:

The idea is to define the weight as a learnable variable by setting requires_grad=True during initialization. Autograd and the optimizer should then do the rest, since the weight becomes part of the computational graph, I suppose.

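import torch
import torch.nn as nn
import torch.nn.functional as F

# conv_2d, bn_relu_conv and noop are the helpers defined alongside the original BasicBlock
class BasicBlock(nn.Module):
    def __init__(self, ni, nf, stride, drop_p=0.0):
        super().__init__()
        self.bn = nn.BatchNorm2d(ni)
        self.conv1 = conv_2d(ni, nf, 3, stride)
        self.conv2 = bn_relu_conv(nf, nf, 3, 1)
        self.drop = nn.Dropout(drop_p, inplace=True) if drop_p else None
        self.shortcut = conv_2d(ni, nf, 1, stride) if ni != nf else noop
        # nn.Parameter (requires_grad=True by default) registers the weight
        # with the module, so the optimizer can actually update it
        self.weight = nn.Parameter(torch.FloatTensor(1).uniform_(0, 1))

    def forward(self, x):
        x2 = F.relu(self.bn(x), inplace=True)
        r = self.shortcut(x2)
        x = self.conv1(x2)
        if self.drop: x = self.drop(x)
        # keep the learnable weight inside [0, 1]
        self.weight.data.clamp_(0, 1)
        x = self.conv2(x) * self.weight  # was a fixed 0.2
        return x.add_(r)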
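For anyone who wants to try the learnable-scale idea without the fastai helpers, here is a minimal self-contained sketch. It is an assumption-heavy stand-in, not the DAWNBench model: `ScaledResBlock`, `init_scale` and `clamp_scale` are hypothetical names, and plain `nn.Conv2d` layers replace `conv_2d` / `bn_relu_conv`. The point it shows is that the scale is wrapped in `nn.Parameter`, so `parameters()` returns it and the optimizer can update it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledResBlock(nn.Module):
    """Pre-activation residual block with a learnable scale on the residual branch.

    Hypothetical stand-in for the thread's BasicBlock: plain nn.Conv2d layers
    are used instead of the fastai helpers, which are not shown in the thread.
    """

    def __init__(self, ni, nf, stride=1, init_scale=0.2):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(ni)
        self.conv1 = nn.Conv2d(ni, nf, 3, stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(nf)
        self.conv2 = nn.Conv2d(nf, nf, 3, 1, padding=1, bias=False)
        # Use a 1x1 conv on the shortcut only when the output shape changes.
        if ni != nf or stride != 1:
            self.shortcut = nn.Conv2d(ni, nf, 1, stride, bias=False)
        else:
            self.shortcut = nn.Identity()
        # nn.Parameter registers the scale, so parameters() returns it.
        self.scale = nn.Parameter(torch.tensor([init_scale]))

    def forward(self, x):
        x2 = F.relu(self.bn1(x))
        r = self.shortcut(x2)
        out = self.conv1(x2)
        out = self.conv2(F.relu(self.bn2(out)))
        return out * self.scale + r

    def clamp_scale(self):
        # Optional constraint: keep the scale in [0, 1] after each optimizer step.
        with torch.no_grad():
            self.scale.clamp_(0.0, 1.0)


block = ScaledResBlock(16, 32, stride=2)
opt = torch.optim.SGD(block.parameters(), lr=0.1)
out = block(torch.randn(4, 16, 32, 32))
out.sum().backward()
opt.step()
block.clamp_scale()
```

Starting the scale at 0.2 instead of a random value in [0, 1] is a design choice made here so that training begins from the constant that worked well in the experiments above; whether learning it helps is exactly the open question.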
  1. I am using the same architecture @deanmark used (i.e. Resnet2([16, 32, 64, 128, 256], 10, 0.2)).
  2. Sorry, I found a bug when plotting the results. Here is the updated version of the plots:
  3. LR: I used an LR varying from 1e-2 to 1e-3, I guess.
  4. I am using the entire test set as the validation set.
    The notebook is available: Link

That’s true! The fastai DAWN benchmark uses progressive resizing, I believe. I am not sure how much it contributed to the overall training process. I will try to set up the same experiments with Imagenet and check it out.
P.S.: I find that training times increase with the resizing approach. It takes almost 12 minutes on p2.xlarge (AWS) for the normal training procedure and almost 3x that, i.e. 35-36 minutes, for the resizing approaches.
What could be the reason for the increase in time?

  • A bottleneck between CPU and GPU?
  • Under-utilization of the GPU in the earlier stages due to the small batch size? If that’s the case, then adaptive batch sizes should train faster (they actually do, but the gain is negligible, say 2-3 minutes).


  1. I don’t understand what Resnet2([16, 32, 64, 128, 256], 10, 0.2) represents. It is a resnet but with how many layers?
  3. I’m surprised by the low learning rates, but they do explain why I’m not seeing the characteristic 1cycle shape. For resnet and 1cycle, I’d expect the LR to go from 0.1 up to 3 and then down to 1e-3. Also, I’d expect a final accuracy near 90% for the baseline, not 80%.
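To make the expected 1cycle shape concrete, here is a rough sketch of such a schedule in plain Python. Only the three learning rates (0.1, 3, 1e-3) come from the post; the piecewise-linear shape, the function name `one_cycle_lr` and the `pct_final` parameter (the share of steps spent in the final decay) are assumptions for illustration.

```python
def one_cycle_lr(step, total_steps, lr_start=0.1, lr_max=3.0, lr_end=1e-3, pct_final=0.1):
    """Piecewise-linear 1cycle-style learning rate for a 0-based `step`.

    Assumed shape: linear ramp lr_start -> lr_max, linear ramp back to
    lr_start, then a short final phase decaying to lr_end.
    """
    final_steps = int(total_steps * pct_final)
    cycle_steps = total_steps - final_steps
    half = cycle_steps / 2
    if step < half:
        # warm-up half of the cycle
        return lr_start + (lr_max - lr_start) * step / half
    if step < cycle_steps:
        # cool-down half of the cycle
        return lr_max - (lr_max - lr_start) * (step - half) / half
    # final phase: decay from lr_start down to lr_end
    frac = (step - cycle_steps) / max(1, final_steps - 1)
    return lr_start + (lr_end - lr_start) * frac


schedule = [one_cycle_lr(s, total_steps=100) for s in range(100)]
```

Plotting `schedule` should give the triangle followed by a short tail; a run whose LR only moves between 1e-2 and 1e-3 never reaches the large-LR part of that triangle.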

I believe the increase in training time is because you are resizing larger, not smaller. I understand progressive resizing as a coarse to fine approach - start with a quick training with a coarse set of training images, transfer the weights, then train with the original images.

There’s no point increasing the size to be larger than the original image. In this case, cifar10 has 32px images, so that’s the largest you should go. (Also, cifar10 images are too small to be usefully down-sized, so as @Leslie said, you should try this on imagenet or similar instead.)
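A small sketch of the coarse-to-fine idea described above: pick a few training sizes that end at the original resolution and never exceed it. The helper `progressive_sizes` and the example sizes (128 up to 224) are hypothetical, and the relative cost figure assumes that per-image convolution cost grows roughly with the number of pixels.

```python
def progressive_sizes(orig_size, start_size, n_stages):
    """Evenly spaced image sizes from start_size up to orig_size (inclusive)."""
    if start_size > orig_size:
        raise ValueError("start_size must not exceed the original image size")
    if n_stages < 2:
        return [orig_size]
    step = (orig_size - start_size) / (n_stages - 1)
    return [round(start_size + i * step) for i in range(n_stages)]


sizes = progressive_sizes(orig_size=224, start_size=128, n_stages=3)

# Rough per-image cost of each stage relative to full-size training,
# assuming cost scales with pixel count (size squared).
relative_cost = [(s / 224) ** 2 for s in sizes]
```

Under that assumption the early stages are cheaper than full-size training, while any stage above the original size would cost more than 1.0, which is consistent with the slowdown reported when resizing upward.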


That trick is from the inception-v4 paper.


Just w2v, like we did in the DeVISE lesson.


Hi everyone. I made a baseline notebook using MNIST. To Leslie and Jeremy: this doesn’t contain any relevant experiments. But it has served as a good warm up.

I expect to have a notebook of experiments on CIFAR-10 out, if not by Monday, then by the end of the week – and ImageNet later if things look good (or for work on larger images).

Navigation help for anyone looking at the notebook:

  1. Data setup
  2. Architecture setup
  3. Loss Function
  4. Training (with plots)
  5. Testing
  6. More training & testing (with plots)
  7. Closing notes (with table of results)

A couple pictures:

Aside from the research, this has taught me a lot about pytorch and fastai’s internals. Apparently there’s a built-in callback to save the best version of a model, which I look forward to testing out.


Hi Dean,

Can you please share your thoughts on this notebook? I planned it to account for all the possibilities in the approaches shared by Leslie and Jeremy. I will plug it into the notebook shared by Radek to test DAWNBench.

I also have a few questions about the updated approach you shared, but I’ll PM you those questions. Eagerly awaiting your reply.


Hi @PranY,

Dynamic resizing is already built into the fastai library, so it is better to just use that. Look at the “dl1/lesson2-image_models.ipynb” notebook for an example. On top of that you can use my code to select which classes to use. I would use this code to change the data object on the fly:

# build a data object for a given image size and subset of classes
def get_data(sz, classes):
    tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_top_down, max_zoom=1.05)
    return ImageClassifierData.from_csv(PATH, 'train-jpg', label_csv, tfms=tfms,
                    suffix='.jpg', val_idxs=val_idxs, test_name='test-jpg', partial_train_classes=classes)

# swap the learner's data on the fly
cls_list = [0,4,3,5]
learn.set_data(get_data(sz, cls_list))

The partial_train_classes argument is enabled with my code. You also have a max_train_per_class argument to restrict the number of training images per class as per Leslie’s idea. Every time you finish several epochs, you can change the data object to use different sizes and different classes. This approach allows you to use any previous fastai code with minimal change.
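For readers without the patched library, the selection logic behind the two arguments can be sketched in plain Python. This is not the actual implementation and not a fastai API: `select_train_indices`, `max_per_class` and `seed` are hypothetical names, and the labels are toy data.

```python
import random
from collections import defaultdict


def select_train_indices(labels, classes, max_per_class=None, seed=0):
    """Return sorted indices of samples whose label is in `classes`,
    keeping at most `max_per_class` randomly chosen samples per class."""
    rng = random.Random(seed)
    wanted = set(classes)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        if label in wanted:
            by_class[label].append(idx)
    selected = []
    for label in sorted(by_class):
        idxs = by_class[label]
        if max_per_class is not None and len(idxs) > max_per_class:
            idxs = rng.sample(idxs, max_per_class)
        selected.extend(idxs)
    return sorted(selected)


labels = [0, 1, 4, 4, 3, 5, 0, 0, 2, 5]  # toy labels, one per training image
idxs = select_train_indices(labels, classes=[0, 4, 3, 5], max_per_class=2)
```

Calling it again with a different `classes` list (or a different `seed`) after a few epochs gives a new training subset, which mirrors swapping the data object between stages.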


Oh, I really love these ideas. I’m interested. I was looking for a way to train quickly and ended up here. Thanks to all of you, I will try it. :smiley:

Thank you for your interest. The current research thread is “Cyclical Layer Learning Rates (a research question)”. Feel free to participate in that conversation.


I’ve used Jeremy’s idea to train models before, and so have my friends. But I didn’t know that so many people were discussing it here, so I’m late to the thread. I will try your idea next Monday. Going to sleep…


Is Cyclical Layer Learning Rates (a research question) still the current research thread?

It is in the separate topic Cyclical Layer Learning Rates (a research question)