Research collaboration opportunity with Leslie Smith

Cheers! I thought something was a bit wrong with it! I'll change my code later and re-test it!

I’ve read the Wide ResNet paper following your question. I'm actually not sure why the activation is multiplied by 0.2 before the addition in BasicBlock (which is a full pre-activation residual block). It might just be a value that happened to work well in this case, I'm not sure :smile: Maybe we can wait for the DAWNBench team to reply, I'm curious too.

Having played around quite extensively with this WideResNet, 0.2 seems to be a pretty good choice :slight_smile: I tried without it and with various other constants, but none matched the performance of 0.2.

2 Likes

So it acts as a weighted sum. I wonder what happens if we make it a learnable parameter for ResNets in general :thinking:

1 Like

@Leslie
Partial sampling during ImageNet training, amongst other cool stuff.
https://arxiv.org/abs/1805.08249

Thanks to @jeremy for tweeting that. Are they also using a similar idea to Jeremy’s, i.e. making disjoint class groups?

Could you write me a short sample of how this could be done in PyTorch, just a line or two of code? I could test it right away :slight_smile:
I am still learning the PyTorch way of deep learning…

I am not sure, but this might work; I haven't tested it:

The idea is to define the weight as a learnable tensor during initialization (here via nn.Parameter, which has requires_grad=True). Autograd and the optimizer should then do the rest, since it will be part of the computational graph, I suppose.

import torch
import torch.nn as nn
import torch.nn.functional as F

# conv_2d, bn_relu_conv and noop are helpers defined earlier in the WideResNet notebook.
class BasicBlock(nn.Module):
    def __init__(self, ni, nf, stride, drop_p=0.0):
        super().__init__()
        self.bn = nn.BatchNorm2d(ni)
        self.conv1 = conv_2d(ni, nf, 3, stride)
        self.conv2 = bn_relu_conv(nf, nf, 3, 1)
        self.drop = nn.Dropout(drop_p, inplace=True) if drop_p else None
        self.shortcut = conv_2d(ni, nf, 1, stride) if ni != nf else noop
        # Register the residual scaling factor as an nn.Parameter so it shows up in
        # model.parameters() and gets updated by the optimizer (a plain Variable would not).
        self.weight = nn.Parameter(torch.empty(1).uniform_(0, 1))

    def forward(self, x):
        x2 = F.relu(self.bn(x), inplace=True)
        r = self.shortcut(x2)
        x = self.conv1(x2)
        if self.drop: x = self.drop(x)
        # Keep the scaling factor in [0, 1]; do the clamp outside of autograd.
        with torch.no_grad():
            self.weight.clamp_(0, 1)
        x = self.conv2(x) * self.weight  # learnable version of the fixed 0.2
        return x.add_(r)
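
One note on the change above: wrapping the scaling factor in nn.Parameter (rather than a bare Variable with requires_grad=True) registers it with the module, so it is returned by model.parameters(), picked up by the optimizer, and moved by .cuda() automatically. A quick sanity check, assuming the notebook's conv_2d, bn_relu_conv and noop helpers are in scope:

block = BasicBlock(16, 16, 1)
print([name for name, _ in block.named_parameters() if name == 'weight'])  # ['weight'] -> the factor is registered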
2 Likes
  1. I am using the same architecture @deanmark used, i.e. Resnet2([16, 32, 64, 128, 256], 10, 0.2).
  2. Sorry, I found a bug when plotting the results. Here is the updated version of the plots:
  3. LR: I used learning rates varying from 1e-2 to 1e-3, I think.
  4. I am using the entire test set as the validation set.
    The notebook is available here: Link

That's true! The fastai DAWNBench entry uses progressive resizing, I believe. I am not sure how much it contributed to the overall training process; I will try to set up the same experiments with ImageNet and check it out.
P.S.: I find that training times increase with the resizing approach. It takes about 12 minutes on a p2.xlarge (AWS) for the normal training procedure, and almost 3x that, i.e. 35-36 minutes, for the resizing approaches.
What could be the reason for the increase in time?

  • A bottleneck between CPU and GPU?
  • Under-utilization of the GPU in the earlier stages due to the small batch size? If that's the case, an adaptive batch size should train faster (it does, but only negligibly, say 2-3 minutes faster).

~Gokkul

  1. I don’t understand what Resnet2([16, 32, 64, 128, 256], 10, 0.2) represents. It is a ResNet, but with how many layers?
  3. I’m surprised by the low learning rates, but it does explain why I’m not seeing the characteristic 1cycle shape. For a ResNet with 1cycle, I'd expect the LR to go from 0.1 up to 3 and back down to 1e-3 (a rough sketch of that schedule is below). Also, I'd expect a final accuracy near 90% for the baseline, not 80%.
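
Here is a minimal sketch of the LR shape I mean, assuming a linear ramp from 0.1 up to 3 and back over the main cycle, followed by a short annihilation phase down to 1e-3; the 10% annihilation fraction is only illustrative:

def one_cycle_lr(step, total_steps, lr_min=0.1, lr_max=3.0, lr_final=1e-3, annihilation=0.1):
    # Main cycle (ramp up then down), followed by a short final decay to lr_final.
    main = int(total_steps * (1 - annihilation))
    half = max(main // 2, 1)
    if step < half:                                    # linear ramp up: lr_min -> lr_max
        return lr_min + (lr_max - lr_min) * step / half
    if step < main:                                    # linear ramp down: lr_max -> lr_min
        return lr_max - (lr_max - lr_min) * (step - half) / (main - half)
    # annihilation phase: lr_min -> lr_final
    return lr_min - (lr_min - lr_final) * (step - main) / max(total_steps - main, 1)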

I believe the increase in training time is because you are resizing larger, not smaller. I understand progressive resizing as a coarse-to-fine approach: start with a quick training run on a coarse (down-sized) set of training images, transfer the weights, then train with the original images.

There’s no point increasing the size beyond the original image. In this case, CIFAR-10 images are 32px, so that's the largest you should go. (Also, CIFAR-10 images are too small to be usefully down-sized, so as @Leslie said, you should try this on ImageNet or similar instead.)
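
As a rough illustration (a sketch against the old fastai API, assuming a get_data(sz) helper that builds the data object at a given image size, as in lesson2-image_models; the sizes and hyperparameters are placeholders for an ImageNet-scale dataset, not a recipe):

for sz in (64, 128, 224):             # coarse to fine, never above the original image size
    learn.set_data(get_data(sz))      # new data at the new resolution, keep the trained weights
    learn.fit(1e-2, 1, cycle_len=1)   # a short training run at this size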

4 Likes

That trick is from the inception-v4 paper.

3 Likes

Just w2v, like we did in the DeVISE lesson.

2 Likes

Hi everyone. I made a baseline notebook using MNIST. To Leslie and Jeremy: this doesn't contain any relevant experiments, but it has served as a good warm-up.

I expect to have a notebook of experiments on CIFAR-10 out, if not by Monday, then by the end of the week – and ImageNet later if things look good (or for work on larger images).

Navigation help for anyone looking at the notebook:

  1. Data setup
  2. Architecture setup
  3. Loss Function
  4. Training (with plots)
  5. Testing
  6. More training & testing (with plots)
  7. Closing notes (with table of results)

A couple of pictures:

Aside from the research, this has taught me a lot about PyTorch and fastai's internals. Apparently there's a built-in callback to save the best version of a model, which I look forward to testing out.

3 Likes

Hi Dean,

Can you please share your thoughts on this notebook? I planned it to account for all the possibilities in the approaches shared by Leslie and Jeremy. I will plug it into the notebook shared by Radek to test against DAWNBench.

I also have a few questions about the updated approach you shared, but I'll PM you those. Eagerly awaiting your reply.

Thanks

Hi @PranY,

Dynamic resizing is already built into the fastai library, so it's better to just use that. Look at the “dl1/lesson2-image_models.ipynb” notebook for an example. On top of that, you can use my code to select which classes to use. I would use this code to change the data object on the fly:

def get_data(sz, classes):
    tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_top_down, max_zoom=1.05)
    return ImageClassifierData.from_csv(PATH, 'train-jpg', label_csv, tfms=tfms,
                    suffix='.jpg', val_idxs=val_idxs, test_name='test-jpg',
                    partial_train_classes=classes)  # partial_train_classes is added by my code

cls_list = [0, 4, 3, 5]                  # train only on this subset of classes
learn.set_data(get_data(sz, cls_list))   # swap the data object on the fly

The partial_train_classes argument is enabled by my code. There is also a max_train_per_class argument to restrict the number of training images per class, as per Leslie's idea. Every time you finish several epochs, you can change the data object to use different sizes and different classes; see the sketch below. This approach lets you reuse any previous fastai code with minimal changes.
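
For example, a hypothetical training loop built on the get_data above, cycling through disjoint class groups and image sizes every few epochs (the groups, sizes and cycle lengths here are only illustrative):

class_groups = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]   # disjoint class subsets
for sz in (32, 64, 128):                            # progressively larger images
    for classes in class_groups:
        learn.set_data(get_data(sz, classes))       # new size and class subset, same weights
        learn.fit(1e-2, 1, cycle_len=2)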

5 Likes

Oh, I really love these ideas. I'm interested. I was looking for a faster way to train and ended up here. Thanks to all of you, I will give it a try :smiley:

Thank you for your interest. The current research thread is “Cyclical Layer Learning Rates (a research question)”. Feel free to participate in that conversation.

1 Like

I've used Jeremy's idea to train models before, and so have my friends. But I didn't know that so many people were discussing this topic here, so I'm late to the conversation. I will try your idea next Monday. Off to sleep now…

Thanks.

Is Cyclical Layer Learning Rates (a research question) still the current research thread?

It is, in the separate topic Cyclical Layer Learning Rates (a research question).