Research collaboration opportunity with Leslie Smith

@sebastianruder has discussed this paper and various others in an excellent overview post:

I’m not sure that they’re using the best semi-supervised or transfer learning approaches (e.g. the kind of stuff we used in ULMFiT) and I’m also suspicious of “baselines” that use massive compute to optimize hyper-params (since that’s not how we train models in practice).

@jeremy regarding simple English, it seems like the language modeling implementation would be to start with the most frequent tokens and slowly deflate the “unknown” sentinel by extracting increasingly rare types into their own embeddings (which might increase in size during this process). This also presents an interesting opportunity to systematically size the vocabulary, rather than relying on heuristics to decide on a minimum frequency or vocabulary size.
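A minimal sketch of what that could look like in PyTorch, assuming a fixed embedding dimension, a frequency-sorted token list, and an "<unk>" token already in the vocabulary (the function and argument names are illustrative, not from any library):

import torch.nn as nn

def grow_vocab(emb, vocab, token_freqs, n_new):
    # Promote the next n_new most frequent tokens out of <unk> into their own rows.
    # token_freqs: list of (token, count) pairs sorted by descending frequency.
    start = len(vocab)
    vocab.extend(t for t, _ in token_freqs[start:start + n_new])
    old_w = emb.weight.data
    new_emb = nn.Embedding(len(vocab), old_w.size(1))
    new_emb.weight.data[:start] = old_w                         # keep existing rows
    new_emb.weight.data[start:] = old_w[vocab.index('<unk>')]   # warm-start new rows from <unk>
    return new_emb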

Of course, for language models such a vocabulary schedule affects the difficulty of both encoding and decoding. For language classification, it changes encoding difficulty but not decoding. For image processing, progressive resizing is the analogous change in encoding difficulty. For classification tasks in general, it seems like we’re exploring decoding difficulty.


By this, do you mean some sort of label embeddings trained on ImageNet data and made publicly available, or just the usual word2vec embeddings? :slight_smile:

But I believe it’s probably the first one, since it makes more sense for learning something like a t-SNE representation of MNIST. (If my understanding is correct, we can use t-SNE embeddings of MNIST, start with k clusters, and add more classes from each cluster step by step.) For example, you start training with class 7 and then at the next stage add class 1, which shouldn’t be that hard to learn after learning 7. But then, how applicable would this be in real life? How certain can we be of the unsupervised part?

Leslie’s approach seems easier to implement, I suppose, by just writing a custom sampler. Regarding your approach, do we need to dynamically change the model output as we add more classes, e.g. initializing a new logistic unit at the final layer for each newly added class?
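For the custom-sampler route, here is a minimal PyTorch sketch of the "fewer samples first" idea (the fraction schedule and the train_ds name are illustrative assumptions, not fastai code):

import numpy as np
from torch.utils.data import DataLoader, SubsetRandomSampler

n = len(train_ds)                     # train_ds: the full training set (assumed to exist)
for frac in (0.1, 0.25, 0.5, 1.0):    # grow the sampled fraction phase by phase
    idx = np.random.choice(n, int(frac * n), replace=False)
    loader = DataLoader(train_ds, batch_size=256,
                        sampler=SubsetRandomSampler(idx))
    # ... train for a few epochs on `loader` before moving to the next phase ...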

Please correct me if there is a conceptual misunderstanding.

Thanks

I wanted to point out a flaw in my implementation of Jeremy’s approach. I initialize a full network and take out some of the classes, but this turns out to perform poorly. My guess is that when you only show the network negative examples of a certain class, the FC weights of the missing classes end up being pretty bad, and it takes time for the network to correct this once the positive samples are introduced.
A better approach would probably be to go hierarchical and dynamically resize the network accordingly. There are two options here:

  1. Showing the network only certain classes each time. This uses the same classes that are in the labels, but we need to gradually increase the size of the last FC layer by the number of classes currently present. The problem here is that at each step you are introducing uninitialized weights. I assume that in this case you should freeze all of the network, train the uninitialized weights, and then resume training (see the sketch after this list).
  2. Showing the network all of the data, but changing the labels of the data to fit a hierarchy. This is what @gokkulnath proposed. We can first train a network on the superclasses and, after a while, train a full network using all of the classes, with the superclass weights as an initialization. The problem with this approach is that it is not general: it assumes the labels can be divided into superclasses, and that some expert knowledge goes into choosing those superclasses correctly.
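For option 1, growing the final layer while keeping the already-trained class weights could look something like this (a rough sketch, not tested code):

import torch
import torch.nn as nn

def grow_classifier(fc, n_new):
    # Build a larger final layer: old class weights are copied over, only the
    # rows for the newly introduced classes keep their fresh initialization.
    new_fc = nn.Linear(fc.in_features, fc.out_features + n_new)
    with torch.no_grad():
        new_fc.weight[:fc.out_features] = fc.weight
        new_fc.bias[:fc.out_features] = fc.bias
    return new_fc

# As noted above: after growing, freeze the rest of the network, train only the
# new rows for a bit, then unfreeze and resume normal training.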

Valid point. But I still have a hunch that my implementation of your method would yield poor results even with a better network, for the reasons I described in the post above. The question is: what is the best way to implement your approach? I personally don’t like approach (2) because it is not general. Approach (1) seems better.

I’m personally not sure about either your method or Leslie’s. I think that showing the network only a partial training set leads to weights that generalize less well to the validation set, and once a larger set of training images is introduced, the network has to correct for this, leading to a longer training time. Training on the whole training set from the beginning, on the other hand, lets the network converge more directly to a general set of weights.

But it’s still worth trying the approach on @radek’s CIFAR10 baseline.

@Leslie, @jeremy, @deanmark
I’ve been playing around with exactly that recently: I implemented a custom sampler so that we could use less data and train quicker. My results so far indicate that this type of approach can speed up learning, but it doesn’t converge to a good end result. I am using CIFAR10 as the benchmark dataset, and the baseline is the fast.ai DAWNBench architecture.
E.g. if our target is the fastest time to 70-80% accuracy on CIFAR10, then small-sample training could win. But if our target is 90+% accuracy, small-sample training never achieves it. These are my results so far.
By the way, I modified fast.ai here and there and my current fastest CIFAR10 result is <11 min to 93+% accuracy on my local machine (compared with <15 min to 94% accuracy using the original fast.ai). I am trying to find this last 1% to push it over 94% :slight_smile:

@kcturgutlu, a somewhat related off-topic question, maybe you can help me out: in wideresnet.py, line 37, why multiply by 0.2?

Based on the comments over the last 24 hours, I’d say neither method is better than the baseline.

In hindsight, I realized a flaw in my thinking. The mini-batch size is the same throughout, so the speed per iteration is the same. We are only limiting the number of training samples, which isn’t what I had in mind. I was wondering if it were possible to very quickly estimate a good solution (i.e., weights) with very few samples. One can think of this as a warm start, and I still think there might be a good way to achieve it quickly…

Also, I believe @deanmark’s comments about training with a subset of classes to be correct.

I’d like to share my personal scientific process.
Initial intuition/hypothesis -> experiments -> observation, especially of contradictions to my initial intuition -> THINK -> understanding

I thank everyone for any and all of their efforts with these experiments. This isn’t necessarily over yet. We are at the thinking stage and the goal is understanding. If anyone has any profound (or less than profound) thoughts about these experiments and where to go next and where this might lead, please share. I too will think about this.

Best,
Leslie


Thank you too @Leslie. I had fun trying something new, so no harm there. It’s nice to have the opportunity to participate in research in my spare time. I hope we’ll all gain some insight from this experience. I’m looking forward to more threads like this in the future.

P.S. Gabor filter initialization seems to have been explored already. Intel Caffe has a Gabor filler to initialize conv layers with Gabor filters. Also here and here.

I didn’t know that Gabor filter initialization was implemented in Caffe. I’ve wondered about this since the paper “How transferable are features in deep neural networks?” says “on the first layer they learn features similar to Gabor filters”. I’ve also wondered about initializing the weights to discrete cosine transforms (DCT), suitably normalized like msra. Has anyone done experiments with either of these initializations? Is training any faster? And is the performance any better?
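In case anyone wants to try it, here is a rough sketch of what a DCT-based first-layer initialization could look like in PyTorch, with each filter scaled to roughly match He/msra magnitudes (my own guess at a recipe, not an established method):

import math
import torch
import torch.nn as nn

def dct_filters(k):
    # The k*k separable 2-D DCT-II basis functions as a (k*k, k, k) tensor.
    n = torch.arange(k, dtype=torch.float32)
    basis = torch.cos(math.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * k))  # rows: frequency, cols: position
    return torch.einsum('px,qy->pqxy', basis, basis).reshape(k * k, k, k)

def dct_init_(conv):
    # Fill a conv layer with DCT basis filters: cycle the basis over output
    # channels, repeat over input channels, and scale each filter to norm
    # sqrt(2), comparable to the expected norm under He initialization.
    k = conv.kernel_size[0]
    w = dct_filters(k)[torch.arange(conv.out_channels) % (k * k)]   # (out, k, k)
    w = w[:, None].repeat(1, conv.in_channels, 1, 1)                # (out, in, k, k)
    w = w / w.view(conv.out_channels, -1).norm(dim=1).view(-1, 1, 1, 1)
    with torch.no_grad():
        conv.weight.copy_(w * math.sqrt(2.0))

# e.g. dct_init_(model.conv1)   # model.conv1 here is hypothetical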

Update on the progressive resizing idea (i.e. increasing the image size as training progresses):

Workflow and Experiments:

  • Plain vanilla model with one-cycle learning
  • Plain vanilla model with cosine annealing
  • Progressive image resizing with one-cycle learning
  • Progressive image resizing with adaptive reduction in batch size and one-cycle learning

Took the CIFAR10 dataset, trained a few epochs at size 32, and kept doubling the size up to 128.
Batch size: 256 for the first two experiments (could not go higher due to GPU memory issues).
Learning policy: one-cycle learning

i.e. learn.fit(lr, 1, cycle_len=15, use_clr_beta=(10, 13.68, 0.95, 0.85), wds=1e-4)
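The resizing loop itself is roughly the following (a sketch only: get_data(sz, bs) stands for a helper that rebuilds the CIFAR10 ModelData object at the given image size and batch size):

szs = [32, 64, 128]
bss = [512, 256, 128]

for sz, bs in zip(szs, bss):
    learn.set_data(get_data(sz, bs))   # swap in the resized data before the next cycle
    learn.fit(lr, 1, cycle_len=5, use_clr_beta=(10, 13.68, 0.95, 0.85), wds=1e-4)  # 3 x 5 = 15 epochs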

Here are my observations:
Epochs: 15
Progressive resizing sizes: szs = [32, 64, 128]
Batch sizes: bss = [512, 256, 128]

The bumps in the curves are due to the changes in image size/batch size. Varying both image size and batch size seems to work best among them.
When tried with longer cycles there isn’t much gain in performance, and the later methods become computationally expensive for a very meagre improvement.

I have a feeling that if we lengthen the cycles as the image size increases, we would get better performance than all of the above.

Experiment                   Total epochs   trn_loss   val_loss   accuracy
losses_vannila               15             0.53523    0.583793   0.8001
losses_fit_usual             15             0.644244   0.675924   0.7635
progressive_resizing_loss    15             0.477568   0.456707   0.8439
adaptive_batch_loss          15             0.345881   0.35437    0.8839

@jeremy, @Leslie and others, please share your thoughts!
P.S.: Will update the associated notebook after I clean up the code.


Very interesting.

Questions:

  1. What architecture are you using?
  2. Why the inconsistency between the accuracy table and the plot? The plot shows the green curve as the highest final accuracy but the table says the combined image/batch resizing is best.
  3. What is the LR range? I am ignorant of the fastai implementation.
  4. What is the size of the validation data?

Another initial thought is that you are using CIFAR and resizing it larger, which introduces blur. It might be better to resize ImageNet down and test on the original, pristine validation images.

I am somewhat surprised by the magnitude of the improvement from the resizing and would like to sort it out in my mind.

Hi,

I have been playing about with the training phases notebook, increasing the number of images in training using the one-cycle schedule provided in the notebook, with the Darknet architecture on a GTX 970. I created CSV files with different numbers of images (10, 100, 1000). I have a link to my notebook.

As @urmas.pitsi pointed out, I need to reset my model, so my results don’t show anything of worth at the moment, but they could be useful for people wanting a quick and dirty starting point!


nice!

  1. If you want to reset and start from scratch, run this cell again:
    m = Darknet([1, 2, 4, 6, 3], num_classes=10, nf=32)
    Otherwise your training will just accumulate on the same model.

  2. It seems that your training and validation data are overlapping; that is why we see validation_accuracy=1.0.
    When you create phases, it could be that one phase trains on data that will be in the validation set of another phase (one possible fix is sketched after this list). I may be wrong here, but it seems so at a glance:
    df2 = pd.read_csv(PATH/'train_10.csv')
    val_idx2 = get_cv_idxs(len(df2))
    df3 = pd.read_csv(PATH/'train_100.csv')
    val_idx3 = get_cv_idxs(len(df3))
    df4 = pd.read_csv(PATH/'train_1000.csv')
    val_idx4 = get_cv_idxs(len(df4))
    df5 = pd.read_csv(PATH/'train_2000.csv')
    val_idx5 = get_cv_idxs(len(df5))
    df6 = pd.read_csv(PATH/'train_3000.csv')
    val_idx6 = get_cv_idxs(len(df6))
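One possible fix (a sketch of the intent, not the notebook's actual code): fix the validation rows once on the largest CSV and draw each phase's training rows only from the remainder, so no phase ever trains on another phase's validation data:

import numpy as np
import pandas as pd

df_full = pd.read_csv(PATH/'train_3000.csv')            # the largest CSV
rng = np.random.RandomState(42)
val_rows = rng.choice(len(df_full), int(0.2 * len(df_full)), replace=False)
train_pool = np.setdiff1d(np.arange(len(df_full)), val_rows)

def phase_df(n_train):
    # One phase's dataframe: n_train training rows plus the shared validation
    # rows; the validation indices are always the last len(val_rows) positions.
    trn = rng.choice(train_pool, n_train, replace=False)
    df = df_full.iloc[np.concatenate([trn, val_rows])].reset_index(drop=True)
    return df, np.arange(n_train, n_train + len(val_rows))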

Cheers! I thought something was a bit wrong with it! I will change my code later and re-test it!

I’ve read the wide ResNet paper after your question. I’m actually not sure why the activation is multiplied by 0.2 before the addition in BasicBlock (which is a full pre-activation residual block). It might just be a number that worked well in this case, I’m not sure :smile: Maybe we can wait for the DAWNBench team to reply; I am also curious.

As I’ve played around quite extensively with this wideresnet, it seems that 0.2 is a pretty good choice :slight_smile: I tried it without the multiplier and with various other constants, but none matched the performance of 0.2.


So it acts as a weighted sum. I wonder what happens if we make it a learnable parameter for ResNets in general :thinking:


@Leslie
Partial sampling in training for ImageNet amongst other cool stuff.
https://arxiv.org/abs/1805.08249

Thanks to @jeremy for tweeting that. Are they also using an idea similar to Jeremy’s, i.e. making disjoint class groups?

Could you write me a short sample of how it could be done in PyTorch, just a line or two of code? I could test it right away :slight_smile:
I am still learning the PyTorch way of deep learning…

I am not sure, but something like this might work; I haven’t tested it:

The idea is to make the scale a learnable parameter, registered on the module as an nn.Parameter, so that autograd computes its gradient and the optimizer updates it along with the rest of the weights once it is part of the computational graph.

import torch
import torch.nn as nn
import torch.nn.functional as F
# conv_2d, bn_relu_conv and noop are the helpers from fastai's wideresnet module.

class BasicBlock(nn.Module):
    def __init__(self, ni, nf, stride, drop_p=0.0):
        super().__init__()
        self.bn = nn.BatchNorm2d(ni)
        self.conv1 = conv_2d(ni, nf, 3, stride)
        self.conv2 = bn_relu_conv(nf, nf, 3, 1)
        self.drop = nn.Dropout(drop_p, inplace=True) if drop_p else None
        self.shortcut = conv_2d(ni, nf, 1, stride) if ni != nf else noop
        # nn.Parameter registers the scale with the module, so it appears in
        # model.parameters() and actually gets updated by the optimizer.
        self.weight = nn.Parameter(torch.FloatTensor(1).uniform_(0, 1))

    def forward(self, x):
        x2 = F.relu(self.bn(x), inplace=True)
        r = self.shortcut(x2)
        x = self.conv1(x2)
        if self.drop: x = self.drop(x)
        self.weight.data.clamp_(0, 1)      # keep the learned scale in [0, 1]
        x = self.conv2(x) * self.weight    # was: * 0.2
        return x.add_(r)
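If this trains, it would be interesting to read the learned scales back out per block and compare them with the hand-tuned 0.2, e.g. (assuming model is the WideResNet instance):

scales = [m.weight.item() for m in model.modules() if isinstance(m, BasicBlock)]
print(scales)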