@sebastianruder has discussed this paper and various others in an excellent overview post:
I’m not sure that they’re using the best semi-supervised or transfer learning approaches (e.g. the kind of stuff we used in ULMFiT) and I’m also suspicious of “baselines” that use massive compute to optimize hyper-params (since that’s not how we train models in practice).
@jeremy regarding simple English, it seems like the language modeling implementation would be to start with the most frequent tokens and slowly deflate the “unknown” sentinel by extracting increasingly rare types into their own embeddings (which might increase in size during this process). This also presents an interesting opportunity to systematically size the vocabulary, rather than relying on heuristics to decide on a minimum frequency or vocabulary size.
Of course, for language models that affects the difficulty of both encoding and decoding. For language classification, it changes encoding difficulty but not decoding difficulty. For image processing, progressive resizing would be the analogous change to encoding difficulty. For classification tasks in general, it seems like we’re exploring decoding difficulty.
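The vocabulary-growth idea above could be sketched roughly as follows. This is only a toy illustration, not anyone's actual implementation: the `<unk>` sentinel name, the helper names and the tiny corpus are all assumptions made for the example. It keeps the most frequent types, maps everything else to the unknown sentinel, and shows the unknown rate falling as the vocabulary grows.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the "unknown" sentinel


def build_vocab(tokens, size):
    """Keep the `size` most frequent types; every other type maps to UNK."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic
    # and a larger vocabulary always extends a smaller one.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, _ in ranked[:size]]
    return {tok: idx for idx, tok in enumerate([UNK] + kept)}


def encode(tokens, vocab):
    """Map tokens to ids, falling back to the UNK id for rare types."""
    unk_id = vocab[UNK]
    return [vocab.get(tok, unk_id) for tok in tokens]


def unk_rate(tokens, vocab):
    """Fraction of tokens that are encoded as UNK."""
    ids = encode(tokens, vocab)
    return sum(i == vocab[UNK] for i in ids) / len(ids)


corpus = "the cat sat on the mat and the dog sat on the log".split()
for size in (2, 4, 8):
    vocab = build_vocab(corpus, size)
    print(size, round(unk_rate(corpus, vocab), 2))
```

Because the ranking is deterministic, ids assigned at a small vocabulary size stay the same at larger sizes, so embeddings that were already trained would not need to be reshuffled. Tracking the unknown rate per size is one possible way to choose the vocabulary size systematically.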
By this, do you mean that some sort of label embeddings were trained on ImageNet data and made publicly available, or is it just the usual word2vec embeddings?
But I believe it’s probably the first one, since it makes more sense for learning something like a t-SNE representation for MNIST. (If my understanding is correct, we can use t-SNE embeddings of MNIST, start with k clusters, and add more classes from each cluster step by step.) For example, you start training with 7 and then at the next stage add class 1, which shouldn’t be that hard to learn after learning 7. But then how applicable would this be in real life? How certain can we be of the unsupervised part?
Leslie’s approach seems easier to implement by just writing a custom sampler, I suppose. Regarding your approach, do we need to dynamically change the model output as we add more classes, such as initializing a new logistic unit for the newly added class at the final layer?
Please correct me if there is a conceptual misunderstanding.
I wanted to point out a flaw in my implementation of Jeremy’s approach. I initialize a full network and take out some of the classes, but this turns out to perform poorly. My guess is that when you only show the network negative examples of a certain class, the FC weights of the missing classes turn out to be pretty bad, and it takes time for the network to correct this once the positive samples are introduced.
Probably a better approach would be to implement a hierarchical approach and dynamically resize the network accordingly. There are two options here:
(1) Showing the network only certain classes each time. This uses the same classes that are in the labels, but we need to increase the size of the last FC layer gradually to match the number of classes currently present. The problem here is that at each step you are introducing uninitialized weights. I assume that in this case you should freeze all of the network, train the uninitialized weights, and then resume training.
(2) Showing the network all of the data, but changing the labels of the data to fit a hierarchy. This is what @gokkulnath proposed. We can first train a network on the superclasses and, after a while, train a full network on all of the classes, using the superclass weights as an initialization. The problem with this approach is that it is not general: it assumes the labels can be divided into superclasses. It also assumes some expert knowledge in choosing these superclasses correctly.
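For the first option, growing the last FC layer might look something like the sketch below. It is untested against the actual experiments, and `grow_linear` and `freeze_all_but` are hypothetical helper names, not fastai or PyTorch functions. It copies the rows of the already-trained classes into a wider layer and then freezes everything except the head. As a simplification, the whole head stays trainable, not only the newly added rows.

```python
import torch
import torch.nn as nn


def grow_linear(old: nn.Linear, n_new: int) -> nn.Linear:
    """Return a wider Linear layer that keeps the old classes' weights."""
    new = nn.Linear(old.in_features, old.out_features + n_new,
                    bias=old.bias is not None)
    with torch.no_grad():
        # Rows for existing classes are copied; the extra rows keep
        # PyTorch's default initialization.
        new.weight[: old.out_features].copy_(old.weight)
        if old.bias is not None:
            new.bias[: old.out_features].copy_(old.bias)
    return new


def freeze_all_but(model: nn.Module, head: nn.Module) -> None:
    """Freeze every parameter of `model` except those belonging to `head`."""
    for p in model.parameters():
        p.requires_grad = False
    for p in head.parameters():
        p.requires_grad = True


# Toy usage: a 3-class head grows to 5 classes.
body = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
head = grow_linear(nn.Linear(16, 3), n_new=2)
model = nn.Sequential(body, head)
freeze_all_but(model, head)
```

After a few epochs of training only the head, the body could be unfrozen again to resume normal training.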
Valid point. But I still have a hunch that my implementation of your method would yield poor results even with a better network, for reasons I described in the post above. The question is what is the best way to implement your approach? I personally don’t like approach (2) because it is not general. Approach (1) seems better.
I personally am not sure about either your method or Leslie’s. I think that showing the network only a partial training set leads to weights that generalize less well to the validation set, and once a larger set of training images is introduced, the network has to correct this, leading to a longer training time. Training on the whole training set from the beginning, by contrast, allows the network to converge to a general set of weights more directly.
But it’s still worth trying the approach on @radek’s CIFAR10 baseline.
@Leslie, @jeremy, @deanmark
I’ve been playing around with exactly that recently: I implemented a custom sampler so that we could use less data to train quicker. My results so far indicate that this type of approach can speed up learning, but it doesn’t converge to a good end result. I am using CIFAR10 as a benchmark dataset, and the baseline is the fast.ai DAWNBench architecture.
E.g., if our target is to find the fastest time to 70-80% accuracy on CIFAR10, then small-sample training could win. But if our target is 90+% accuracy, then small-sample training never achieves it. These are my results so far.
By the way, I modified fast.ai here and there, and my current fastest CIFAR-10 result is <11 min to 93+% accuracy on my local machine (compared with <15 min to 94% accuracy using the original fast.ai). I am trying to find this last 1% to push it over 94%.
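A small-sample sampler along these lines could be sketched as below. This is not the sampler used for the results above; `stratified_subset` is a hypothetical helper and the labels are a toy stand-in for CIFAR-10. It picks a fixed number of indices per class so that each reduced training set stays balanced.

```python
import random
from collections import defaultdict


def stratified_subset(labels, per_class, seed=0):
    """Return sorted indices with up to `per_class` samples for each label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    chosen = []
    for y in sorted(by_class):
        idxs = by_class[y][:]   # copy so the grouping itself is not shuffled
        rng.shuffle(idxs)
        chosen.extend(idxs[:per_class])
    return sorted(chosen)


# Toy labels: 1000 samples, 10 balanced classes (stand-in for CIFAR-10).
labels = [i % 10 for i in range(1000)]
for per_class in (10, 50, 100):
    idxs = stratified_subset(labels, per_class)
    print(per_class, len(idxs))
```

In PyTorch, the resulting index list could be handed to `torch.utils.data.SubsetRandomSampler` and passed to a `DataLoader` through its `sampler` argument.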
Based on the comments over the last 24 hours, I’d say “Neither” method is better than the baseline.
In hindsight, I realized a flaw in my thinking. The mini-batch size is the same throughout so the speed is the same. We are only limiting the number of training samples, which isn’t what I had in mind. I was wondering if it were possible to very quickly estimate a good solution (i.e., weights) with very few samples. One can think of this as a warm start and I still think there might be a good way to achieve it quickly…
Also, I believe @deanmark’s comments about training with a subset of classes to be correct.
I’d like to share my personal scientific process.
Initial intuition/hypothesis -> experiments -> observation, especially of contradictions to my initial intuition -> THINK -> understanding
I thank everyone for any and all of their efforts with these experiments. This isn’t necessarily over yet. We are at the thinking stage and the goal is understanding. If anyone has any profound (or less than profound) thoughts about these experiments and where to go next and where this might lead, please share. I too will think about this.
Thank you too @Leslie. I had fun trying something new, so no harm there. It’s nice to have the opportunity to participate in research in my spare time. I hope we’ll all gain some insight from this experience. I’m looking forward to more threads like this in the future.
P.S. Gabor filter initialization seems to have been explored already. Intel Caffe has a Gabor filler to initialize conv layers with Gabor filters. Also here and here.
I didn’t know that Gabor filter initialization was implemented in Caffe. I’ve wondered about this since the paper “How transferable are features in deep neural networks?” says “on the first layer they learn features similar to Gabor filters”. I’ve also wondered about initializing the weights to discrete cosine transforms (DCT), suitably normalized like msra. Has anyone done experiments with either of these initializations? Is training any faster? And is the performance any better?
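To make the DCT idea concrete, here is a rough sketch of what such an initialization could look like. It is only one interpretation: the function name is hypothetical, a single input channel is assumed by default, and "normalized like msra" is read as scaling each filter so its root-mean-square value equals the He/msra standard deviation sqrt(2 / fan_in). RMS is used instead of standard deviation because the constant (DC) filter has zero variance.

```python
import math


def dct_filters(k=3, fan_in=None):
    """Build the k*k two-dimensional DCT-II basis filters of size k x k.

    Each filter is rescaled so that its RMS value is sqrt(2 / fan_in).
    """
    fan_in = fan_in or k * k          # assumption: one input channel
    target_rms = math.sqrt(2.0 / fan_in)
    filters = []
    for u in range(k):                # vertical frequency
        for v in range(k):            # horizontal frequency
            f = [[math.cos(math.pi * (2 * i + 1) * u / (2 * k))
                  * math.cos(math.pi * (2 * j + 1) * v / (2 * k))
                  for j in range(k)]
                 for i in range(k)]
            rms = math.sqrt(sum(x * x for row in f for x in row) / (k * k))
            filters.append([[x * target_rms / rms for x in row] for row in f])
    return filters


filters = dct_filters(3)
print(len(filters), "filters of size 3x3")
```

The filters are mutually orthogonal, so a first conv layer initialized this way would start from a complete frequency basis for each k x k patch. Whether that trains faster than a random msra initialization is exactly the open question above.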
Update on Progressive Resizing Idea (i.e., increasing the image size as training progresses):
Workflow and Experiments:
Plain Vanilla Model with One Cycle Learning
Plain Vanilla Model with Cosine Annealing
Progressive Image Resizing with One Cycle Learning
Progressive Image Resizing with Adaptive Reduction in Batch size with One Cycle Learning
Took the CIFAR10 dataset, trained a few epochs at size 32, and kept doubling the size up to 128
Batch size: 256 for the first two experiments (was not able to use 256 in the others because of GPU memory issues).
Learning Policy : One Cycle Learning
Bumps are due to Changes in Image Size/Batch Size. Varying the Image size and batch size seems to work the best among them.
When trained for a longer epoch cycle, there is not much gain in performance, and the later methods become computationally expensive for a very meager improvement.
I have a feeling that if we elongate the number of cycles as the image resizing happens then we would be able to have better performance compared to all of them.
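The resizing schedule described above can be written down as a small helper. This is a sketch with assumed numbers: the epochs per stage and the rule for shrinking the batch size are not given in the post. Here the batch size is divided by four at each doubling, since doubling the side length quadruples the pixels per image, which keeps the pixels per batch roughly constant.

```python
def resize_schedule(start_size=32, max_size=128, start_bs=256,
                    epochs_per_stage=5, shrink_bs=True):
    """List training stages with a doubling image size.

    `epochs_per_stage` and the divide-by-4 batch rule are assumptions
    made for illustration, not values taken from the experiments.
    """
    stages = []
    size, bs = start_size, start_bs
    while size <= max_size:
        stages.append({"size": size, "batch_size": bs,
                       "epochs": epochs_per_stage})
        size *= 2
        if shrink_bs:
            bs = max(1, bs // 4)   # 4x pixels per image -> 4x fewer images
    return stages


for stage in resize_schedule():
    print(stage)
```

Giving later stages more epochs, as suggested above, would only require replacing the constant `epochs_per_stage` with a per-stage value.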
Why the inconsistency between the accuracy table and the plot? The plot shows the green curve as the highest final accuracy but the table says the combined image/batch resizing is best.
What is the LR range? I am ignorant of the fastai implementation.
What is the size of the validation data?
Another initial thought is that you are using CIFAR and resizing it larger, which introduces blur. It might be better to resize ImageNet down and test on the original, pristine validation images.
I am somewhat surprised by the magnitude of the improvement from the resizing and would like to sort it out in my mind.
I have been experimenting with the training phases notebook, increasing the number of training images while using the one cycle policy provided in the notebook and the Darknet architecture on a GTX 970. I created CSV files with different numbers of images (10, 100, 1000). I have a link to my notebook
As @urmas.pitsi pointed out, I need to reset my model, so my results don’t show anything of worth at the moment. But they could be useful for people wanting a quick and dirty starting point!
if you want to reset and start from scratch then run this cell again:
m = Darknet([1, 2, 4, 6, 3], num_classes=10, nf=32)
otherwise your training will just accumulate on the same model.
It seems that your training and validation data overlap, which is why we see validation_accuracy=1.0.
When you create phases, it could be that one phase trains on data that will be in the validation set for another phase. I may be wrong here, but it seems so at a glance:
df2 = pd.read_csv(PATH/'train_10.csv')
val_idx2 = get_cv_idxs(len(df2))
df3 = pd.read_csv(PATH/'train_100.csv')
val_idx3 = get_cv_idxs(len(df3))
df4 = pd.read_csv(PATH/'train_1000.csv')
val_idx4 = get_cv_idxs(len(df4))
df5 = pd.read_csv(PATH/'train_2000.csv')
val_idx5 = get_cv_idxs(len(df5))
df6 = pd.read_csv(PATH/'train_3000.csv')
val_idx6 = get_cv_idxs(len(df6))
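One way to rule out this kind of leakage is to hold out a single validation set first and build every phase from the remaining files only. The sketch below uses made-up file names and hypothetical helpers (`split_once`, `phase_subsets`); it does not use the fastai API.

```python
import random


def split_once(items, val_frac=0.2, seed=42):
    """Shuffle once and return (train_items, val_items)."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[n_val:], shuffled[:n_val]


def phase_subsets(train_items, sizes):
    """Nested training subsets: each phase extends the previous one."""
    return [train_items[:n] for n in sizes]


files = [f"img_{i}.png" for i in range(5000)]   # hypothetical file names
train, val = split_once(files)
phases = phase_subsets(train, [10, 100, 1000, 2000, 3000])

val_set = set(val)
for phase in phases:
    # No phase may contain an image from the shared validation set.
    assert not val_set & set(phase)
print([len(p) for p in phases], len(val))
```

With one shared validation set, a validation accuracy of 1.0 can no longer come from images the model has already trained on in an earlier phase.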
I’ve read the wide resnet paper in response to your question. I am actually not sure why the activation is multiplied by 0.2 before the addition in BasicBlock (which is a full pre-activation res block). It might just be a number that worked well in this case; I’m not sure. Maybe we can wait for the DAWNBench team to reply. I am also curious.
As I’ve played around quite extensively with this wideresnet, it seems that 0.2 is a pretty good choice. I tried without it and with various other constants, but none matched the performance of 0.2.
Could you write me a short sample of how it could be done in PyTorch, just a line or two of code? I could test it right away.
I am still learning the pytorch way of deep learning…
I am not sure, but this might work; I haven’t tested it:
The idea is to define the weight as a learnable parameter by wrapping it in nn.Parameter during initialization, so that it is registered with the module and returned by model.parameters(). Autograd and the optimizer should then do the rest, I suppose. (A plain Variable with requires_grad=True would get gradients but would not be handed to the optimizer.)
import torch
import torch.nn as nn
import torch.nn.functional as F

# conv_2d, bn_relu_conv and noop are helpers from the surrounding notebook (not defined here).
class BasicBlock(nn.Module):
    def __init__(self, ni, nf, stride, drop_p=0.0):
        super().__init__()
        self.bn = nn.BatchNorm2d(ni)
        self.conv1 = conv_2d(ni, nf, 3, stride)
        self.conv2 = bn_relu_conv(nf, nf, 3, 1)
        self.drop = nn.Dropout(drop_p, inplace=True) if drop_p else None
        self.shortcut = conv_2d(ni, nf, 1, stride) if ni != nf else noop
        # Learnable residual scale, initialized uniformly in [0, 1)
        self.weight = nn.Parameter(torch.rand(1))

    def forward(self, x):
        x2 = F.relu(self.bn(x), inplace=True)
        r = self.shortcut(x2)
        x = self.conv1(x2)
        if self.drop: x = self.drop(x)
        # Keep the learned scale inside [0, 1]
        self.weight.data.clamp_(0, 1)
        x = self.conv2(x) * self.weight  # replaces the fixed 0.2
        return x.add_(r)