Yes, there should be 2 softmaxes, or something similar.
You’ve described my thought process much more elegantly than I could have. I think @jeremy is exactly right, and that it should be a mixture of two softmaxes in the targets, although I think his earlier comment that the mixing weight is drawn from a beta distribution is also important, as most samples will have a dominant class.
Combining the images and loss in a linear way may also be helping with regularization. In the majority of cases one of the classes is dominant, and relative to its signal the other class is adding some noise to the input and targets, sampled from the other classes. In order to get the correct scores, the signal from this secondary class has to be much stronger, which to me intuitively sounds like it would make the network more robust and better able to differentiate between classes. Maybe noise is the wrong term, but you get the idea.
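To make the "linear combination with a dominant class" idea concrete, here is a minimal PyTorch sketch of mixing a batch with a shuffled copy of itself. The function name `mixup_batch`, the default `alpha=0.4`, and the choice to force the original sample to be the dominant one are assumptions of this sketch, not necessarily what the fastai callback does.

```python
import torch
import torch.nn.functional as F

def mixup_batch(x, y, num_classes, alpha=0.4):
    """Mix every sample with a random partner from the same batch (sketch)."""
    n = x.size(0)
    # One mixing weight per sample, drawn from Beta(alpha, alpha).
    lam = torch.distributions.Beta(alpha, alpha).sample((n,))
    # Assumption of this sketch: keep the original sample dominant (lam >= 0.5).
    lam = torch.max(lam, 1 - lam)
    perm = torch.randperm(n)
    # Reshape lam so it broadcasts over the non-batch dimensions of x.
    lam_x = lam.view(-1, *([1] * (x.dim() - 1)))
    x_mix = lam_x * x + (1 - lam_x) * x[perm]
    # Targets become a mixture of two one-hot vectors with the same weights.
    y_onehot = F.one_hot(y, num_classes).float()
    y_mix = lam[:, None] * y_onehot + (1 - lam[:, None]) * y_onehot[perm]
    return x_mix, y_mix, lam
```

Each row of `y_mix` still sums to 1, with most of the mass on the dominant class and the rest on the partner's class.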
It’s such an interesting paper/concept. One of my favourites of the year.
Interestingly, in Amazon’s latest paper, “Bag of Tricks for Image Classification with Convolutional Neural Networks”, they used mixup to get an additional percentage point on their CV models.
So for mixup with multilabel classification we should use y_stack=False? Could you explain what that’s doing, exactly?
Indeed you should.
y_stack=True stacks the not-encoded targets of the batch, the shuffled version of the batch and the lambda vector.
y_stack=False will just do the linear combination directly.
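The two ways of handling targets can be compared in a small sketch. This is plain PyTorch, not the fastai callback itself, and all variable names are made up for illustration. For single-label cross-entropy, weighting two per-sample losses by lambda gives the same value as one loss against the linearly combined one-hot targets. For multilabel targets there is no class index to stack, so the multi-hot vectors are combined directly.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)                  # model outputs for a batch of 4
y1 = torch.tensor([0, 1, 2, 1])             # targets of the batch
y2 = y1[torch.randperm(4)]                  # targets of the shuffled batch
lam = torch.tensor([0.9, 0.7, 0.8, 0.6])    # one mixing weight per sample

# "Stacked" style: keep (y1, y2, lam) and mix the two losses.
loss_stacked = (lam * F.cross_entropy(logits, y1, reduction="none")
                + (1 - lam) * F.cross_entropy(logits, y2, reduction="none")).mean()

# "Direct" style: mix the encoded targets, then compute a single loss.
y_soft = (lam[:, None] * F.one_hot(y1, 3).float()
          + (1 - lam[:, None]) * F.one_hot(y2, 3).float())
loss_direct = -(y_soft * F.log_softmax(logits, dim=1)).sum(1).mean()

# Multilabel case: targets are already multi-hot floats, so mix them directly.
m1 = torch.tensor([[1., 0., 1.], [0., 1., 0.], [1., 1., 0.], [0., 0., 1.]])
m2 = m1[torch.randperm(4)]
m_mix = lam[:, None] * m1 + (1 - lam[:, None]) * m2
loss_multilabel = F.binary_cross_entropy_with_logits(logits, m_mix)
```

Here `loss_stacked` and `loss_direct` agree up to floating-point error, because cross-entropy is linear in the targets.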
Mixup is also effective for detection:
“Bag of Freebies for Training Object Detection Neural Networks”
Zhi Zhang, Tong He, Hang Zhang, Zhongyuan Zhang, Junyuan Xie, Mu Li
A question: is it possible to apply mixup to the last conv layer’s activations instead of to the images themselves, and thus train only the head of a large convnet (like resnet50) by mixing up two 7x7x2048 or so activations?
Yes, you just have to use pytorch hooks for that.
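One way to do this with hooks, as a rough sketch: register a forward pre-hook on the head so that its input (the body's activations) is mixed within the batch before the head sees it. The class name `ActivationMixup`, the toy `body`/`head` split, and `alpha=0.4` are assumptions for illustration; only `register_forward_pre_hook` is the actual PyTorch API.

```python
import torch
import torch.nn as nn

class ActivationMixup:
    """Forward pre-hook that mixes a module's input within the batch (sketch)."""
    def __init__(self, alpha=0.4):
        self.alpha, self.enabled = alpha, True
        self.lam, self.perm = None, None

    def __call__(self, module, inputs):
        if not self.enabled:
            return None                      # None leaves the input untouched
        (feats,) = inputs
        n = feats.size(0)
        lam = torch.distributions.Beta(self.alpha, self.alpha).sample((n,))
        self.lam = torch.max(lam, 1 - lam).to(feats.device)
        self.perm = torch.randperm(n, device=feats.device)
        lam_b = self.lam.view(-1, *([1] * (feats.dim() - 1)))
        # Returning a tuple replaces the input passed to the hooked module.
        return (lam_b * feats + (1 - lam_b) * feats[self.perm],)

# Toy stand-ins for a pretrained body and a trainable head.
body = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 5))
model = nn.Sequential(body, head)

mixer = ActivationMixup()
handle = head.register_forward_pre_hook(mixer)
out = model(torch.randn(6, 3, 7, 7))
# mixer.lam and mixer.perm now describe how the targets must be mixed in the loss.
```

Set `mixer.enabled = False` for validation, or call `handle.remove()` to detach the hook entirely.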
Is it possible to use mixup for certain categories only? For example, we would not use mixup for the last category, but use it for all other categories.
Besides, I notice that when mixup is enabled, the training loss always seems to be bigger than the validation loss. Is this normal?
Is there evidence this mixup method of data aug generalizes beyond CIFAR10 and ImageNet?
You’d have to rewrite your version of the callback for that, but it’s possible (anything is possible, you just have to code it).
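As a hedged illustration of what such a rewritten callback might do at its core: force the mixing weight to 1 for any pair that involves an excluded class, so those samples pass through unmixed. The helper `selective_lambda` is hypothetical and is not part of fastai.

```python
import torch

def selective_lambda(lam, y, perm, excluded):
    """Set lam to 1 wherever the sample or its partner is in an excluded class."""
    excl = torch.tensor(sorted(excluded))
    # Skip mixing if either member of the pair belongs to an excluded class.
    skip = torch.isin(y, excl) | torch.isin(y[perm], excl)
    return torch.where(skip, torch.ones_like(lam), lam)

lam = torch.tensor([0.6, 0.7, 0.8, 0.9])
y = torch.tensor([0, 1, 2, 3])
perm = torch.tensor([1, 0, 3, 2])
new_lam = selective_lambda(lam, y, perm, excluded={3})
```

With `lam` equal to 1, the usual formula `lam * x + (1 - lam) * x[perm]` returns the original sample and target unchanged.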
Mixup is a tool to avoid overfitting, so yes, it’s normal that your training loss is bigger. Try reducing the alpha parameter.
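To see why reducing alpha softens the effect, the following standard-library sketch estimates the average weight of the dominant sample when lambda is drawn from Beta(alpha, alpha). The function name and sample count are arbitrary choices for illustration.

```python
import random

def mean_dominant_weight(alpha, n=20000, seed=0):
    """Average of max(lam, 1 - lam) for lam ~ Beta(alpha, alpha)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        lam = rng.betavariate(alpha, alpha)
        total += max(lam, 1 - lam)
    return total / n

# Smaller alpha pushes lam towards 0 or 1, so the mixed sample looks
# more like one of the two originals and the task gets easier.
for a in (1.0, 0.4, 0.1):
    print(a, round(mean_dominant_weight(a), 3))
```

With `alpha=1.0` the distribution is uniform and the dominant weight averages about 0.75. As alpha shrinks, it approaches 1, meaning almost no mixing.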
I don’t see functionality for embedding mixing in the mixup.py code in fastai. Can you please share how you approach this? Are there plans to add such a feature to mixup?
I’m thinking of taking a crack at this at some point. Essentially you have to pull the embedding outputs for two data points and mix them, rather than simply being able to mix the images directly. Probably could be done pretty easily with a custom layer.
@sgugger Seems like the idea of extending this to the embedding representations (and to all other layers!) has been explored:
They pick a random layer to combine from each minibatch, which seems to improve performance beyond Mixup.
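A loose sketch of the "random layer per minibatch" idea as described in this post. This is not the paper's implementation. The function name, the use of a single lambda per batch, and `alpha=0.4` are assumptions.

```python
import random
import torch
import torch.nn as nn

def random_layer_mixup(blocks, x, alpha=0.4):
    """Run x through blocks, mixing the input of one randomly chosen block."""
    k = random.randrange(len(blocks))        # k == 0 mixes the raw input
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    perm = torch.randperm(x.size(0))
    h = x
    for i, block in enumerate(blocks):
        if i == k:
            h = lam * h + (1 - lam) * h[perm]
        h = block(h)
    # The caller mixes the targets with the same lam and perm in the loss.
    return h, lam, perm

blocks = nn.ModuleList([nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3)])
out, lam, perm = random_layer_mixup(blocks, torch.randn(5, 4))
```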
There’s also MixFeat:
Which to my understanding (and I’m still trying to fully grok the paper) is about applying noise at different layers in a mixing fashion, so rather than swapping the entire response at a given layer for a % of activations, they mix the activations together. I’m working on a layer that does the mixing from the batch, which should make it pretty efficient, in order to try and replicate their results. As far as I can tell it should be somewhat orthogonal to mixup and manifold mixup because it’s not mixing targets.
I’ll also try to take a crack at manifold mixup, assuming I can get the chance in the next few weeks since it covers the embedding case.
I had seen the manifold mixup paper, but hadn’t gotten significant improvement over simple mixup in my experiments.
I’ll look at this MixFeat paper. It sounds a lot like the Shake/Even method at a first glance.
I’m familiar with Shake-Shake and Shake-Drop, but not Shake-Even. Can you link to the paper? I did some googling and came up dry. It might be similar, but there’s also a good chance I’m not doing it justice in my description of it. I’m still working my way through it.
It’s the same paper: https://arxiv.org/pdf/1705.07485.pdf
Though I think that’s what he calls Shake-Keep, not Shake-Even.
Have you tried mixup for a segmentation task?
Really intrigued by NLP + Mixup.
Have you done any more experiments on this?
By the way: in NLP the idea of mixing within the batch should be crucial, since otherwise the length of the documents won’t match!
With images Mixup seems easier to implement: from two images you create an input, then use the model normally. In NLP, Mixup occurs after embeddings, so I imagine you modify your model so that it takes two texts as input? (Only one would be given in validation, of course.)
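One possible shape for this, as a sketch under assumptions: rather than taking two texts as input, the model takes one padded batch and mixes the embedding outputs with a shuffled copy of the same batch, so sequence lengths match automatically. The class `MixupTextClassifier`, the GRU encoder, and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

class MixupTextClassifier(nn.Module):
    """Toy classifier that applies mixup after the embedding layer (sketch)."""
    def __init__(self, vocab_size, emb_dim, num_classes, alpha=0.4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.out = nn.Linear(emb_dim, num_classes)
        self.alpha = alpha

    def forward(self, tokens):
        e = self.emb(tokens)                          # (batch, seq, emb)
        lam = perm = None
        if self.training:                             # no mixing at validation
            n = e.size(0)
            lam = torch.distributions.Beta(self.alpha, self.alpha).sample((n,))
            lam = torch.max(lam, 1 - lam).to(e.device)
            perm = torch.randperm(n, device=e.device)
            w = lam[:, None, None]
            e = w * e + (1 - w) * e[perm]             # mix within the batch
        _, h = self.rnn(e)                            # h: (layers, batch, emb)
        return self.out(h[-1]), lam, perm

model = MixupTextClassifier(vocab_size=50, emb_dim=16, num_classes=3)
tokens = torch.randint(1, 50, (4, 10))                # padded batch of token ids
logits, lam, perm = model(tokens)                     # training mode by default
```

During training, the returned `lam` and `perm` are used to mix the targets in the loss. In `eval()` mode, both are `None` and each text is processed on its own.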