So for mixup with multilabel classification we should use y_stack=False? Could you explain what that’s doing, exactly?
Indeed you should.
y_stack=True stacks the non-encoded targets of the batch, the targets of the shuffled version of the batch, and the lambda vector.
y_stack=False just does the linear combination of the targets directly.
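To make the difference concrete, here is a minimal sketch in plain Python. It is not fastai's actual code: `bce`, `mix_targets`, and all the numbers are made up for illustration. It assumes the stacked form is used to weight two separate loss evaluations by lambda, while the non-stacked form mixes the multi-hot targets first. For a loss that is linear in the target, such as binary cross-entropy, both give the same value, which is why mixing the targets directly is enough in the multilabel case.

```python
import math

def bce(probs, targets):
    """Mean binary cross-entropy between predicted probabilities and float targets."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for p, t in zip(probs, targets)]
    return sum(terms) / len(terms)

def mix_targets(y1, y2, lam):
    """Linear combination of two multi-hot target vectors."""
    return [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]

lam = 0.7                # mixing coefficient (hypothetical value)
probs = [0.9, 0.2, 0.6]  # model output for the mixed input (made up)
y1 = [1.0, 0.0, 1.0]     # multi-hot labels of the original sample
y2 = [0.0, 1.0, 1.0]     # multi-hot labels of the shuffled sample

# Non-stacked style: mix the targets, then evaluate the loss once.
loss_direct = bce(probs, mix_targets(y1, y2, lam))
# Stacked style (assumed): keep both targets and weight the two losses by lam.
loss_stacked = lam * bce(probs, y1) + (1 - lam) * bce(probs, y2)
print(loss_direct, loss_stacked)
```

With integer class labels (single-label classification) there is nothing to mix until the labels are one-hot encoded, which is presumably why the stacked form exists.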
Mixup is also effective for detection:
“Bag of Freebies for Training Object Detection Neural Networks”
Zhi Zhang, Tong He, Hang Zhang, Zhongyue Zhang, Junyuan Xie, Mu Li
A question: is it possible to apply mixup to the activations of the last conv layer instead of to the images themselves? That is, to train only the head of a large conv net (like resnet50) by mixing up two 7x7x2048 or so activations?
Yes, you just have to use PyTorch hooks for that.
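A rough sketch of what that could look like, assuming a model split into a frozen `body` and a trainable `head`. The tiny network, the `state` dict, and `mix_activations` are all hypothetical stand-ins, not fastai code. It relies on the fact that a forward hook which returns a tensor replaces the module's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for a pretrained body and a new head (hypothetical sizes;
# a resnet50 body would emit 2048x7x7 activations instead of 8x7x7).
body = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(7))
head = nn.Sequential(nn.Flatten(), nn.Linear(8 * 7 * 7, 4))
model = nn.Sequential(body, head)
for p in body.parameters():
    p.requires_grad_(False)  # train only the head

state = {}  # lambda and permutation shared between the training step and the hook

def mix_activations(module, inputs, output):
    """Forward hook: the returned tensor replaces the module's output."""
    lam, perm = state["lam"], state["perm"]
    return lam * output + (1 - lam) * output[perm]

handle = body.register_forward_hook(mix_activations)

x = torch.randn(6, 3, 28, 28)
y = torch.eye(4)[torch.randint(0, 4, (6,))]  # one-hot targets

state["lam"] = float(torch.distributions.Beta(0.4, 0.4).sample())
state["perm"] = torch.randperm(x.size(0))

logits = model(x)  # the hook mixes the body's output before the head sees it
y_mixed = state["lam"] * y + (1 - state["lam"]) * y[state["perm"]]
loss = -(y_mixed * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()
loss.backward()
handle.remove()  # detach the hook for validation / inference
```

The targets have to be mixed with the same lambda and permutation as the activations, which is why both live in the shared `state` dict here.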
Is it possible to use mixup for certain categories only? For example, skip mixup for the last category but use it for all the other categories?
Besides, I noticed that when mixup is enabled, the training loss always seems to be bigger than the validation loss. Is this normal?
Is there evidence that mixup as a data augmentation method generalizes beyond CIFAR10 and ImageNet?
You’d have to write your own version of the callback for that, but it’s possible (anything is possible, you just have to code it).
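For the per-category case, the mixing step of such a custom callback might look roughly like this. `selective_mixup` and `skip_class` are hypothetical names, and this is only one possible reading of "don't mix the last category": a sample is left untouched whenever it or its shuffled partner belongs to the skipped class.

```python
import torch

def selective_mixup(x, y, lam, skip_class):
    """Mix each sample with a shuffled partner unless either one is in skip_class.

    x: batch of inputs, y: integer class labels, lam: mixing coefficient.
    Returns the mixed inputs, both label vectors, and the per-sample lambda.
    """
    perm = torch.randperm(x.size(0))
    involved = (y == skip_class) | (y[perm] == skip_class)
    # Per-sample lambda: 1.0 means "keep the original sample untouched".
    lam_vec = torch.where(involved, torch.ones(len(y)), torch.full((len(y),), lam))
    lam_x = lam_vec.view(-1, *[1] * (x.dim() - 1))  # broadcast over the remaining dims
    x_mixed = lam_x * x + (1 - lam_x) * x[perm]
    return x_mixed, y, y[perm], lam_vec

torch.manual_seed(0)
x = torch.randn(8, 3, 4, 4)
y = torch.tensor([0, 1, 2, 3, 0, 1, 2, 3])
x_mixed, y_a, y_b, lam_vec = selective_mixup(x, y, lam=0.6, skip_class=3)
```

The loss would then be weighted per sample with `lam_vec`, so the untouched samples contribute an ordinary, unmixed loss.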
Mixup is a tool to avoid overfitting, so yes, it’s normal that your training loss is bigger. Try reducing the alpha parameter.
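To see why a smaller alpha helps, here is a quick stdlib-only simulation. It assumes lambda is drawn from Beta(alpha, alpha), as in the original mixup paper; `mean_mix_strength` is a made-up helper. It measures how far the blend is from "no mixing" on average.

```python
import random

def mean_mix_strength(alpha, n=20000, seed=0):
    """Average of min(lam, 1 - lam) for lam ~ Beta(alpha, alpha).

    0 means no mixing at all, 0.5 means a 50/50 blend of the two samples.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        lam = rng.betavariate(alpha, alpha)
        total += min(lam, 1 - lam)
    return total / n

for alpha in (0.1, 0.4, 1.0):
    print(alpha, round(mean_mix_strength(alpha), 3))
```

The average grows with alpha: small values push lambda toward 0 or 1, so most training samples stay close to an original image and the training loss moves closer to what you would see without mixup.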
I don’t see functionality for mixing embeddings in fastai’s mixup.py. Can you please share how you approach this? Are there plans to add such a feature to mixup?
I’m thinking of taking a crack at this at some point. Essentially you have to pull the embedding outputs for two data points and mix them, rather than simply being able to mix the images directly. Probably could be done pretty easily with a custom layer.
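Something along these lines is what I have in mind for the custom layer. `MixedEmbedding` is a hypothetical wrapper around `nn.Embedding`, not an existing fastai class, and the training loop is assumed to set `lam` and `perm` for each batch.

```python
import torch
import torch.nn as nn

class MixedEmbedding(nn.Module):
    """Embeds category indices, then blends each row with a shuffled partner."""

    def __init__(self, n_categories, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(n_categories, emb_dim)
        self.lam, self.perm = None, None  # set per batch by the training loop

    def forward(self, idx):
        out = self.emb(idx)
        # Only mix while training and only once lam/perm have been provided.
        if self.training and self.lam is not None:
            out = self.lam * out + (1 - self.lam) * out[self.perm]
        return out

torch.manual_seed(0)
layer = MixedEmbedding(n_categories=10, emb_dim=4)
idx = torch.tensor([1, 5, 7])
layer.lam, layer.perm = 0.8, torch.tensor([2, 0, 1])
mixed = layer(idx)  # each row is 0.8 * own embedding + 0.2 * partner embedding
```

The targets (and any continuous inputs) would need to be mixed with the same `lam` and `perm`, otherwise the labels no longer match the blended representation.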
@sgugger Seems like the idea of extending this to the embedding representations (and to all other layers!) has been explored:
They pick a random layer at which to mix for each minibatch, which seems to improve performance beyond mixup.
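My reading of that idea as a minimal sketch, not the paper's reference implementation. `ManifoldMixupNet` and `mix_at` are hypothetical names, and the set of layers that are eligible for mixing is an assumption on my part.

```python
import random
import torch
import torch.nn as nn

class ManifoldMixupNet(nn.Module):
    """Runs a list of blocks and mixes the activations going into one of them."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x, lam=None, perm=None, mix_at=None):
        # mix_at == k mixes the input of block k; k == 0 is ordinary input mixup.
        for k, block in enumerate(self.blocks):
            if mix_at == k:
                x = lam * x + (1 - lam) * x[perm]
            x = block(x)
        return x

torch.manual_seed(0)
net = ManifoldMixupNet([
    nn.Linear(5, 6),
    nn.Sequential(nn.ReLU(), nn.Linear(6, 6)),
    nn.Sequential(nn.ReLU(), nn.Linear(6, 3)),
])
x = torch.randn(4, 5)
lam, perm = 0.3, torch.randperm(4)
mix_at = random.randrange(len(net.blocks))  # a new random layer for every minibatch
out = net(x, lam, perm, mix_at)
```

As with plain mixup, the targets are combined with the same `lam` and `perm`; only the place where the inputs get blended changes from batch to batch.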
There’s also MixFeat:
Which to my understanding (and I’m still trying to fully grok the paper) is about applying noise at different layers in a mixing fashion, so rather than swapping the entire response at a given layer for a % of activations, they mix the activations together. I’m working on a layer that does the mixing from the batch, which should make it pretty efficient, in order to try and replicate their results. As far as I can tell it should be somewhat orthogonal to mixup and manifold mixup because it’s not mixing targets.
I’ll also try to take a crack at manifold mixup, assuming I can get the chance in the next few weeks since it covers the embedding case.
I had seen the manifold mixup paper, but hadn’t gotten significant improvement over simple mixup in my experiments.
I’ll look at this MixFeat paper. It sounds a lot like the Shake/Even method at first glance.
I’m familiar with Shake-Shake and Shake-Drop, but not Shake-Even. Can you link to the paper? I did some googling and came up dry. It might be similar, but there’s also a good chance I’m not doing it justice in my description of it. I’m still working my way through it.