Mixup data augmentation

wdhorton · December 18, 2018, 10:16pm

So for mixup with multilabel classification we should use y_stack=False? Could you explain what that’s doing, exactly?

sgugger · December 18, 2018, 10:26pm

Indeed you should. y_stack=True stacks the non-encoded targets of the batch, the targets of the shuffled version of the batch, and the lambda vector. y_stack=False just does the linear combination of the targets directly.
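
To illustrate the difference, here is a minimal NumPy sketch (not the fastai code; every name in it is made up for the example). It assumes single-label targets and a cross-entropy loss, and checks that keeping the stacked triple (targets, shuffled targets, lambda) and combining two losses gives the same result as mixing already-encoded targets directly. With multilabel data the targets are already encoded, so only the direct form applies.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_classes = 4, 3

# Fake model outputs, turned into log-probabilities with a log-softmax.
logits = rng.normal(size=(n, n_classes))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

y = rng.integers(0, n_classes, size=n)   # non-encoded (integer) targets
perm = rng.permutation(n)                # shuffled version of the batch
lam = rng.beta(0.4, 0.4, size=n)         # one lambda per sample (assumption)

# Stacked form: carry (y, y[perm], lam) to the loss and combine two losses there.
stacked = np.stack([y, y[perm], lam], axis=1)
y1 = stacked[:, 0].astype(int)
y2 = stacked[:, 1].astype(int)
rows = np.arange(n)
loss_stacked = stacked[:, 2] * -log_probs[rows, y1] + (1 - stacked[:, 2]) * -log_probs[rows, y2]

# Direct form: targets are already encoded (one-hot here), so mix them linearly.
y_enc = np.eye(n_classes)[y]
y_mixed = lam[:, None] * y_enc + (1 - lam[:, None]) * y_enc[perm]
loss_direct = -(y_mixed * log_probs).sum(axis=1)

print(np.allclose(loss_stacked, loss_direct))
```

The two losses agree here because cross-entropy is linear in the target distribution; for other losses the two forms are not necessarily equivalent.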

dror · February 14, 2019, 4:18am

Mixup is also effective for detection:
“Bag of Freebies for Training Object Detection Neural Networks”
Zhi Zhang, Tong He, Hang Zhang, Zhongyue Zhang, Junyuan Xie, Mu Li

dror · February 14, 2019, 4:20am

A question: is it possible to apply mixup to the activations of the last conv layer instead of to the images themselves? That is, to train only the head of a large convnet (like resnet50) by mixing up two 7x7x2048 or so activations?

sgugger · February 14, 2019, 2:30pm

Yes, you just have to use PyTorch hooks for that.
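
A rough sketch of what that could look like, assuming a toy model in place of resnet50 and a hypothetical `state` dict to pass the shuffle and lambda into the hook. It relies on the fact that a forward hook which returns a tensor replaces the module's output. This is an illustration only, not the fastai callback.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for a conv body (which would be frozen) plus a trainable head.
body = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(7))
head = nn.Sequential(nn.Flatten(), nn.Linear(8 * 7 * 7, 5))
model = nn.Sequential(body, head)

state = {}  # hypothetical container, filled before each training batch

def mix_activations(module, inputs, output):
    # Only mix during training; returning None leaves the output untouched.
    if not module.training or "perm" not in state:
        return None
    lam = state["lam"].view(-1, 1, 1, 1)
    return lam * output + (1 - lam) * output[state["perm"]]

handle = body.register_forward_hook(mix_activations)

x = torch.randn(4, 3, 32, 32)
y = torch.randint(0, 5, (4,))
state["perm"] = torch.randperm(4)
state["lam"] = torch.distributions.Beta(0.4, 0.4).sample((4,))

model.train()
logits = model(x)

# The targets are mixed in the loss with the same shuffle and lambda.
loss_fn = nn.CrossEntropyLoss(reduction="none")
lam, perm = state["lam"], state["perm"]
loss = (lam * loss_fn(logits, y) + (1 - lam) * loss_fn(logits, y[perm])).mean()
loss.backward()

handle.remove()  # detach the hook when done
```

In eval mode the hook returns None, so validation sees the unmixed activations.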

cooli46 · February 15, 2019, 9:23am

Hi @sgugger,

Is it possible to use mixup for certain categories only? For example, we would not use mixup for the last category, but would use it for all the other categories.

Besides, I notice that when mixup is enabled, the training loss always seems to be bigger than the validation loss. Is this normal?

Thanks.

kechan · February 16, 2019, 11:15pm

Is there evidence that this mixup method of data augmentation generalizes beyond CIFAR-10 and ImageNet?

sgugger · February 17, 2019, 2:56pm

You’d have to write your own version of the callback for that, but it’s possible (anything is possible, you just have to code it).
Mixup is a tool to avoid overfitting, so yes, it’s normal that your training loss is bigger. Try reducing the alpha parameter.
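
For intuition on why reducing alpha helps: in mixup the mixing weight lambda is drawn from a Beta(alpha, alpha) distribution, so a smaller alpha pushes lambda toward 0 or 1 and each mixed sample stays closer to one of the two originals. The snippet below is a small NumPy sketch of that effect; the helper name `mean_dominant_weight` is made up for the example.

```python
import numpy as np

def mean_dominant_weight(alpha, n_samples=100_000, seed=0):
    """Average weight given to the dominant sample when lambda ~ Beta(alpha, alpha)."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha, size=n_samples)
    # max(lam, 1 - lam) is the weight on whichever of the two samples dominates.
    return float(np.maximum(lam, 1 - lam).mean())

for alpha in (0.1, 0.4, 1.0):
    # Closer to 1.0 means the mixed input looks more like a single original image.
    print(alpha, round(mean_dominant_weight(alpha), 3))
```

With alpha = 1 lambda is uniform, so the dominant weight averages about 0.75; smaller values of alpha move it closer to 1, which means less aggressive mixing and a training loss closer to the unmixed one.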

jeremy · February 26, 2019, 5:34pm

Very interesting!

nosound · March 6, 2019, 3:02pm

I don’t see functionality for mixing embeddings in the mixup.py code in fastai. Can you please share how you approach this? Are there plans to add such a feature to mixup?

Even · March 9, 2019, 5:59am

I’m thinking of taking a crack at this at some point. Essentially you have to pull the embedding outputs for two data points and mix them, rather than simply being able to mix the images directly. Probably could be done pretty easily with a custom layer.

Even · March 25, 2019, 5:38pm

@sgugger Seems like the idea of extending this to the embedding representations (and to all other layers!) has been explored:

They pick a random layer to combine from each minibatch, which seems to improve performance beyond Mixup.

There’s also MixFeat:

Which to my understanding (and I’m still trying to fully grok the paper) is about applying noise at different layers in a mixing fashion, so rather than swapping the entire response at a given layer for a % of activations, they mix the activations together. I’m working on a layer that does the mixing from the batch, which should make it pretty efficient, in order to try and replicate their results. As far as I can tell it should be somewhat orthogonal to mixup and manifold mixup because it’s not mixing targets.

I’ll also try to take a crack at manifold mixup, assuming I can get the chance in the next few weeks since it covers the embedding case.

sgugger · March 25, 2019, 6:06pm

I had seen the manifold mixup paper, but hadn’t gotten significant improvement over simple mixup in my experiments.

I’ll look at this MixFeat paper. It sounds a lot like the Shake/Even method at first glance.

Even · March 25, 2019, 6:44pm

I’m familiar with Shake-Shake and Shake-Drop, but not Shake-Even. Can you link to the paper? I did some googling and came up dry. It might be similar, but there’s also a good chance I’m not doing it justice in my description of it. I’m still working my way through it.

sgugger · March 25, 2019, 6:47pm

It’s the same paper: https://arxiv.org/pdf/1705.07485.pdf
Though I think that’s what he calls Shake-Keep, not Shake-Even.

harikrishnanrajeev · April 29, 2019, 4:35pm

Have you tried mixup for a segmentation task?

Pablo · May 8, 2019, 3:11pm

Really intrigued by NLP + Mixup.

Have you done any more experiments on this?

By the way: in NLP the idea of mixing within the batch should be crucial, since otherwise the length of the documents won’t match!

With images Mixup seems easier to implement: from two images you create an input, then use the model normally. In NLP, Mixup occurs after embeddings, so I imagine you modify your model so that it takes two texts as input? (Only one would be given in validation, of course.)

KarlH · May 8, 2019, 5:49pm

The current implementation does mixup within a batch. The interesting difference is that for images, mixup is handled as a callback, while for NLP (due to mixup happening at the embedding), it needs to happen as part of your forward pass. You would probably:

1. Convert word idxs to embeddings
2. Get your shuffled idxs and λ values for the batch
3. Do the mixup (if training), run the rest of your forward pass, return a tuple of (prediction, shuffle, λ)
4. Use shuffle and λ in your loss function to do the mixup on your y values
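
The four steps above could be sketched roughly as follows. This is a NumPy toy, not fastai code: the embedding table, the mean-pooling "model", and the function names are all assumptions for illustration, and the batch is assumed to be padded to a common length already.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, emb_dim, n_classes = 50, 8, 3
bs, seq_len = 4, 6

emb_table = rng.normal(size=(vocab, emb_dim))   # toy embedding matrix
W = rng.normal(size=(emb_dim, n_classes))       # toy classifier weights

def forward(token_ids, training, alpha=0.4):
    emb = emb_table[token_ids]                  # convert word idxs to embeddings
    n = len(token_ids)
    if training:
        perm = rng.permutation(n)               # shuffled idxs for the batch
        lam = rng.beta(alpha, alpha, size=n)    # lambda values for the batch
        w = lam[:, None, None]
        emb = w * emb + (1 - w) * emb[perm]     # mixup on the embeddings
    else:
        perm, lam = np.arange(n), np.ones(n)    # identity: no mixing at validation
    logits = emb.mean(axis=1) @ W               # stand-in for the rest of the model
    return logits, perm, lam

def mixup_loss(logits, y, perm, lam):
    # Cross-entropy against y and against the shuffled y, weighted by lambda.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    rows = np.arange(len(y))
    return float((lam * -log_probs[rows, y] + (1 - lam) * -log_probs[rows, y[perm]]).mean())

tokens = rng.integers(0, vocab, size=(bs, seq_len))
y = rng.integers(0, n_classes, size=bs)

logits, perm, lam = forward(tokens, training=True)
print(mixup_loss(logits, y, perm, lam))
```

Returning an identity shuffle and λ = 1 outside training lets the same loss function be used for validation, where it reduces to plain cross-entropy.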

pooya_drv · May 10, 2019, 8:02am

Is mixup helpful for regression models? Has anyone tried it?

etremblay · May 25, 2019, 5:10pm

I found this paper, which seems really interesting: Manifold Mixup. If I understand correctly, it is very similar to input mixup, but the mixing happens at a random intermediate layer instead.
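
As I read it, the idea can be sketched like this in NumPy. The three-layer toy network, the per-sample lambda, and the function names are my own assumptions for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(h):
    return np.maximum(h, 0.0)

# Toy network: two hidden layers with ReLU and a linear output layer.
W1 = rng.normal(size=(10, 16))
W2 = rng.normal(size=(16, 16))
W3 = rng.normal(size=(16, 3))
layers = [lambda h: relu(h @ W1), lambda h: relu(h @ W2), lambda h: h @ W3]

def mix(h, perm, lam):
    # Blend each row with the row of its shuffled partner.
    w = lam.reshape(-1, *([1] * (h.ndim - 1)))
    return w * h + (1 - w) * h[perm]

def manifold_mixup_forward(x, layers, perm, lam, k):
    """Run the network, mixing the representation that feeds layer k."""
    h = x
    for i, layer in enumerate(layers):
        if i == k:
            h = mix(h, perm, lam)
        h = layer(h)
    return h

x = rng.normal(size=(4, 10))
perm = rng.permutation(4)
lam = rng.beta(0.4, 0.4, size=4)

k = int(rng.integers(0, len(layers)))   # random layer, redrawn for each minibatch
out = manifold_mixup_forward(x, layers, perm, lam, k)
print(k, out.shape)
```

With k = 0 this is ordinary input mixup; the targets would be mixed with the same shuffle and lambda whichever layer is picked.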