Mixup data augmentation

So for mixup with multilabel classification we should use y_stack=False? Could you explain what that’s doing, exactly?

Indeed you should. y_stack=True stacks the non-encoded targets of the batch, the shuffled version of those targets, and the lambda vector; y_stack=False just takes the linear combination of the targets directly.
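To make the difference concrete, here's a minimal sketch (not the fastai source; function names are illustrative) of the two behaviours. With stacking, the loss function later unpacks the triple; with direct combination, the multi-hot target vectors are mixed immediately, which is what you want for multilabel:

```python
import torch

# Hedged sketch of the two target-handling modes described above.
# `yb` holds class indices when stacking, multi-hot vectors when combining.

def mixup_y_stacked(yb, shuffle, lam):
    # stack (target, shuffled target, lambda); the loss unpacks this later
    return torch.stack([yb.float(), yb[shuffle].float(), lam], dim=1)

def mixup_y_combined(yb, shuffle, lam):
    # multilabel: directly take the linear combination of multi-hot targets
    lam = lam.unsqueeze(1)
    return lam * yb + (1 - lam) * yb[shuffle]

yb = torch.tensor([[1., 0., 1.], [0., 1., 0.]])   # multi-hot targets
shuffle = torch.tensor([1, 0])                     # in-batch permutation
lam = torch.tensor([0.7, 0.3])                     # mixing coefficients
mixed = mixup_y_combined(yb, shuffle, lam)
# mixed[0] == 0.7 * [1,0,1] + 0.3 * [0,1,0] == [0.7, 0.3, 0.7]
```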


Mixup is also effective for detection:
“Bag of Freebies for Training Object Detection Neural Networks”
Zhi Zhang, Tong He, Hang Zhang, Zhongyuan Zhang, Junyuan Xie, Mu Li


A question: is it possible to apply mixup to the last conv layer's activations instead of to the images themselves? That is, to train only the head of a large conv-net (like resnet50) by mixing up two 7x7x2048 (or so) activations?

Yes, you just have to use PyTorch hooks for that.
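Here's a rough sketch of that idea (illustrative names, not fastai code): a forward hook on the body that mixes its output activations within the batch, so only the head ever sees mixed features:

```python
import torch
import torch.nn as nn

# Hedged sketch: mix the body's output activations via a forward hook,
# instead of mixing the input images.
class ActivationMixup:
    def __init__(self, module, alpha=0.4):
        self.alpha, self.shuffle, self.lam = alpha, None, None
        self.hook = module.register_forward_hook(self._mix)

    def _mix(self, module, inputs, output):
        bs = output.size(0)
        self.shuffle = torch.randperm(bs, device=output.device)
        dist = torch.distributions.Beta(self.alpha, self.alpha)
        self.lam = dist.sample((bs,)).to(output.device)
        # broadcast lambda over the channel/spatial dims
        lam = self.lam.view(bs, *([1] * (output.dim() - 1)))
        # returning a tensor from a forward hook replaces the module output
        return lam * output + (1 - lam) * output[self.shuffle]

    def remove(self):
        self.hook.remove()

body = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
mixer = ActivationMixup(body)
x = torch.randn(4, 3, 7, 7)
out = body(x)   # activations are mixed in-batch by the hook
```

You'd still need to mix the targets with the same shuffle and lambda in your loss function.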

Hi @sgugger,

Is it possible to use mixup for certain categories only? For example, not using mixup for the last category, but using it for all other categories?

Also, I notice that when mixup is enabled, the training loss always seems to be bigger than the validation loss. Is this normal?


Is there evidence that this mixup method of data augmentation generalizes beyond CIFAR-10 and ImageNet?

You’d have to rewrite your version of the callback for that, but it’s possible (anything is possible, you just have to code it :wink: ).
Mixup is a tool to avoid overfitting, so yes, it's normal that your training loss is bigger. Try reducing the alpha parameter.
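To see why reducing alpha softens the regularization, here's a quick illustration (not fastai code): the mixing coefficient λ is drawn from Beta(α, α), and smaller α concentrates λ near 0 or 1, so each mixed image stays close to one of its two originals:

```python
import torch

# Illustration: lambda ~ Beta(alpha, alpha). Smaller alpha pushes lambda
# toward 0 or 1, i.e. weaker mixing and gentler regularization.
def sample_lambda(alpha, n=10000, seed=0):
    torch.manual_seed(seed)
    lam = torch.distributions.Beta(alpha, alpha).sample((n,))
    # keep the larger of (lam, 1 - lam) so the "main" image dominates;
    # a common trick, used here purely for illustration
    return torch.max(lam, 1 - lam)

strong = sample_lambda(0.4)   # noticeable mixing
weak = sample_lambda(0.1)     # lambda mostly near 1: barely mixed
```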


Very interesting!

I don’t see functionality for embedding mixing in the mixup.py code in fastai — can you please share how you approach this? Are there plans to add such a feature to mixup?


I’m thinking of taking a crack at this at some point. Essentially you have to pull the embedding outputs for two data points and mix them, rather than simply being able to mix the images directly. Probably could be done pretty easily with a custom layer.
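A hedged sketch of that "custom layer" idea (names are illustrative, not an existing fastai API): a module that sits right after the embedding and mixes each example's embedded sequence with another one from the same batch:

```python
import torch
import torch.nn as nn

# Sketch: mix embedded sequences within the batch, after the embedding layer.
class EmbeddingMixup(nn.Module):
    def __init__(self, alpha=0.4):
        super().__init__()
        self.alpha, self.shuffle, self.lam = alpha, None, None

    def forward(self, emb):                      # emb: (batch, seq_len, emb_dim)
        if not self.training:                    # identity at validation time
            return emb
        bs = emb.size(0)
        self.shuffle = torch.randperm(bs, device=emb.device)
        self.lam = torch.distributions.Beta(self.alpha, self.alpha) \
                        .sample((bs,)).to(emb.device)
        lam = self.lam.view(bs, 1, 1)
        return lam * emb + (1 - lam) * emb[self.shuffle]

emb = nn.Embedding(100, 16)
mix = EmbeddingMixup()
mix.train()
tokens = torch.randint(0, 100, (4, 12))
mixed = mix(emb(tokens))   # same shape, mixed within the batch
```

The stored shuffle and lam would then be reused to mix the targets in the loss.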


@sgugger Seems like the idea of extending this to the embedding representations (and to all other layers!) has been explored:

They pick a random layer to combine from each minibatch, which seems to improve performance beyond Mixup.

There’s also MixFeat:

Which to my understanding (and I’m still trying to fully grok the paper) is about applying noise at different layers in a mixing fashion, so rather than swapping the entire response at a given layer for a % of activations, they mix the activations together. I’m working on a layer that does the mixing from the batch, which should make it pretty efficient, in order to try and replicate their results. As far as I can tell it should be somewhat orthogonal to mixup and manifold mixup because it’s not mixing targets.
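Here's a very rough sketch of my reading of that "mixing from the batch" layer — the coefficient sampling here is illustrative, not the paper's exact formulation, and note that unlike mixup it leaves the targets untouched:

```python
import torch
import torch.nn as nn

# Rough sketch of batch-wise feature mixing: perturb each example's
# activations by blending in a small fraction of another example's
# activations from the same batch, WITHOUT mixing the targets.
class BatchFeatureMix(nn.Module):
    def __init__(self, sigma=0.2):
        super().__init__()
        self.sigma = sigma   # illustrative noise scale, not from the paper

    def forward(self, x):
        if not self.training:
            return x
        bs = x.size(0)
        shuffle = torch.randperm(bs, device=x.device)
        # small random mixing coefficient per example
        r = self.sigma * torch.rand(bs, device=x.device)
        r = r.view(bs, *([1] * (x.dim() - 1)))
        return (1 - r) * x + r * x[shuffle]

layer = BatchFeatureMix()
layer.train()
out = layer(torch.randn(8, 16, 4, 4))
```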

I’ll also try to take a crack at manifold mixup, assuming I can get the chance in the next few weeks since it covers the embedding case.


I had seen the manifold mixup paper, but hadn’t gotten significant improvement over simple mixup in my experiments.

I’ll look at this MixFeat paper. It sounds a lot like the Shake/Even method at a first glance.


I’m familiar with Shake-Shake and Shake-Drop, but not Shake-Even. Can you link to the paper? I did some googling and came up dry. It might be similar, but there’s also a good chance I’m not doing it justice in my description of it. I’m still working my way through it.

It’s the same paper: https://arxiv.org/pdf/1705.07485.pdf
Though I think that’s what he calls Shake-Keep, not Shake-Even.


Have you tried mixup for a segmentation task?


Really intrigued by NLP + Mixup.

Have you done any more experiments on this?

By the way: in NLP the idea of mixing within the batch should be crucial, since otherwise the length of the documents won’t match!

With images Mixup seems easier to implement: from two images you create an input, then use the model normally. In NLP, Mixup occurs after embeddings, so I imagine you modify your model so that it takes two texts as input? (Only one would be given in validation, of course.)

The current implementation does mixup within a batch. The interesting difference is that for images, mixup is handled as a callback, while for NLP (due to mixup happening at the embedding), it needs to happen as part of your forward pass. You would probably:

  1. Convert word idxs to embeddings
  2. Get your shuffled idxs and λ values for the batch
  3. Do the mixup (if training), run the rest of your forward pass, return a tuple of (prediction, shuffle, λ)
  4. Use shuffle and λ in your loss function to do the mixup on your y values
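Step 4 above can be sketched like this (illustrative names, not fastai code): the loss mixes on the y side by blending the per-example losses against the original and shuffled targets:

```python
import torch
import torch.nn.functional as F

# Sketch of step 4: mix the targets in the loss using the shuffle and
# lambda returned from the forward pass.
def mixup_loss(pred, y, shuffle, lam):
    loss_a = F.cross_entropy(pred, y, reduction='none')
    loss_b = F.cross_entropy(pred, y[shuffle], reduction='none')
    return (lam * loss_a + (1 - lam) * loss_b).mean()

pred = torch.randn(4, 5)                  # model output from the forward pass
y = torch.tensor([0, 1, 2, 3])
shuffle = torch.tensor([2, 3, 0, 1])
lam = torch.tensor([0.8, 0.6, 0.9, 0.7])
loss = mixup_loss(pred, y, shuffle, lam)
```

With λ = 1 everywhere this reduces to plain cross-entropy, which is why validation (no mixing) can use the ordinary loss.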

Is mixup helpful for regression models? Has anyone tried it?


Found this paper that seems really interesting: Manifold Mixup. If I understand correctly, it is very similar to input mixup, but instead we mix at a random intermediate layer.
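A hedged sketch of that idea (a toy network with made-up sizes, not the paper's code): each batch, pick a random layer and mix the hidden representation there, with layer 0 reducing to ordinary input mixup:

```python
import torch
import torch.nn as nn

# Sketch of the Manifold Mixup idea: pick a random layer per batch and mix
# the hidden representation there (mixing before block 0 == input mixup).
class ManifoldMixupNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(10, 32), nn.ReLU()),
            nn.Sequential(nn.Linear(32, 32), nn.ReLU()),
            nn.Linear(32, 2),
        ])

    def forward(self, x, shuffle=None, lam=None):
        # only mix when shuffle/lam are provided (i.e. during training)
        mix_at = torch.randint(len(self.blocks), (1,)).item() \
                 if shuffle is not None else -1
        for i, block in enumerate(self.blocks):
            if i == mix_at:
                x = lam.view(-1, 1) * x + (1 - lam.view(-1, 1)) * x[shuffle]
            x = block(x)
        return x

net = ManifoldMixupNet()
x = torch.randn(4, 10)
shuffle = torch.randperm(4)
lam = torch.distributions.Beta(0.4, 0.4).sample((4,))
out = net(x, shuffle, lam)   # targets must be mixed the same way in the loss
```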
