Mixup data augmentation

I think that’s what Sylvain is suggesting. Treat each variable independently and do the mix on the embeddings. For continuous variables I think you just mix them directly.
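A minimal sketch of that idea, assuming one embedding table per categorical variable. The names `EMBEDDINGS`, `embed_row`, `lerp` and `mix_rows`, and all the values, are made up for illustration and are not fastai API. Each categorical value is looked up in its own embedding, the embedding vectors of the two rows are interpolated, and the continuous values are interpolated directly.

```python
# Hypothetical embedding tables, one per categorical variable (illustrative values).
EMBEDDINGS = {
    "color": {"red": [1.0, 0.0], "blue": [0.0, 1.0]},
    "size": {"small": [0.5, 0.5, 0.0], "large": [0.0, 0.5, 0.5]},
}

def embed_row(cats):
    """Look up the embedding vector of each categorical variable independently."""
    return {name: EMBEDDINGS[name][value] for name, value in cats.items()}

def lerp(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two equal-length vectors."""
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)]

def mix_rows(row_a, row_b, lam):
    """Mix two tabular rows: embeddings for categorical variables, raw values for continuous ones."""
    emb_a, emb_b = embed_row(row_a["cats"]), embed_row(row_b["cats"])
    mixed_cats = {name: lerp(emb_a[name], emb_b[name], lam) for name in emb_a}
    mixed_conts = lerp(row_a["conts"], row_b["conts"], lam)
    return mixed_cats, mixed_conts

row_a = {"cats": {"color": "red", "size": "small"}, "conts": [10.0, 2.0]}
row_b = {"cats": {"color": "blue", "size": "large"}, "conts": [20.0, 4.0]}
mixed_cats, mixed_conts = mix_rows(row_a, row_b, lam=0.75)
```

In a real model the embeddings are learned layers, so the mix would have to happen after the embedding lookup inside the network rather than on the raw inputs.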

What is the ground-truth label when you mix up two images from different classes?
And what does “the target” mean, please?

@Gopeth see the paper:


Thank you, I’ll read the paper first.

Reading the code of the Mixup callback, I see that the booleans:


are hard-coded, and for my particular classification problem, where my target has shape (bs, n_classes), I can’t understand this line:

if self.stack_y:
    new_target = torch.cat([last_target[:,None].float(), y1[:,None].float(), lambd[:,None].float()], 1)

Computing the sizes:
cat([(bs, n_classes, 1), (bs, n_classes, 1), (bs, 1)]) does not work for me.
Why is it not just the weighted sum?

OK, I understand something now: I should use stack_y=True for classification and stack_y=False for multilabel.

It is missing some .float() calls; I submitted a PR.


I’ve been reading about and testing mixup this evening but am a bit confused. In the example in the fastai documentation, a model is trained with and without mixup and the results are compared. Using mixup seems to make the loss larger and the accuracy lower for the same number of epochs. It does the same for my image dataset. But the paper shows otherwise. Do I have to change other regularisation, like lowering dropout and weight decay, to get the benefit?

I am blown away by mixup’s elegance and simplicity.

However, the implementation as described glossed over an important issue, which perplexes me. In a classification problem, labels are discrete. But a convex combination of discrete labels is not a discrete label. Thus, applying mixup transforms a classification problem into a regression problem, since it maps discrete target labels to a continuous space. Have I misunderstood something?

Instead of forming a convex combination of a pair of one-hot encoded labels, it is more natural (to me) to form convex combinations of their softmax probabilities, then assign labels by thresholding (using empirical thresholds for each class). In this way, we could handle classification problems where the target is allowed to have multiple labels.

Suppose the labels can have N classes. If the softmax probabilities for examples i and k are

\{P_{i1}, P_{i2}, \ldots, P_{iN}\}, \quad \{P_{k1}, P_{k2}, \ldots, P_{kN}\}

Applying the mixup mapping, we get softmax probabilities

\{P_{mixup}\} = \{P_1, P_2, \ldots, P_N\} = \\ \{\lambda P_{i1} + (1-\lambda)P_{k1}, \lambda P_{i2} + (1-\lambda)P_{k2}, \ldots, \lambda P_{iN} + (1-\lambda)P_{kN}\}



where \lambda is drawn from the beta distribution

\lambda \sim \mathrm{Beta}(\alpha, \alpha)

and, according to the empirical studies presented in the paper,

\alpha = 0.4
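
As a rough sketch of the sampling and mixing step described above, using only the Python standard library: the helper names `sample_lambda` and `mix_probs` and the two probability vectors are illustrative assumptions, not code from the paper or from fastai.

```python
import random

def sample_lambda(alpha=0.4, rng=random):
    """Draw the mixing coefficient lambda from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)

def mix_probs(p_i, p_k, lam):
    """Class-wise convex combination lam * P_i + (1 - lam) * P_k."""
    return [lam * a + (1 - lam) * b for a, b in zip(p_i, p_k)]

rng = random.Random(0)   # seeded only so the sketch is reproducible
p_i = [0.7, 0.2, 0.1]    # illustrative softmax output for example i
p_k = [0.1, 0.1, 0.8]    # illustrative softmax output for example k
lam = sample_lambda(0.4, rng)
p_mixup = mix_probs(p_i, p_k, lam)
```

Because both inputs sum to 1 and the weights `lam` and `1 - lam` sum to 1, `p_mixup` is again a probability vector.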

Next, we would determine the label(s) of the mixup example from its softmax probabilities \{P_{mixup}\} by applying the appropriate thresholding.
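
The thresholding step could look something like this. It is only a sketch: `assign_labels` is a hypothetical helper, and the per-class thresholds are placeholders that would have to be tuned empirically.

```python
def mix_probs(p_i, p_k, lam):
    """Class-wise convex combination lam * P_i + (1 - lam) * P_k."""
    return [lam * a + (1 - lam) * b for a, b in zip(p_i, p_k)]

def assign_labels(probs, thresholds):
    """Return every class index whose mixed probability reaches its threshold."""
    return [c for c, (p, t) in enumerate(zip(probs, thresholds)) if p >= t]

p_mixup = mix_probs([0.9, 0.1, 0.0], [0.1, 0.8, 0.1], lam=0.7)
thresholds = [0.25, 0.25, 0.25]   # placeholder values, one per class
labels = assign_labels(p_mixup, thresholds)
```

With these numbers the mixed example keeps both of its parents' dominant classes, which is how this scheme would produce multi-label targets.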

@Even and @jeremy, is this what you meant in your previous comments?


Can’t the one-hot encoded labels (only 0 and 1) already be interpreted as the softmax probability targets? Applying the convex combination will yield something like 0.3, 0.7, as in the very first post, which seems fine to me, if I’m not missing anything.
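
To make that concrete, here is a small standard-library sketch (the helper names are illustrative, not from any library). It mixes two one-hot labels into a soft target and evaluates the ordinary cross-entropy against it, so the loss stays a classification loss with a probabilistic target.

```python
import math

def one_hot(index, n_classes):
    """One-hot encode a class index as a list of floats."""
    return [1.0 if c == index else 0.0 for c in range(n_classes)]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Cross-entropy of predicted probabilities against a (possibly soft) target distribution."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

lam = 0.3
# Mixed target: approximately [0.3, 0.7], non-negative and summing to 1.
target = [lam * a + (1 - lam) * b for a, b in zip(one_hot(0, 2), one_hot(1, 2))]

# A prediction that matches the soft target versus one that is confident in the minority class.
good = cross_entropy(softmax([math.log(0.3), math.log(0.7)]), target)
bad = cross_entropy(softmax([2.0, -2.0]), target)
```

Here `good` is smaller than `bad`: the loss rewards predictions whose probabilities match the mixing proportions.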

@Even Can you explain "In my head the softmax should be creating an exponential relationship for the translation"? I’m mostly interested in this with respect to image classification.

Has anyone gotten good results with the technique? This reminds me of Bengio’s talk (https://youtu.be/Yr1mOzC93xs?t=975), where he says that neural nets project all the inputs onto a linear space where the inputs all lie on a flat plane and a combination of them yields another valid input. If anyone can explain what he is saying, it would help us understand why mixup works better.

Yes, there should be 2 softmaxes, or something similar.

Hey Joseph,

You’ve described my thought process much more elegantly than I could have. I think @jeremy is exactly right, and that it should be a mixture of two softmaxes in the targets, although I think his earlier comment regarding the fact that this is drawn from a beta distribution is also important as most samples will have a dominant class.

Combining the images and loss in a linear way may also be helping with regularization. In the majority of cases one of the classes is dominant, and relative to its signal the other class is adding some noise to the input and targets sampled from the other classes. In order to get the correct scores, the signal from this secondary class has to be much stronger, which to me intuitively sounds like it would make the network more robust and better able to differentiate between classes. Maybe noise is the wrong term, but you get the idea.

It’s such an interesting paper/concept. One of my favourites of the year.


Interestingly, in Amazon’s latest paper, titled “Bag of Tricks for Image Classification with Convolutional Neural Networks”, they used mixup to get an additional percentage point on their CV models.


So for mixup with multilabel classification we should use stack_y=False? Could you explain what that’s doing, exactly?

Indeed you should. stack_y=True stacks the non-encoded targets of the batch, the targets of the shuffled version of the batch, and the lambda vector. stack_y=False just does the linear combination of the targets directly.
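
A small sketch of the two target formats for a single example, in plain Python rather than the fastai code. It assumes that the loss applied to a stacked target weights the two per-class losses by lambda; the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]

def loss_from_stacked(logits, stacked):
    """Stacked style: the target row is (class_a, class_b, lam); mix the two losses."""
    class_a, class_b, lam = stacked
    logp = log_softmax(logits)
    return lam * -logp[int(class_a)] + (1 - lam) * -logp[int(class_b)]

def loss_from_combined(logits, soft_target):
    """Combined style: the target is already lam * y_a + (1 - lam) * y_b."""
    logp = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(soft_target, logp))

logits = [1.5, -0.3, 0.2]
stacked = (0.0, 2.0, 0.8)     # class 0, shuffled partner class 2, lam = 0.8
combined = [0.8, 0.0, 0.2]    # 0.8 * one_hot(0) + 0.2 * one_hot(2)

loss_a = loss_from_stacked(logits, stacked)
loss_b = loss_from_combined(logits, combined)
```

For cross-entropy the two give the same value, so stacking is a way to keep integer class targets without one-hot encoding them. Multilabel targets are already encoded as vectors, which is why the direct linear combination applies there.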

Mixup is also effective for detection:
“Bag of Freebies for Training Object Detection Neural Networks”
Zhi Zhang, Tong He, Hang Zhang, Zhongyuan Zhang, Junyuan Xie, Mu Li


A question: is it possible to apply mixup to the activations of the last conv layer instead of to the images themselves? That is, to train only the head of a large convnet (like resnet50) by mixing up two 7x7x2048 or so activations?

Yes, you just have to use PyTorch hooks for that.

Hi @sgugger,

Is it possible to use mixup for certain categories only? For example, skip mixup for the last category but use it for all the other categories?

Besides, I notice that when mixup is enabled, the training loss always seems to be bigger than the validation loss. Is this normal?


Is there evidence that this mixup method of data augmentation generalizes beyond CIFAR10 and ImageNet?

You’d have to rewrite your version of the callback for that, but it’s possible (anything is possible, you just have to code it :wink: ).
Mixup is a tool to avoid overfitting, so yes, it’s normal that your training loss is bigger. Try reducing the alpha parameter.
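
To see why a smaller alpha means weaker mixing, here is a quick standard-library simulation (a sketch; `mean_dominant_weight` is a made-up helper). It estimates the average weight of the dominant example, max(lambda, 1 - lambda), for lambda drawn from Beta(alpha, alpha).

```python
import random

def mean_dominant_weight(alpha, n_samples=20000, seed=0):
    """Average of max(lam, 1 - lam) for lam ~ Beta(alpha, alpha)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        lam = rng.betavariate(alpha, alpha)
        total += max(lam, 1 - lam)
    return total / n_samples

# Smaller alpha concentrates lambda near 0 or 1, so one example dominates the mix.
dominance = {alpha: mean_dominant_weight(alpha) for alpha in (0.1, 0.4, 1.0)}
```

The average dominant weight grows as alpha shrinks, so each mixed example looks more like a single unmixed one and the regularization effect is milder.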
