Mixup data augmentation

Hi,

I tried applying mixup to the text classifier. To get it to run, all I did was change this one line from:
new_input = (last_input * lambd.view(lambd.size(0),1,1,1) + x1 * (1-lambd).view(lambd.size(0),1,1,1))
to:
new_input = (last_input * lambd.view(lambd.size(0),1) + x1 * (1-lambd).view(lambd.size(0),1))

My accuracy remained pretty similar. Am I missing something here or should it work with just this change?
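
For what it's worth, a rank-agnostic way to write that line (just a sketch reusing the callback's own variable names, and assuming last_input is a float tensor such as embeddings rather than raw token ids) would be:

# Build a broadcast shape like (batch, 1, ..., 1) that matches the rank of the input,
# so the same line works for image batches (B, C, H, W) and for 2-D text batches.
out_shape = [lambd.size(0)] + [1] * (last_input.dim() - 1)
new_input = (last_input * lambd.view(out_shape) + x1 * (1-lambd).view(out_shape))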

1 Like

Hi,
I’m trying to use mixup in a text classification task too. I’m not working on a fast.ai project, but trying to apply mixup to BERT. However, I’m not sure if I did it correctly. Have you succeeded in using mixup in your project?

In the case of a multi-class segmentation problem, should the default settings of stack_x and stack_y work, i.e. stack_x=False, stack_y=True?

You probably need stack_y=False for multiclassification.

1 Like

How about stack_x?
If we set stack_x=True, would it linearly mix the two class images together?

@sgugger below are the changes I made to make mixup work for segmentation problems. Let me know if this looks OK.

import numpy as np
import torch
from fastai.basic_train import Learner, LearnerCallback
from fastai.callbacks.mixup import MixUpLoss

class MixUpCallback1(LearnerCallback):
    "Callback that creates the mixed-up input and target."
    def __init__(self, learn:Learner, alpha:float=0.4, stack_x:bool=False, stack_y:bool=False):
        super().__init__(learn)
        self.alpha,self.stack_x,self.stack_y = alpha,stack_x,stack_y

    def on_train_begin(self, **kwargs):
        if self.stack_y: self.learn.loss_func = MixUpLoss(self.learn.loss_func)

    def on_batch_begin(self, last_input, last_target, train, **kwargs):
        "Applies mixup to `last_input` and `last_target` if `train`."
        if not train: return
        # Draw one lambda per sample and keep max(lambda, 1-lambda) so the original element dominates.
        lambd = np.random.beta(self.alpha, self.alpha, last_target.size(0))
        lambd = np.concatenate([lambd[:,None], 1-lambd[:,None]], 1).max(1)
        lambd = last_input.new(lambd)
        # Shuffle the batch to get the second element of each mixup pair.
        shuffle = torch.randperm(last_target.size(0)).to(last_input.device)
        x1, y1 = last_input[shuffle], last_target[shuffle]
        if self.stack_x:
            new_input = [last_input, x1, lambd]
        else:
            # Broadcast lambda over all non-batch dimensions of the input.
            out_shape = [lambd.size(0)] + [1 for _ in range(len(x1.shape) - 1)]
            new_input = (last_input * lambd.view(out_shape) + x1 * (1-lambd).view(out_shape))
        if self.stack_y:
            new_target = torch.cat([last_target[:,None].float(), y1[:,None].float(), lambd[:,None].float()], 1)
        else:
            if len(last_target.shape) == 2:
                lambd = lambd.unsqueeze(1).float()
            # Broadcast lambda over all non-batch dimensions of the target as well.
            out_shape = [lambd.size(0)] + [1 for _ in range(len(y1.shape) - 1)]
            # With fp16 the target must be cast to half here and converted back to long before the loss.
            new_target = last_target.half() * lambd.view(out_shape) + y1.half() * (1-lambd).view(out_shape)
        return {'last_input': new_input, 'last_target': new_target}

    def on_train_end(self, **kwargs):
        if self.stack_y: self.learn.loss_func = self.learn.loss_func.get_old()
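
To try it out, attaching the callback would look something like this (a hypothetical usage sketch assuming fastai v1; `data` stands for your segmentation DataBunch):

from fastai.vision import unet_learner, models

# Hypothetical usage sketch: attach MixUpCallback1 to a standard fastai v1 segmentation learner.
learn = unet_learner(data, models.resnet34)
cb = MixUpCallback1(learn, alpha=0.4, stack_x=False, stack_y=False)
learn.fit_one_cycle(5, callbacks=[cb])

If you train with learn.to_fp16(), the .half() cast above matters; as noted in the comment, convert the mixed target back to the dtype your loss function expects before computing the loss.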
5 Likes

Ah yes, this is probably necessary as well. Would you mind adding it in a PR?

Sure… is there a process that is followed for this? I am totally new to the PR process.

Did you ever submit this?

Note that there is a new variant of mixup out there: Manifold Mixup.

They do the mixup step in a random inner layer, which leads to increased benefits and would make it easy to apply the concept to other domains such as NLP (using the last layer).

3 Likes

This looks incredible as it addresses the biggest deep learning concern of improving generalization. Thanks for posting.
Now we just need a fastai implementation to put it to use!

I feel like it would be doable to recycle a lot of the code of the existing mixup implementation. I might try to implement a prototype this weekend if I have the time… (but feel free to beat me to it :slight_smile: )

1 Like

Here is my fastai V1 implementation and a small demo notebook.

You can expect manifold mixup and output mixup to be slower and to consume a bit more memory (due to needing two forward passes) (not anymore thanks to @MicPie), but they let you use a larger learning rate and might be worth it overall.

On the tiny benchmark, manifold mixup is not noticeably better than input mixup (but the author says that the benefits appear after a large number of epochs); however, I observed nice speed-ups in convergence with output mixup. Now we need to validate that on a larger dataset.

I will test it on a private dataset this week, but I would be happy to get an outside benchmark comparing no mixup, input mixup, manifold mixup and output mixup (@LessW2020 ?).

4 Likes

Have a look at the docs for how traditional mixup was implemented in the fastai library: https://docs.fast.ai/callbacks.mixup.html#Mixup-implementation-in-the-library

There the tricks are nicely outlined and I guess they should be applicable to manifold mixup.

I already did and took some ideas :wink:

Input mixup does not need two forward passes, as you can mix the inputs and then do a single pass, but here you do the mixing with intermediate results.

However, to compensate for that downside, I added a flag to use both passes fully and get two outputs per input pair (set to True by default): epochs are still slower, but they are twice as informative.

(I believe I should be able to reduce memory usage but that’s about it)

As far as I understand (which is not necessarily correct), you could also apply this approach from the docs to the intermediate activations?

You can get the mixup activations for image 1 from the batch and the activations for image 2 from the same but shuffled batch (the shuffling by random permutation would be across the batch dimension; see the corresponding lines of code in the fastai v1 codebase).

PS: I looked into the paper to verify my claim and to check whether this makes sense at all, and they also mention it under “2. Manifold Mixup”:

However, as long as the implementation works, it is great as it is, but with this tweak it could be optimized further. :slight_smile:

1 Like

I am already using a single batch and a shuffled version.

The problem is that input mixup can be summarised as:

input = a*input1 + (1-a)*input2
output = f(g(input))  # single forward pass

While here we are doing:

intermediate1 = g(input1)  # first partial forward pass
intermediate2 = g(input2)  # second partial forward pass
intermediate = a*intermediate1 + (1-a)*intermediate2
output = f(intermediate)  # end of the forward pass

Thus we pay for roughly two forward passes (which is not twice as slow, as we are still loading a single batch and doing a single backward pass).

I compensate by (optionally) producing two outputs:

intermediate1 = g(input1)
intermediate2 = g(input2)
intermediate12 = a*intermediate1 + (1-a)*intermediate2
intermediate21 = a*intermediate2 + (1-a)*intermediate1
output1 = f(intermediate12)
output2 = f(intermediate21)

Which is not slower but gives us more bang for our buck.

I think I finally realized what you mean :partying_face:. Doing something like this:

intermediate1 = g(input)
intermediate2 = intermediate1[shuffle]  # shuffle along the batch dimension
intermediate = a*intermediate1 + (1-a)*intermediate2
output = f(intermediate)

I believe that would work and get everything (performance/memory) to the same level as input mixup. You can expect an update to the project within a few hours… (done!)
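
For reference, here is a minimal plain-PyTorch sketch of that single-pass idea (this is not the repository's actual code; the class name ManifoldMixupSketch is made up, and it assumes the hooked module outputs a single tensor with the batch dimension first):

import torch
import torch.nn as nn

class ManifoldMixupSketch:
    "Mix the activations of `module` with a shuffled copy of the same batch (single forward pass)."
    def __init__(self, module:nn.Module, alpha:float=0.4):
        self.alpha, self.lam, self.shuffle = alpha, 1.0, None
        self.handle = module.register_forward_hook(self._hook)

    def sample(self, batch_size, device):
        # One lambda per batch here; the fastai version draws one per sample and keeps max(lam, 1-lam).
        self.lam = float(torch.distributions.Beta(self.alpha, self.alpha).sample())
        self.shuffle = torch.randperm(batch_size, device=device)

    def _hook(self, module, inputs, output):
        if self.shuffle is None: return output
        # Returning a value from a forward hook replaces the module's output.
        return self.lam * output + (1 - self.lam) * output[self.shuffle]

    def remove(self):
        self.handle.remove()

During training you would call sample(x.size(0), x.device) before each forward pass and mix the loss with the same lambda and permutation, e.g. loss = lam * criterion(out, y) + (1 - lam) * criterion(out, y[shuffle]).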

2 Likes

I was also not sure and wanted to ask the same, i.e., how you get intermediate2.
Sometimes it is tricky to discuss over text messages. :smiley:

I am looking forward to your implementation!

The repository has been updated and tested: performance is now on par with input mixup :smiley:

(plus the code is shorter and I was able to make it more similar to fastai’s input mixup implementation)

1 Like