I tried applying mixup to the text classifier. To get it to run, all I did was change this one line from:

```python
new_input = (last_input * lambd.view(lambd.size(0),1,1,1) + x1 * (1-lambd).view(lambd.size(0),1,1,1))
```

to:

```python
new_input = (last_input * lambd.view(lambd.size(0),1) + x1 * (1-lambd).view(lambd.size(0),1))
```
My accuracy stayed about the same as without mixup. Am I missing something here, or should it work with just this change?
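For what it’s worth, here is a minimal sketch of what that broadcasting change amounts to; the tensor shapes and variable names below are just illustrative assumptions, not the fastai code:

```python
import torch

batch_size = 4
lambd = torch.rand(batch_size)  # one mixing coefficient per sample

# Image case: the mixed tensors are (batch, channels, height, width),
# so lambd needs three trailing singleton dims to broadcast.
imgs = torch.randn(batch_size, 3, 32, 32)
imgs_shuffled = torch.randn(batch_size, 3, 32, 32)
mixed_imgs = (imgs * lambd.view(batch_size, 1, 1, 1)
              + imgs_shuffled * (1 - lambd).view(batch_size, 1, 1, 1))

# Text case: if the tensor being mixed is 2D (batch, features), one trailing
# dim is enough, which is what the one-line change above does. If it were 3D
# (batch, seq_len, emb_dim), you would need .view(batch_size, 1, 1) instead.
feats = torch.randn(batch_size, 768)
feats_shuffled = torch.randn(batch_size, 768)
mixed_feats = (feats * lambd.view(batch_size, 1)
               + feats_shuffled * (1 - lambd).view(batch_size, 1))
```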
Hi,
I’m trying to use mixup in a text classification task too. I’m not working on a fast.ai project, but am trying to apply mixup to BERT, and I’m not sure I did it correctly. Have you succeeded in using mixup in your project?
Note that there is a new variant of mixup out there: Manifold Mixup.
They do the mixup step in a random inner layer, which they report gives larger benefits and would make it easy to apply the concept to other domains such as NLP (e.g. by mixing at the last layer).
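Roughly, the idea looks like the sketch below; the toy network, the two-pass structure and the layer choice are my own illustration of the concept, not the paper’s reference code:

```python
import random
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Toy network split into blocks so mixup can be applied after any of them."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(10, 10), nn.Linear(10, 10)])
        self.head = nn.Linear(10, 2)

    def forward(self, x, x2=None, lambd=None, mix_layer=None):
        for i, block in enumerate(self.blocks):
            x = torch.relu(block(x))
            if x2 is not None:
                x2 = torch.relu(block(x2))
                # Manifold mixup: interpolate the hidden activations of the two
                # samples at the randomly chosen layer, then continue with the mix.
                if i == mix_layer:
                    x = lambd * x + (1 - lambd) * x2
                    x2 = None
        return self.head(x)

model = TinyNet()
x1, x2 = torch.randn(8, 10), torch.randn(8, 10)      # a batch and its paired batch
lambd = torch.distributions.Beta(0.4, 0.4).sample()  # mixing coefficient
out = model(x1, x2, lambd, mix_layer=random.randrange(2))
# The loss would then be mixed the same way:
# lambd * loss(out, y1) + (1 - lambd) * loss(out, y2)
```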
This looks incredible as it addresses the biggest deep learning concern of improving generalization. Thanks for posting.
Now we just need a fastai impl to put it to use!
I feel like a lot of the code of the existing mixup implementation could be reused. I might try to implement a prototype this weekend if I have the time… (but feel free to beat me to it!)
You can expect manifold mixup and output mixup to be slower and to consume a bit more memory (due to needing two forward passes) (not anymore thanks to @MicPie), but they let you use a larger learning rate and might be worth it overall.
On the tiny benchmark, manifold mixup is not noticeably better than input mixup (but the authors say the benefits appear after a large number of epochs), whereas I observed a nice speed-up in convergence with output mixup. Now we need to validate that on a larger dataset.
I will test it on a private dataset this week, but I would be happy to see an outside benchmark comparing no mixup, input mixup, manifold mixup and output mixup (@LessW2020?).
Input mixup does not need two forward passes, since you can mix the inputs and then do a single pass, but here the mixing is done on intermediate results.
However, to compensate for that downside, I added a flag to use both passes fully and get two outputs per input pair (set to True by default): epochs are still slower, but they are twice as informative.
(I believe I should be able to reduce memory usage but that’s about it)
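To illustrate the cost difference, here is a rough sketch; the function names and the reading of the two-outputs flag are my own assumptions, not the actual project code:

```python
import torch
import torch.nn as nn

def input_mixup_step(model, x, lambd, perm):
    # Input mixup: mix the raw inputs first, so a single forward pass suffices.
    mixed_x = lambd * x + (1 - lambd) * x[perm]
    return model(mixed_x)

def naive_manifold_mixup_step(encoder, head, x, lambd, perm):
    # Manifold/output mixup: the mixing happens on intermediate activations,
    # so the encoder runs on both the batch and its paired batch (two passes).
    h1, h2 = encoder(x), encoder(x[perm])
    out_a = head(lambd * h1 + (1 - lambd) * h2)
    # One plausible reading of the "two outputs per input pair" flag:
    # also keep the complementary mix, so each pair yields two training examples.
    out_b = head((1 - lambd) * h1 + lambd * h2)
    return out_a, out_b

encoder, head = nn.Linear(10, 16), nn.Linear(16, 2)
x = torch.randn(8, 10)
perm = torch.randperm(x.size(0))
lambd = torch.distributions.Beta(0.4, 0.4).sample()
out_a, out_b = naive_manifold_mixup_step(encoder, head, x, lambd, perm)
```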
As far as I understand (which is not necessarily correct), you could apply this approach from the docs to the intermediate activations as well?
You can get the activations for image 1 from the batch and the activations for image 2 from the same batch, shuffled (the shuffling would be a random permutation across the batch dimension; see the corresponding lines of code in the fastai v1 codebase). A minimal sketch of this is at the end of this post.
PS: I looked into the paper to verify my claim and to check if this makes sense at all, and they also mention it under “2. Manifold Mixup”.
That said, the implementation is already super as it is, but with this tweak it could be optimized further.
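Here is a minimal sketch of that single-pass idea (mixing each sample’s hidden activations with a shuffled copy of the same batch); the module names and loss handling are just assumptions for illustration:

```python
import torch
import torch.nn as nn

def mix_with_shuffled_batch(h, lambd):
    # Instead of a second forward pass, pair each sample with another sample
    # from the same batch via a random permutation of the batch dimension.
    perm = torch.randperm(h.size(0))
    return lambd * h + (1 - lambd) * h[perm], perm

encoder = nn.Linear(10, 16)
head = nn.Linear(16, 2)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
lambd = torch.distributions.Beta(0.4, 0.4).sample()

h = encoder(x)                            # single forward pass
mixed_h, perm = mix_with_shuffled_batch(h, lambd)
out = head(mixed_h)

# The targets are mixed with the same permutation and coefficient.
loss = lambd * criterion(out, y) + (1 - lambd) * criterion(out, y[perm])
```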
I believe that would work and get everything (performance/memory) to the same level as input mixup. You can expect an update to the project within a few hours… (done!)