Mixup data augmentation

sure … is there a process to follow for this? I am totally new to the PR process.

Did you ever submit this?

Note that there is a new variant of mixup out there: Manifold Mixup:

They do the mixup step in a random inner layer, which leads to increased benefits and would make it easy to apply the concept to other domains such as NLP (using the last layer).
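For illustration, here is a rough PyTorch sketch of that idea (my own toy code, not the paper's reference implementation), assuming the network can be viewed as an nn.Sequential of layers:

import random
import torch
import torch.nn as nn

def manifold_mixup_forward(layers: nn.Sequential, x1, x2, alpha=0.4):
    # draw a mixing coefficient and a random layer for this batch
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    k = random.randrange(len(layers))
    h1, h2 = x1, x2
    for layer in layers[:k]:        # partial forward passes up to the chosen layer
        h1, h2 = layer(h1), layer(h2)
    h = lam * h1 + (1 - lam) * h2   # mix the intermediate activations
    for layer in layers[k:]:        # finish the forward pass on the mixture
        h = layer(h)
    return h, lam                   # lam is reused to mix the targets in the loss

When k is 0 this degenerates to plain input mixup, so input mixup is just a special case of the above.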


This looks incredible as it addresses the biggest DL concern of improving generalization. Thanks for posting.
Now we just need a fastai implementation to put it to use!

I feel like it would be doable to recycle a lot of the code of the existing mixup implementation. I might try to implement a prototype this weekend if I have the time… (but feel free to beat me to it :slight_smile: )


Here is my fastai V1 implementation and a small demo notebook.

You can expect manifold mixup and output mixup to be slower and to consume a bit more memory (due to needing two forward passes) (not anymore, thanks to @MicPie), but they let you use a larger learning rate and might be worth it overall.

On the tiny benchmark, manifold mixup is not noticeably better than input mixup (but the author says that the benefits appear after a large number of epochs); however, I observed a nice speed-up in convergence with output mixup. Now we need to validate that on a larger dataset.

I will test it on a private dataset this week but I would be happy to get an outside benchmark comparing no mixup, input mixup, manifold mixup and output mixup (@LessW2020 ?).


Have a look at the docs for how traditional mixup was implemented in the fastai library: https://docs.fast.ai/callbacks.mixup.html#Mixup-implementation-in-the-library

The tricks are nicely outlined there, and I guess they should be applicable to manifold mixup.

I already did and took some ideas :wink:

Input mixup does not need two forward passes, as you can mix the inputs and then do a single pass, but here the mixing is done on intermediate results.

However, to compensate for that downside, I added a flag to use both passes fully and get two outputs per input pair (set to True by default): epochs are still slower, but they are twice as informative.

(I believe I should be able to reduce memory usage but that’s about it)

As far as I understand (which is not necessarily correct), you could also apply this approach from the docs to the intermediate activations?

You can get the activations for image 1 from the batch and the activations for image 2 from the same batch, shuffled (the shuffling by random permutation would be across the batch dimension; see the corresponding lines of code in the fastai v1 codebase).
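If it helps, here is a rough sketch of that shuffled-batch trick (illustrative only; the names are mine, not the library's):

import torch

def mixup_batch(x, y, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    shuffle = torch.randperm(x.size(0), device=x.device)  # random permutation across the batch dimension
    mixed_x = lam * x + (1 - lam) * x[shuffle]            # "image 2" is just the same batch, shuffled
    # the loss then becomes lam*criterion(output, y) + (1-lam)*criterion(output, y[shuffle])
    return mixed_x, y, y[shuffle], lam

The same permutation idea should carry over to intermediate activations: shuffle the activations instead of the inputs.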

PS: I looked into the paper to verify my claim and to check if this makes sense at all, and they also mention it under “2. Manifold Mixup”:

As long as the implementation works it is great as it is, but with this tweak it could be optimized further. :slight_smile:


I am already using a single batch and a shuffled version.

The problem is that input mixup can be summarised as:

input = a*input1 + (1-a)*input2
output = f(g(input)) // single forward pass

While here we are doing:

intermediate1 = g(input1) // first partial forward pass
intermediate2 = g(input2) // second partial forward pass
intermediate = a*intermediate1 + (1-a)*intermediate2
output = f(intermediate) // end of the forward pass

Thus we pay for roughly two forward passes (which is not twice as slow, as we are still loading a single batch and doing a single backward pass).

I compensate by (optionally) producing two outputs:

intermediate1 = g(input1)
intermediate2 = g(input2)
intermediate12 = a*intermediate1 + (1-a)*intermediate2
intermediate21 = a*intermediate2 + (1-a)*intermediate1
output1 = f(intermediate12)
output2 = f(intermediate21)

Which is not any slower but gives us more bang for our buck.
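To make the "twice as informative" part concrete, here is how the two outputs could feed the loss (a simplified sketch, not the exact code; crit, target1 and target2 are my own names for the criterion and the two sets of targets):

intermediate1 = g(input1)
intermediate2 = g(input2)
out1 = f(a*intermediate1 + (1-a)*intermediate2)
out2 = f(a*intermediate2 + (1-a)*intermediate1)
loss1 = a*crit(out1, target1) + (1-a)*crit(out1, target2)
loss2 = a*crit(out2, target2) + (1-a)*crit(out2, target1)
loss = (loss1 + loss2) / 2    # one plausible way to combine the two signals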

I think I finally realized what you mean :partying_face:. Doing something like this:

intermediate1 = g(input)
intermediate2 = shuffle[intermediate1]
intermediate = a*intermediate1 + (1-a)*intermediate2
output = f(intermediate)

I believe that would work and get everything (performance/memory) to the same level as input mixup. You can expect an update to the project within a few hours… (done!)
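For reference, a minimal PyTorch sketch of that single-pass version using a forward hook on one internal module (the class and attribute names are mine, not the actual callback's API):

import torch
import torch.nn as nn

class IntermediateMixup:
    "Mix the output of `module` with a shuffled copy of itself during the forward pass."
    def __init__(self, module: nn.Module, alpha: float = 0.4):
        self.alpha = alpha
        self.lam = None
        self.shuffle = None   # stored so the training loop can mix the targets the same way
        self.handle = module.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        self.lam = torch.distributions.Beta(self.alpha, self.alpha).sample().item()
        self.shuffle = torch.randperm(output.size(0), device=output.device)
        # returning a value from a forward hook replaces the module's output
        return self.lam * output + (1 - self.lam) * output[self.shuffle]

    def remove(self):
        self.handle.remove()

The stored lam and shuffle then have to be reused to mix the targets (or the losses) for that batch, exactly as with input mixup.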


I was also not sure and wanted to ask the same thing, i.e., how you get intermediate2.
Sometimes it is tricky to discuss over text messages. :smiley:

I am looking forward to your implementation!

The repository has been updated and tested: performance is now on par with input mixup :smiley:

(plus the code is shorter and I was able to make it more similar to fastai’s input mixup implementation)


Great work @nestorDemeure!

Yes, I would be happy to set up and test on both ImageWoof/Nette and a medical dataset I am working with for my current work.
I can hopefully get this done tomorrow and will update here.

Very excited to see this is available, and I also hope we can get it running in FA2 in the future, but let’s make sure it proves out with v1 in terms of results :).


If the method proves worthwhile I could probably work on a fastai V2 port next weekend (I have yet to install V2 as I would prefer to wait for the official release date).

I am personally interested in both the possibility of easily using mixup on arbitrary inputs (which is there even if it does no better than input mixup) and in improved calibration of the predictions (which I will measure on a personal dataset in the next few days).


Awesome! For what it’s worth - I switched my work from V1 to V2 last week and V2 has been quite stable…
I’m just tweaking a few augmentations (brightness, etc.) and need to get BatchLossFilter (from @oguiza’s outstanding work) into v2, but I’ll be running v2 going forward.

So, at least in my experience, v2 is looking pretty solid in terms of vision work at this point.

Great work @nestorDemeure - I was able to test both versions of manifold mixup today, and in every case (all three datasets) output_manifold produced the best validation loss.
For my own work, it smashed the best validation loss I’ve had to date by a large margin (output_manifold + standard augmentation).

Here are the standard benchmarking results (the effect was not as big here as I only used flip + the mixup versions, but in each case it had the best results):

The internal manifold mixup was consistently the worst, btw, so I’d say there is no need to support that going forward.

For my own dataset I am using B4 and B5 EfficientNets. It was hooking the last Swish layer, so I just wanted to confirm that is expected (I was expecting the last conv layer, and got the error about a repeated forward pass), but as noted, it clearly did work.

If you are able to update it for FAi v2 I think it would be a very worthwhile addition!


Is internal just letting manifold mixup draw a layer at random? (as opposed to you forcing a single internal layer for the whole run)

In which case, great! It correlates with what I observed while implementing mixup (that manifold mixup was more effective when, by luck, it sampled from the later layers), which pushed me to add output_mixup.

I will try to implement it in V2 this weekend (the UI will probably be a bit simplified: using the last non-loss / non-softmax layer unless the user explicitly passes a target module).
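Concretely, the default could look something like this (a rough sketch of the heuristic just described, not the actual code):

import torch.nn as nn

EXCLUDED = (nn.Softmax, nn.LogSoftmax)   # softmax / loss-like layers to skip

def pick_mixup_module(model: nn.Module, target: nn.Module = None) -> nn.Module:
    "Return the user-supplied module, or else the last leaf module that is not softmax-like."
    if target is not None:
        return target                     # an explicitly passed target module wins
    leaves = [m for m in model.modules() if len(list(m.children())) == 0]
    for module in reversed(leaves):       # walk backwards from the output
        if not isinstance(module, EXCLUDED):
            return module
    raise ValueError("no suitable module found")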


Great decision @nestorDemeure - it’s working super well for my own dataset and proved out on the benchmark woof/nette.

If you have a chance to write this up for FastAI2 this weekend I would greatly appreciate it, as I can put it to use right away for testing on woof/nette and my private work dataset.
Thanks for your development work on making this!

btw - if they weren’t mixing at the last layer in the paper, I think you should write your own paper on it? :slight_smile: