Unsupervised data augmentation (UDA) from Google...better than mixup

This new paper from Google Brain explains unsupervised data augmentation. Their results appear to show it outperforms pretty much every other data augmentation technique, including mixup, in cases where the labelled dataset is small…

For example, on the IMDb text classification dataset, with only 20 labeled examples, UDA outperforms the state-of-the-art model trained on 25,000 labeled examples. On standard semi-supervised learning benchmarks, CIFAR-10 with 4,000 examples and SVHN with 1,000 examples, UDA outperforms all previous approaches and reduces the error rates of state-of-the-art methods by more than 30%: going from 7.66% to 5.27% and from 3.53% to 2.46% respectively.


I also shared earlier here :slight_smile:

They actually compared UDA with an alternative version of mixup called MixMixup, which I couldn’t find any info on – did anyone else find it? UPDATE: here is the MixMixup paper for those interested: https://openreview.net/pdf?id=r1gp1jRN_4

Overall, I don’t think this paper gave enough credit to mixup (I now use mixup by default in all of my projects – the results are that good), and in fact I think the two could potentially be used together for even better performance. Anyway, UDA is still exciting if the claims are accurate.


Hmmm, there’s no code available for this yet, is there? I’m interested in trying it out, but I get the sense it’s a little more finicky than the techniques we looked at in class (e.g. mixup, label smoothing). For example, using UDA on ImageNet required adding an entropy term to the loss function, modifying softmax, and masking out unlabelled samples when the model wasn’t confident about its predictions on them.
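For intuition, here is a rough sketch of that confidence-masking idea as I understand it: compute a per-example KL consistency term between predictions on an unlabelled input and its augmented copy, then zero out examples the model isn't yet confident about. The function name, threshold value, and epsilon are my own, not the paper's.

```python
import torch
import torch.nn.functional as F

def masked_consistency_loss(logits_orig, logits_aug, threshold=0.8):
    """KL consistency loss over unlabelled samples, keeping only those where
    the model is already confident on the original (unaugmented) input."""
    probs_orig = F.softmax(logits_orig.detach(), dim=-1)  # fixed target, no grad
    log_probs_aug = F.log_softmax(logits_aug, dim=-1)
    # per-example KL(orig || aug)
    kl = (probs_orig * (torch.log(probs_orig + 1e-8) - log_probs_aug)).sum(dim=-1)
    # drop samples whose max predicted probability is below the threshold
    mask = (probs_orig.max(dim=-1).values >= threshold).float()
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)
```

Only the confident examples contribute, so early in training (when nothing is confident) the consistency signal is effectively off.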

Does anyone know how to select training examples from the labelled and unlabelled samples? Do we shuffle the datasets together, but keep track of labelled vs. unlabelled? Or can we train exclusively on labelled examples at the start of training and then process the unlabelled training examples afterward?

1 Like

Not that I am aware of, no. I personally found the MixMatch paper (also shared here https://forums.fast.ai/t/good-readings-2019/39367/54?u=jamesrequa) to be much more interesting – it also involves using an unlabeled dataset but is a lot more straightforward to implement (plus it uses mixup :slight_smile: ). As for the UDA paper, I actually found the TSA (Training Signal Annealing) technique to be the most interesting part.
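For anyone curious, here's my rough sketch of TSA as I read it from the paper: the supervised cross-entropy is masked so that labeled examples the model already predicts above an annealed threshold stop contributing, which keeps the small labeled set from being over-fit early. The function names and schedule constants below are my own guesses, not the reference implementation.

```python
import math
import torch
import torch.nn.functional as F

def tsa_threshold(step, total_steps, num_classes, schedule="linear"):
    """TSA threshold eta_t: starts near 1/K and anneals toward 1.0."""
    t = step / total_steps
    if schedule == "linear":
        alpha = t
    elif schedule == "log":
        alpha = 1 - math.exp(-t * 5)
    else:  # "exp"
        alpha = math.exp((t - 1) * 5)
    return alpha * (1 - 1 / num_classes) + 1 / num_classes

def tsa_cross_entropy(logits, targets, step, total_steps):
    """Supervised loss that drops examples the model already predicts
    with probability above the current TSA threshold."""
    num_classes = logits.size(-1)
    eta = tsa_threshold(step, total_steps, num_classes)
    probs = F.softmax(logits, dim=-1)
    correct_prob = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    mask = (correct_prob < eta).float().detach()
    losses = F.cross_entropy(logits, targets, reduction="none")
    return (losses * mask).sum() / mask.sum().clamp(min=1.0)
```

At the end of training the threshold reaches 1.0, so the loss reduces to plain cross-entropy.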

I think it makes sense to have two separate data loaders, one for the labeled dataset and one for the unlabeled, so you can grab a batch from each separately. That way the losses for each can be processed separately/calculated differently before being combined. I am not sure yet on the specifics of how UDA should be done, but at least this is the case with MixMatch.
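Something like this (purely illustrative, with random tensors standing in for real data) is what I had in mind for the two-loader setup – cycle the smaller labelled loader alongside the unlabelled one so every step sees one batch of each:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical datasets: a small labelled set and a larger unlabelled set.
labelled_ds = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
unlabelled_ds = TensorDataset(torch.randn(512, 10))

labelled_dl = DataLoader(labelled_ds, batch_size=8, shuffle=True)
unlabelled_dl = DataLoader(unlabelled_ds, batch_size=32, shuffle=True)

def paired_batches(labelled_dl, unlabelled_dl):
    """Yield one labelled and one unlabelled batch per step, restarting the
    (shorter) labelled loader whenever it runs out."""
    labelled_iter = iter(labelled_dl)
    for (xu,) in unlabelled_dl:
        try:
            xl, yl = next(labelled_iter)
        except StopIteration:
            labelled_iter = iter(labelled_dl)
            xl, yl = next(labelled_iter)
        yield (xl, yl), xu
```

An epoch is then driven by the unlabelled loader, with the labelled set recycled as many times as needed.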

I don’t usually like reddit, but this Q&A with the author of the UDA paper was helpful in addressing some of my questions about their approach.


This is fantastic and potentially revolutionary. The IMDb example, where 20 labeled samples outperform previous results, says a lot. Let’s see when the code is released, but the approach seems really interesting: combining the classic cross-entropy loss on the labelled samples with an additional consistency loss applied between unlabelled samples and augmentations of those unlabelled samples. Fascinating – hopefully the days of massive labelling are gradually coming to an end.
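If I'm reading the paper right, the overall objective looks roughly like this sketch. The weighting `lam` is a hypothetical knob, and this omits details like confidence masking and TSA:

```python
import torch
import torch.nn.functional as F

def uda_loss(model, x_labelled, y, x_unlabelled, x_unlabelled_aug, lam=1.0):
    """Sketch of the UDA objective: cross-entropy on the labelled batch, plus
    a KL consistency term pulling predictions on augmented unlabelled inputs
    toward the predictions on the originals."""
    sup_loss = F.cross_entropy(model(x_labelled), y)
    with torch.no_grad():  # the unaugmented prediction is treated as a fixed target
        target = F.softmax(model(x_unlabelled), dim=-1)
    log_pred_aug = F.log_softmax(model(x_unlabelled_aug), dim=-1)
    consistency = F.kl_div(log_pred_aug, target, reduction="batchmean")
    return sup_loss + lam * consistency
```

Only the augmented branch receives gradient from the consistency term, which is what makes the unlabelled data useful without ever touching labels.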

Looks great. Is there a pytorch implementation (that works) available?

Unfortunately none so far. The co-author on Reddit says code is coming soon, but nothing yet. Also, since they are with Google it will likely be TF, though I’m sure we can port it over to PyTorch in a reasonably fast timeframe.

1 Like

I am working on coding up the back-translation stuff. It seems straightforward. That also seems to be the most relevant part for the text classifier, but I will find out soon enough!
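In case it helps anyone, the augmentation step itself reduces to a round trip through a pivot language. The `translate_*` callables below are placeholders for whatever MT system you plug in (ideally with sampling enabled so the paraphrases vary):

```python
def augment_unlabelled(sentences, translate_en_fr, translate_fr_en, n_aug=1):
    """Back-translation augmentation: translate each sentence to a pivot
    language and back, yielding n_aug paraphrases per input sentence.
    The two translate_* arguments are placeholder callables, str -> str."""
    augmented = []
    for s in sentences:
        for _ in range(n_aug):
            augmented.append(translate_fr_en(translate_en_fr(s)))
    return augmented
```

Because the augmentations don't depend on model state, this can all be precomputed once before training rather than done on the fly.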


Is there any update on this thread? Did somebody try to use it and see improvements?

Actually there is – they just released an update to the paper about two days ago! It talks about a new annealing strategy, etc. They basically monitor validation performance per category and then offload those that hit a threshold in order to avoid overfitting, while continuing to train on the other categories.
I’m going to re-read it next week and then write an article summarizing it. The code is also out now; I don’t recall it being available earlier.
Here’s the link to the paper (the updated version, I mean):


I’ve been working on a fastai implementation, and haven’t been getting great results. I also tried this pytorch implementation: https://github.com/ildoonet/unsupervised-data-augmentation

Running that code on CIFAR10 produced results much closer to the paper. I then modified that code to use fastai augmentation instead of AutoAugment, and found that the results suffered considerably, and were similar to those I had been getting with my pure fastai code.

The conclusion I draw from this is that AutoAugment is considerably more important to UDA than I had thought. I had assumed that since fastai’s augmentation system is robust and operates similarly to the policies produced by AutoAugment, there would only be a small performance hit from not using AA. But that appears not to be the case: there is a 15%+ difference in error rates.

This leaves me with implementing AutoAugment in fastai. Checking the appendix of the AutoAugment paper, the CIFAR policy has a number of transforms that aren’t implemented by fastai’s transform library, such as sharpness, equalize, and solarize. I imagine implementing these transforms wouldn’t be too difficult individually, but getting the whole AutoAugment algorithm to operate would definitely be a lot of work.
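For what it's worth, the individual transforms really are simple. Here are plain NumPy sketches of the three I mentioned, for single-channel uint8 images (my own implementations, not the AutoAugment reference code):

```python
import numpy as np

def solarize(img, threshold=128):
    """Invert pixels at or above the threshold."""
    return np.where(img >= threshold, 255 - img, img).astype(np.uint8)

def equalize(img):
    """Histogram-equalize a single-channel uint8 image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    denom = max(cdf[-1] - cdf_min, 1)
    # lookup table mapping pixel values so the CDF becomes ~uniform
    lut = ((cdf - cdf_min) * 255 // denom).clip(0, 255).astype(np.uint8)
    return lut[img]

def box_blur(img):
    """3x3 mean filter with edge padding."""
    p = np.pad(img.astype(np.float32), 1, mode="edge")
    return sum(p[1 + dy:p.shape[0] - 1 + dy, 1 + dx:p.shape[1] - 1 + dx]
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0

def sharpness(img, factor):
    """Blend between a blurred copy (factor=0) and the original (factor=1);
    factor > 1 sharpens."""
    blurred = box_blur(img)
    out = blurred + factor * (img.astype(np.float32) - blurred)
    return out.clip(0, 255).round().astype(np.uint8)
```

The hard part, as noted above, is wiring these into the full policy-sampling machinery rather than the transforms themselves.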


Very interesting work and conclusions. I might recommend using the albumentations library; it has many different types of augmentations. I haven’t seen anybody use it with fastai, but I am sure it is probably not terribly difficult, as it is easily used with PyTorch. That way all you have to do is get albumentations working with fastai and use the desired augmentations, rather than implementing each of the transforms from scratch.
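The integration pattern is basically just a Dataset that applies any numpy-in/numpy-out callable. The stand-in `flip` below could be swapped for an albumentations pipeline (built with `A.Compose([...])` and called as `pipeline(image=img)["image"]`); the class and names here are my own sketch:

```python
import numpy as np
from torch.utils.data import Dataset

class AugmentedImageDataset(Dataset):
    """Wraps images and labels plus an optional augmentation callable that
    takes and returns a numpy array."""
    def __init__(self, images, labels, aug=None):
        self.images, self.labels, self.aug = images, labels, aug

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        img = self.images[i]
        if self.aug is not None:
            img = self.aug(img)
        return img, self.labels[i]

# Stand-in augmentation; an albumentations pipeline would slot in as
#   pipeline = A.Compose([A.HorizontalFlip(p=0.5)])
#   aug = lambda img: pipeline(image=img)["image"]
flip = lambda img: img[:, ::-1].copy()
```

From there a normal PyTorch `DataLoader` over this dataset works unchanged.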


I have the same issue: I implemented the paper but it did not show comparable performance… I guess you are right – maybe it’s because of the transformations. I still don’t understand why this specific set of transformations would have such a huge effect. In theory, any transformation applied to an image should produce a meaningful KL divergence between the predictions on the transformed and original images.

Maybe the AutoAugment transforms are very smooth compared to the original image, and as a result their KL divergence is small?

I am not sure if my explanation makes sense… please feel free to correct me.
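A toy numeric check of the quantity we're discussing, with made-up predictive distributions: a mild perturbation barely moves the distribution (tiny KL), while a destructive transform that flips the prediction produces a much larger divergence:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete probability distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Prediction on an image vs. a mildly augmented copy: barely moves.
kl_mild = kl_divergence([0.7, 0.2, 0.1], [0.65, 0.25, 0.1])
# A transform harsh enough to flip the prediction: much larger signal.
kl_harsh = kl_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```

So if the "smoothness" explanation is right, a mild transform would contribute almost no consistency gradient, which could matter a lot for which augmentation policy works.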

1 Like

Also really interested in this, I’ve been trying Virtual Adversarial Training (VAT) for tabular data with only limited success. In the UDA paper (which I thought was a great read and made a lot of sense) they claim that the main benefit over VAT is the quality of the augmentations.

I’m trying to implement UDA in PyTorch at the moment. I would be interested to know if anyone has a PyTorch version that replicates the results in the paper. Also, does anyone have any thoughts on augmentation for tabular data? IIRC they only generate one augmentation per unlabeled example, which makes things simpler because it can all be done upfront rather than on the fly.

1 Like

There’s a technique that was pioneered in the Porto Seguro Kaggle competition called “swap noise” that randomly swaps each element in a row with the corresponding element from another row.

You can read more about it in this article
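For reference, swap noise is only a few lines of NumPy. This is my own sketch of the technique (the default swap probability is a guess, not from the competition code):

```python
import numpy as np

def swap_noise(X, p=0.15, rng=None):
    """Swap-noise augmentation for tabular data: each cell is replaced, with
    probability p, by the value from the same column in a random other row."""
    if rng is None:
        rng = np.random.default_rng()
    X = np.asarray(X)
    n_rows, n_cols = X.shape
    mask = rng.random(X.shape) < p
    # for each cell, pick a random source row to take the column value from
    source_rows = rng.integers(0, n_rows, size=X.shape)
    swapped = X[source_rows, np.arange(n_cols)]
    return np.where(mask, swapped, X)
```

Since every replacement value comes from the same column, the augmented rows stay on the per-column marginal distribution of the real data, which is the whole trick.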


Cheers, good find!

1 Like