This new paper from Google Brain explains Unsupervised Data Augmentation (UDA). Their results appear to show it outperforms pretty much every other data augmentation technique, including mixup, in cases where the amount of labelled data is small…
For example, on the IMDb text classification dataset, with only 20 labeled examples, UDA outperforms the state-of-the-art model trained on 25,000 labeled examples. On standard semi-supervised learning benchmarks, CIFAR-10 with 4,000 examples and SVHN with 1,000 examples, UDA outperforms all previous approaches and reduces more than 30% of the error rates of state-of-the-art methods: going from 7.66% to 5.27% and from 3.53% to 2.46% respectively.
They actually compared UDA with an alternative method to mixup called mixmixup, which I couldn’t find any info on – did anyone else find it? UPDATE: here is the mixmixup paper for those interested: https://openreview.net/pdf?id=r1gp1jRN_4
Overall, I don’t think this paper gave enough credit to mixup (I now use mixup by default in all of my projects – the results are that good), and in fact I think the two could potentially be used together for even better performance. Still, UDA is exciting nonetheless, if the claims are accurate.
Hmmm, there’s no code available for this yet, is there? I’m interested in trying it out, but I get the sense it’s a little more finicky than the techniques we looked at in class (e.g. mixup, label smoothing). For example, using UDA on ImageNet required adding an entropy term to the loss function, modifying the softmax, and masking out unlabelled samples when the model wasn’t confident about its predictions on them.
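For anyone trying to picture the confidence-masking idea, here is a minimal framework-agnostic numpy sketch. The function names, the 0.6 threshold, and the sharpening temperature are my own illustrative choices, not values from the paper:

```python
import numpy as np

def confidence_mask(probs, threshold=0.6):
    """Keep only the unlabelled examples the model is confident about.

    probs: (batch, classes) predicted distribution on the original
    (un-augmented) unlabelled examples. Returns a boolean mask that
    zeroes out the consistency loss on unconfident examples.
    """
    return probs.max(axis=1) >= threshold

def sharpen(probs, temperature=0.4):
    """Sharpen a target distribution with a softmax temperature < 1."""
    logits = np.log(probs + 1e-12) / temperature
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

probs = np.array([[0.9, 0.05, 0.05],   # confident -> kept
                  [0.4, 0.35, 0.25]])  # unconfident -> masked out
mask = confidence_mask(probs, threshold=0.6)
```

The idea is that the consistency loss is only applied where `mask` is `True`, so the model isn’t pushed to be consistent on examples it knows nothing about yet.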
Does anyone know how to select training examples from the labelled and unlabelled samples? Do we shuffle the datasets together but keep track of which examples are labelled vs. unlabelled? Or can we train exclusively on labelled examples at the start of training and then process the unlabelled examples afterward?
Not that I am aware of, no. I personally found the MixMatch paper (also shared here https://forums.fast.ai/t/good-readings-2019/39367/54?u=jamesrequa) to be much more interesting – it also involves using an unlabelled dataset but is a lot more straightforward to implement (plus it uses mixup). As for the UDA paper, I actually found the TSA (Training Signal Annealing) technique the most interesting part.
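To make TSA concrete: as I read the paper, a threshold grows from 1/K to 1 over training, and labelled examples the model already predicts above that threshold are dropped from the supervised loss so it can’t overfit the small labelled set. A rough sketch (the schedule constants and function names are my guesses at the shape, not the paper’s exact formulas):

```python
import math

def tsa_threshold(step, total_steps, num_classes, schedule="linear"):
    """Training Signal Annealing threshold: grows from 1/K to 1
    over training, on a linear / log / exp schedule."""
    t = step / total_steps
    if schedule == "linear":
        alpha = t
    elif schedule == "log":
        alpha = 1 - math.exp(-t * 5)   # fast early growth
    elif schedule == "exp":
        alpha = math.exp((t - 1) * 5)  # slow early growth
    else:
        raise ValueError(schedule)
    return alpha * (1 - 1 / num_classes) + 1 / num_classes

def tsa_mask(correct_class_probs, threshold):
    """Keep only labelled examples the model does NOT already
    predict above the annealed threshold."""
    return [p <= threshold for p in correct_class_probs]
```

Early in training almost no labelled example clears the (low-confidence) model’s predictions, so nearly everything contributes; late in training only still-hard examples do.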
I think it makes sense to have two separate data loaders, one for the labelled dataset and one for the unlabelled, so you can grab a batch from each separately. That way the losses for each can be calculated differently and then combined. I am not sure yet on the specifics of how UDA should be done, but at least this is the case with MixMatch.
This is fantastic and potentially revolutionary – the IMDb example outperforming previous results with only 20 labelled samples says a lot. Let’s see when the code is released, but the approach seems really interesting: combining the classic cross-entropy loss on the labelled samples with an additional consistency loss applied to the unlabelled samples and augmentations of those unlabelled samples. Fascinating. Hopefully the days of massive labelling efforts are gradually coming to an end.
Unfortunately none so far. A co-author on Reddit says code is coming soon, but nothing yet. Also, since they are with Google it will likely be TF, though I’m sure we can port it over to PyTorch in a reasonably fast timeframe.
Actually there is – they just released an update to the paper about two days ago! It talks about a new annealing strategy, etc. They basically monitor the validation performance per category and then offload the categories that hit a threshold in order to avoid overfitting, while continuing to train on the others.
I’m going to re-read it next week and then write an article summarizing it. The code is also out now – I don’t recall it being available earlier.
Here’s the link to the updated version of the paper:
Running that code on CIFAR-10 produced results much closer to the paper’s. I then modified the code to use fastai augmentation instead of AutoAugment, and found that the results suffered considerably, ending up similar to those I had been getting with my pure fastai code.
The conclusion I draw from this is that AutoAugment is considerably more important to UDA than I had thought. I had assumed that since fastai’s augmentation system is robust and operates similarly to the policies produced by AutoAugment, there would only be a small performance hit from not using AA. But that appears not to be the case: there is a 15%+ difference in error rates.
This leaves me with implementing AutoAugment in fastai. Checking the appendix of the AutoAugment paper, the CIFAR policy has a number of transforms that aren’t implemented by fastai’s transform library such as sharpness, equalize, and solarize. I imagine implementing these transforms wouldn’t be too difficult individually, but getting the whole AutoAugment algorithm to operate would definitely be a lot of work.
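For reference, two of those missing transforms are simple enough to sketch in numpy. These are my own minimal implementations of the standard solarize/equalize operations (mirroring what Pillow’s `ImageOps` does), not fastai or AutoAugment code:

```python
import numpy as np

def solarize(img, threshold=128):
    """Invert all pixels at or above the threshold (uint8 image)."""
    return np.where(img >= threshold, 255 - img, img).astype(np.uint8)

def equalize(img):
    """Histogram-equalize a single-channel uint8 image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    # build a lookup table that makes the CDF approximately linear
    cdf_min = cdf[np.nonzero(cdf)][0]
    lut = np.round((cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255)
    return lut.astype(np.uint8)[img]
```

Sharpness is a bit more involved (a blend between the image and a smoothed copy), but the individual ops really are small; as the post says, it’s wiring up the full AutoAugment policy machinery that would be the real work.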
Very interesting work and conclusions. I might recommend using the albumentations library. It has many different types of augmentations. I haven’t seen anybody using it with fastai, but I am sure it is probably not terribly difficult to do, as it is easily used with PyTorch. That way all you have to do is get albumentations working with fastai and use the desired augmentations, rather than implementing each transform from scratch.
I have the same issue: I implemented the paper but it did not show comparable performance… I guess you are right – maybe it’s because of the transformations. I still don’t understand why these specific transformations would have such a huge effect. In theory, any transformation applied to an image should result in a reasonable KL divergence between the prediction on the transformed image and the prediction on the original.
Maybe the AutoAugment transforms are very smooth compared to the original image, and as a result their KL divergence is small?
I am not sure if my explanation makes sense… please feel free to correct me.
Also really interested in this, I’ve been trying Virtual Adversarial Training (VAT) for tabular data with only limited success. In the UDA paper (which I thought was a great read and made a lot of sense) they claim that the main benefit over VAT is the quality of the augmentations.
I’m trying to implement UDA in PyTorch at the moment. I’d be interested to know if anyone has a PyTorch version that replicates the results in the paper, and also whether anyone has thoughts on augmentation for tabular data. IIRC they only generate one augmentation per unlabelled example, which makes things simpler because it can all be done upfront rather than on the fly.
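The “one augmentation per example, done upfront” idea is easy to sketch. Below is my own illustration of precomputing the pairs; the `jitter` augmentation for tabular rows is a toy example of mine, not something from the paper:

```python
import random

def precompute_pairs(unlabeled, augment, seed=0):
    """Generate one fixed augmentation per unlabelled example up front,
    instead of augmenting on the fly during training."""
    rng = random.Random(seed)  # fixed seed -> reproducible pairs
    return [(x, augment(x, rng)) for x in unlabeled]

def jitter(row, rng, scale=0.01):
    """Toy tabular 'augmentation': add small noise to each numeric field."""
    return [v + rng.uniform(-scale, scale) for v in row]
```

Training then just iterates over the precomputed `(original, augmented)` pairs for the consistency loss, which also keeps the data pipeline deterministic across runs.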