This caught my eye the other day. They use wavelet pooling/unpooling for downsampling and upsampling to create more photorealistic style transfer images.
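For anyone wondering what wavelet pooling/unpooling looks like in practice, here is a minimal sketch. It assumes Haar wavelets on 2x2 blocks (the post above does not say which wavelet is used), and the function names `haar_pool` and `haar_unpool` are made up for illustration. It works on plain nested lists rather than tensors.

```python
def haar_pool(image):
    """Split an even-sized 2D list into four half-resolution Haar sub-bands.

    Returns (ll, lh, hl, hh): one low-pass band and three detail bands.
    The band naming is a convention chosen for this sketch.
    """
    rows, cols = len(image), len(image[0])
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            p00, p01 = image[r][c], image[r][c + 1]
            p10, p11 = image[r + 1][c], image[r + 1][c + 1]
            ll_row.append((p00 + p01 + p10 + p11) / 2)
            lh_row.append((p00 - p01 + p10 - p11) / 2)
            hl_row.append((p00 + p01 - p10 - p11) / 2)
            hh_row.append((p00 - p01 - p10 + p11) / 2)
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def haar_unpool(ll, lh, hl, hh):
    """Invert haar_pool: rebuild the full-resolution image from the four bands."""
    rows, cols = len(ll), len(ll[0])
    image = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for r in range(rows):
        for c in range(cols):
            s, h, v, d = ll[r][c], lh[r][c], hl[r][c], hh[r][c]
            image[2 * r][2 * c] = (s + h + v + d) / 2
            image[2 * r][2 * c + 1] = (s - h + v - d) / 2
            image[2 * r + 1][2 * c] = (s + h - v - d) / 2
            image[2 * r + 1][2 * c + 1] = (s - h - v + d) / 2
    return image
```

Unlike max pooling, nothing is thrown away here: the three detail bands carry exactly what the low-pass band loses, so unpooling recovers the input.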
Generalised IoU: A Metric and a Loss for Bounding Box Regression. It addresses this problem: “It would be nice if IoU indicated if our new, better prediction was closer to the ground truth than the first prediction, even in cases of no intersection.”
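A minimal sketch of the idea for axis-aligned boxes, assuming the `(x1, y1, x2, y2)` corner format and non-degenerate boxes (the function name `giou` is mine). GIoU subtracts from IoU the fraction of the smallest enclosing box that is not covered by the union, so two non-overlapping predictions can still be ranked by how close they are.

```python
def giou(box_a, box_b):
    """Generalized IoU for two boxes given as (x1, y1, x2, y2), x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    return iou - (enclose - union) / enclose
```

With a ground truth of `(0, 0, 1, 1)`, plain IoU is 0 for both `(2, 0, 3, 1)` and `(4, 0, 5, 1)`, while GIoU is higher for the nearer box. A loss can then be taken as `1 - giou(...)`.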
From the abstract:
Authors propose to jointly capture the full structure of a neural network by parametrizing it with a single high-order tensor, the modes of which represent each of the architectural design parameters of the network (e.g. number of convolutional blocks, depth, number of stacks, input features, etc). This parametrization allows to regularize the whole network and drastically reduce the number of parameters.
They also show that their approach can achieve superior performance with low compression rates, and attain high compression rates with a negligible drop in accuracy, on both the challenging task of human pose estimation and semantic face segmentation
Very exciting new paper, “Unsupervised Data Augmentation”, just out today, if the claims are accurate. It works with all types of data, including text and images.
On the IMDb text classification dataset, with only 20 labeled examples, UDA outperforms the state-of-the-art model trained on 25,000 labeled examples. On standard semi-supervised learning benchmarks, CIFAR-10 with 4,000 examples and SVHN with 1,000 examples, UDA outperforms all previous approaches and reduces more than 30% of the error rates of state-of-the-art methods
Just want to add to this one that the concept of Training Signal Annealing (TSA) introduced here is also a really interesting idea. It effectively removes the gradient contribution of labeled examples that the model already classifies correctly with a probability above some threshold, while this threshold gradually increases from 1/num_classes to 1.
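Here is a rough sketch of how the TSA threshold and masking could look. The three schedules (linear, log, exp) and the scale factor 5 are as I remember them from the UDA paper, so treat them as assumptions, and the function names are made up. In every case the threshold rises from 1/num_classes towards 1 as training progresses.

```python
import math


def tsa_threshold(step, total_steps, num_classes, schedule="linear"):
    """Threshold on the correct-class probability at a given training step."""
    progress = step / total_steps
    if schedule == "linear":
        alpha = progress
    elif schedule == "log":
        alpha = 1 - math.exp(-progress * 5)  # rises quickly early on
    elif schedule == "exp":
        alpha = math.exp((progress - 1) * 5)  # rises mostly near the end
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return alpha * (1 - 1 / num_classes) + 1 / num_classes


def tsa_masked_loss(losses, correct_probs, threshold):
    """Average the supervised loss over examples the model is not yet confident on.

    losses:        per-example loss values
    correct_probs: predicted probability of the true class for each example
    """
    kept = [loss for loss, p in zip(losses, correct_probs) if p <= threshold]
    return sum(kept) / len(kept) if kept else 0.0
```

For example, `tsa_masked_loss([1.0, 2.0, 3.0], [0.2, 0.9, 0.5], threshold=0.6)` drops the second example (already at 0.9 confidence) and averages the other two.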
I found the experiments from Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask interesting:
The authors found that after training the LT networks, preserving the signs of the weights and using the supermask to keep only the weights that end up far from zero yields good performance (80 percent test accuracy on MNIST and 24 percent on CIFAR-10) without additional training.
Batch Normalization controls the change of the layers’ input distributions during training to reduce the so-called internal covariate shift. The popular belief is that its effectiveness stems from that. However, the authors demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm. Instead, they uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother.
Exciting new paper out this week which combines mixup with unlabeled data and the results are really impressive!
In this work, we unify the current dominant approaches for semi-supervised learning to produce a new algorithm, MixMatch, that works by guessing low-entropy labels for data-augmented unlabeled examples and mixing labeled and unlabeled data using MixUp. We show that MixMatch obtains state-of-the-art results by a large margin across many datasets and labeled data amounts. For example, on CIFAR-10 with 250 labels, we reduce error rate by a factor of 4 (from 38% to 11%)
My fav section is the ablation tests, where they compare MixMatch without MixUp (39.11 test error) vs. MixMatch with MixUp (11.80 test error) on CIFAR-10 with 250 labels and the rest unlabeled. This shows the power of MixUp, especially when working with limited data!
MixMatch: A Holistic Approach to Semi-Supervised Learning
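Two of the ingredients named in the abstract, guessing low-entropy labels and MixUp, can be sketched in a few lines. This is only an illustration on plain lists, not the authors' code: the function names are mine, and the defaults (temperature 0.5, alpha 0.75) are what I recall from the paper, so treat them as assumptions.

```python
import random


def sharpen(probs, temperature=0.5):
    """Lower the entropy of a guessed label distribution by raising it to 1/T."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def guess_label(predictions, temperature=0.5):
    """Average predictions over several augmentations of one unlabeled example, then sharpen."""
    num_classes = len(predictions[0])
    mean = [sum(p[k] for p in predictions) / len(predictions) for k in range(num_classes)]
    return sharpen(mean, temperature)


def mixmatch_mixup(x1, y1, x2, y2, alpha=0.75, rng=random):
    """MixUp variant that keeps the result closer to the first example."""
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1 - lam)  # ensures lam >= 0.5
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

Note that the mixed target `y` is a soft distribution rather than a class index, so the loss has to accept soft labels.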
Amazing results! Thanks for posting, @jamesrequa
Is a Python implementation of this technique available yet?
This fascinates me
Can we come up with a Fastai function that ‘prunes’ our final models? Maybe it goes through all the weights and chooses all over a certain threshold? This minimized version could then be used for inference in production on CPU (with much faster performance) and as a starting point for further training.
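As a starting point, here is a minimal sketch of magnitude pruning on a flat list of weights. It is not a fastai function; `magnitude_prune` is a hypothetical helper, and a real version would run per layer on tensors. It zeroes the smallest-magnitude fraction of weights and returns the mask so the same connections can stay pruned during further training.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Returns (pruned_weights, mask), where mask[i] is 1 for kept weights
    and 0 for pruned ones.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask
```

For example, `magnitude_prune([0.5, -0.01, 0.2, -0.9, 0.03], 0.4)` keeps the three largest-magnitude weights and returns the mask `[1, 0, 1, 1, 0]`.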
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the lottery ticket hypothesis: dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
Non-robust features alone (learned from a dataset whose labels look incorrect to humans) are sufficient for a classification task:
Adversarial Examples Are Not Bugs, They Are Features (https://arxiv.org/abs/1905.02175)
Not that I have seen, no, although the authors of the paper mentioned they would be releasing the code soon. I actually think it would be great practice for us to try implementing it ourselves based on the paper. I love the way the paper is written because, compared with other papers including the recent UDA paper, they give very clear and straightforward steps on implementation.
I haven’t looked at the code, but I noticed someone making an attempt at MixMatch in PyTorch here: https://github.com/gan3sh500/mixmatch-pytorch/blob/master/README.md
Best approach is probably to start with fastai’s mixup code and go from there?
It is common in recommendation systems that users both consume and produce information as they make strategic choices under uncertainty. While a social planner would balance “exploration” and “exploitation” using a multi-armed bandit algorithm, users’ incentives may tilt this balance in favor of exploitation. We consider Bayesian Exploration: a simple model in which the recommendation system (the “principal”) controls the information flow to the users (the “agents”) and strives to incentivize exploration via information asymmetry. A single round of this model is a version of a well-known “Bayesian Persuasion game” from [Kamenica and Gentzkow]. We allow heterogeneous users, relaxing a major assumption from prior work that users have the same preferences from one time step to another. The goal is now to learn the best personalized recommendations. One particular challenge is that it may be impossible to incentivize some of the user types to take some of the actions, no matter what the principal does or how much time she has. We consider several versions of the model, depending on whether and when the user types are reported to the principal, and design a near-optimal “recommendation policy” for each version. We also investigate how the model choice and the diversity of user types impact the set of actions that can possibly be “explored” by each type.
No free lunch with batch norm: the accelerated training properties and occasionally higher clean test accuracy come at the cost of robustness, both to additive noise and for adversarial perturbations.
Incredible paper. I made a DataLoader but I can’t figure out the cross_entropy part of the loss function since y is continuous. fastai mixup has a stack_y attribute which might be the better approach.
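On the continuous `y` problem: one option is to write cross entropy directly against a target distribution instead of a class index. A minimal sketch on plain lists follows (the names are mine; in PyTorch the equivalent would be minus the sum of the targets times `log_softmax` of the logits).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def soft_cross_entropy(logits, target_probs):
    """Cross entropy between predicted logits and a soft target distribution."""
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target_probs, log_probs))
```

This loss is linear in the targets, so for a target mixed from two one-hot labels, `lam * y1 + (1 - lam) * y2`, it equals `lam * CE(y1) + (1 - lam) * CE(y2)`. That is why keeping both labels plus the mixing weight and interpolating two ordinary cross-entropy losses gives the same result when the labels are hard. For MixMatch's guessed labels, which are already soft, the direct form above is needed.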
https://github.com/sshleifer/mixmatch is my code, if anybody is interested.
A video presentation on this is available here.
I find it fascinating that a network can be pruned so significantly and still produce the same results. It suggests that starting with large networks is very wasteful. A large network helps with faster convergence because there is more memory available to model the space, but pruning shows it is far from optimal.
Here is a thought.
Why don’t we start with smaller networks that through annealing grow and shrink and grow and shrink until we see no additional benefit during training? Has anyone seen any experiments using this approach?
Loving that they turned a meme into a dataset.
As per the paper:
- A reconstruction that is accurate enough to serve as a replacement for the MNIST dataset
- Traced each MNIST digit to its NIST source and its rich metadata such as writer identifier, partition identifier, etc.
- Reconstructed the complete MNIST test set with 60,000 samples instead of the usual 10,000
TVM compiler speeds up PyTorch. Same idea as Swift, but is it as fast?