Good readings 2020

Interesting report:

5 Likes

One-sentence Summary: Transformers applied directly to image patches and pre-trained on large datasets work really well on image classification
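
For anyone wanting intuition for how "image patches as tokens" works, here is a minimal PyTorch sketch of the patch-embedding front end (dimensions follow the ViT-Base setup; the class and variable names are mine, not the official code): the image is cut into 16x16 patches, each patch is linearly projected into a token, a class token and position embeddings are added, and the resulting sequence goes into a standard Transformer encoder.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal sketch of the ViT front end: cut the image into 16x16 patches,
    project each patch linearly, and prepend a class token so a standard
    Transformer encoder can consume the image as a token sequence."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # a strided conv is equivalent to "flatten each patch + linear projection"
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                    # x: (batch, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)          # (batch, 196, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed   # (batch, 197, dim)

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```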

3 Likes

the code for An Image is Worth 16x16 Words: Transformers for Image Recognition…

5 Likes

A nice writeup on TensorSensor. Thanks to Terence Parr.

2 Likes

LambdaNetworks: Modeling long-range Interactions without Attention

From the paper: we introduce LambdaResNets, a family of architectures that considerably improve the speed-accuracy tradeoff of image classification models. LambdaResNets reach state-of-the-art accuracies on ImageNet while being ∼4.5x faster than the popular EfficientNets on modern machine learning accelerators.

PyTorch code: https://github.com/lucidrains/lambda-networks
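
To get a feel for what replaces attention here, this is a rough, content-only sketch of a lambda layer in plain PyTorch (the positional lambdas that the paper relies on for spatial structure are left out, and the names are mine, not the repo's API): the context is summarized into a small k x v matrix that is then applied to every query, so the cost stays linear in the number of positions instead of quadratic.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentLambdaLayer(nn.Module):
    """Content-only lambda layer sketch: summarize the context into a small
    (k x v) 'lambda' matrix and apply it to every query position, instead of
    materializing an n x n attention map."""
    def __init__(self, dim, dim_k=16, dim_v=None, heads=4):
        super().__init__()
        dim_v = dim_v or dim // heads
        self.heads, self.dim_k, self.dim_v = heads, dim_k, dim_v
        self.to_q = nn.Linear(dim, dim_k * heads, bias=False)
        self.to_k = nn.Linear(dim, dim_k, bias=False)
        self.to_v = nn.Linear(dim, dim_v, bias=False)
        self.to_out = nn.Linear(dim_v * heads, dim)

    def forward(self, x):                                     # x: (batch, n, dim)
        b, n, _ = x.shape
        q = self.to_q(x).view(b, n, self.heads, self.dim_k)   # (b, n, h, k)
        k = F.softmax(self.to_k(x), dim=1)                    # normalize keys over positions
        v = self.to_v(x)                                      # (b, n, v)
        lam = torch.einsum('bnk,bnv->bkv', k, v)              # content lambda, independent of n
        out = torch.einsum('bnhk,bkv->bnhv', q, lam)          # same lambda applied to every query
        return self.to_out(out.reshape(b, n, self.heads * self.dim_v))

layer = ContentLambdaLayer(dim=32)
x = torch.randn(2, 64, 32)        # e.g. an 8x8 feature map flattened to 64 positions
print(layer(x).shape)             # torch.Size([2, 64, 32])
```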

5 Likes

Now not just yours ^^ (thx)

1 Like

Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

A 6-7x memory reduction compared to the BERT baseline, using a 4/8-bit (weights/embeddings) group-wise quantization scheme with minimal drop in performance.

Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian based mix-precision method to compress the model further. We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We can achieve comparable performance to baseline with at most 2.3% performance degradation, even with ultra-low precision quantization down to 2 bits, corresponding up to 13× compression of the model parameters, and up to 4× compression of the embedding table as well as activations. Among all tasks, we observed the highest performance loss for BERT fine-tuned on SQuAD. By probing into the Hessian based analysis as well as visualization, we show that this is related to the fact that current training/fine-tuning strategy of BERT does not converge for SQuAD.
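
As a toy illustration of the group-wise part (not the authors' code, and ignoring the Hessian-based mixed-precision assignment): each weight matrix is split into groups that each get their own quantization range, so a few large values in one group don't blow up the quantization error everywhere else. The function below simulates 4-bit group-wise quantization; the grouping layout is my own simplification.

```python
import torch

def groupwise_quantize(w, n_bits=4, n_groups=128):
    """Simulated group-wise uniform quantization: each group of rows gets its
    own scale, limiting the damage outliers can do to the shared range."""
    assert w.shape[0] % n_groups == 0
    groups = w.view(n_groups, -1)                              # (n_groups, elems_per_group)
    qmax = 2 ** (n_bits - 1) - 1
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).view_as(w)                              # dequantized ("fake quantized") weights

w = torch.randn(768, 768)                          # e.g. a BERT-base projection matrix
print((w - groupwise_quantize(w)).abs().mean())    # mean quantization error
```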

3 Likes

Impact of Accuracy on Model Interpretations

Model interpretations are often used in practice to extract real world insights from machine learning models. These interpretations have a wide range of applications; they can be presented as business recommendations or used to evaluate model bias. It is vital for a data scientist to choose trustworthy interpretations to drive real world impact. Doing so requires an understanding of how the accuracy of a model impacts the quality of standard interpretation tools. In this paper, we will explore how a model’s predictive accuracy affects interpretation quality. We propose two metrics to quantify the quality of an interpretation and design an experiment to test how these metrics vary with model accuracy. We find that for datasets that can be modeled accurately by a variety of methods, simpler methods yield higher quality interpretations. We also identify which interpretation method works the best for lower levels of model accuracy.

1 Like

More interesting work on self-supervised learning:

1 Like

GradAug: A New Regularization Method for Deep Neural Networks

We propose a new regularization method to alleviate over-fitting in deep neural networks. The key idea is utilizing randomly transformed training samples to regularize a set of sub-networks, which are originated by sampling the width of the original network, in the training process. As such, the proposed method introduces self-guided disturbances to the raw gradients of the network and therefore is termed as Gradient Augmentation (GradAug). We demonstrate that GradAug can help the network learn well-generalized and more diverse representations. Moreover, it is easy to implement and can be applied to various structures and applications. GradAug improves ResNet-50 to 78.79% on ImageNet classification, which is a new state-of-the-art accuracy. By combining with CutMix, it further boosts the performance to 79.67%, which outperforms an ensemble of advanced training tricks. The generalization ability is evaluated on COCO object detection and instance segmentation where GradAug significantly surpasses other state-of-the-art methods. GradAug is also robust to image distortions and FGSM adversarial attacks and is highly effective in low data regimes. Code is available at https://github.com/taoyang1122/GradAug
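
Going just by the abstract, the training step seems to boil down to something like the sketch below: the full-width network is trained on the original batch, width-sampled sub-networks are trained on randomly transformed copies, and all losses are summed before a single backward pass. `model.set_width` and `transform` are hypothetical hooks standing in for the real implementation in the linked repo, and the paper additionally distills the full network's predictions into the sub-networks, which I've dropped here for brevity.

```python
import torch
import torch.nn.functional as F

def gradaug_step(model, x, y, optimizer, widths=(0.9, 0.8, 0.7), transform=None):
    """Rough GradAug-style training step (sketch based on the abstract only)."""
    optimizer.zero_grad()

    model.set_width(1.0)                      # hypothetical hook: use the full network
    loss = F.cross_entropy(model(x), y)       # full network sees the original batch

    for w in widths:                          # sub-network widths (sampled randomly in the paper)
        model.set_width(w)                    # hypothetical hook: switch to a narrower sub-network
        x_aug = transform(x) if transform else x   # e.g. random resized crop / rotation
        loss = loss + F.cross_entropy(model(x_aug), y)

    loss.backward()                           # sub-network gradients act as "self-guided disturbances"
    optimizer.step()
    return loss.item()
```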

4 Likes

This looks super interesting. They focus on images, but I don't see why this couldn't be used for any data type, right?

1 Like

I found this blog post, which gives a short description of the paper, really interesting and relevant to all DL fields: https://mohammadpz.github.io/GradientStarvation.html#

3 Likes

Sharpness-Aware Minimization for Efficiently Improving Generalization

New ImageNet SOTA with improved generalization capabilities and robustness to label noise.

Abstract:

In today’s heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by the connection between geometry of the loss landscape and generalization – including a generalization bound that we prove here – we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels.
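
The min-max procedure is simple enough to sketch in a few lines: take an ascent step to the (approximately) worst point within an L2 ball of radius rho around the current weights, measure the gradient there, then apply that gradient back at the original weights. A minimal, unoptimized PyTorch version (not the authors' implementation) might look like this:

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One Sharpness-Aware Minimization update (two forward/backward passes)."""
    model.zero_grad()

    # 1) gradient at the current weights
    loss = loss_fn(model(x), y)
    loss.backward()

    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm(p=2) for p in params]), p=2)

    # 2) climb to the approximate worst point in an L2 ball of radius rho
    eps = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    model.zero_grad()

    # 3) gradient at the perturbed point, applied from the original weights
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                         # undo the perturbation
    base_optimizer.step()                     # step with the sharpness-aware gradient
    model.zero_grad()
    return loss.item()
```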

4 Likes

Super interesting, thank you! The mathy bits go a bit over my head, but the approach makes intuitive sense and the results look impressive. Efficiency and generalization are things we can always use more of :slight_smile:

The idea of searching for “flat” minima can be traced back to Hochreiter and Schmidhuber (1995)

Why does this not surprise me :sweat_smile:

On the topic of generalization:
This article discusses a huge paper, also from Google, that investigates why models often generalize so poorly to real-world data. I can highly recommend reading it!

2 Likes

Canonical Capsules: Unsupervised Capsules in Canonical Pose

Weiwei Sun [1,4]*, Andrea Tagliasacchi [2,3]*, Boyang Deng [3], Sara Sabour [2,3], Soroosh Yazdani [3], Geoffrey Hinton [2,3], Kwang Moo Yi [1,4]

[1] University of British Columbia, [2] University of Toronto, [3] Google Research, [4] University of Victoria. * Equal contribution.

[ArXiv] [Code] [PDF]

Sharpness-Aware Minimization for Efficiently Improving Generalization

Code: https://github.com/google-research/sam

1 Like

It might be interesting too, and a timm implementation is also available.

3 Likes

Seems to give good results.

4 Likes

Thanks!

8 posts were merged into an existing topic: Good Readings 2021