A nice writeup on TensorSensor. Thanks to Terence Parr.
LambdaNetworks: Modeling long-range Interactions without Attention
From the paper: "we introduce LambdaResNets, a family of architectures that considerably improve the speed-accuracy tradeoff of image classification models. LambdaResNets reach state-of-the-art accuracies on ImageNet while being ~4.5x faster than the popular EfficientNets on modern machine learning accelerators."
PyTorch code: https://github.com/lucidrains/lambda-networks
Now not just yours ^^ (thx)
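The key trick in lambda layers is that each position attends to a fixed-size summary (a "lambda") of the context instead of to every other position, so the cost is linear rather than quadratic in the number of positions. Here is a minimal content-only sketch in plain PyTorch (my own toy reduction, not the repo's code; the real layer adds positional lambdas and multi-query heads, see the repo above):

```python
import torch
import torch.nn as nn

class ContentLambdaLayer(nn.Module):
    """Toy content-only lambda layer: single head, no positional lambdas."""
    def __init__(self, dim, dim_k=16):
        super().__init__()
        self.to_q = nn.Conv1d(dim, dim_k, 1, bias=False)
        self.to_k = nn.Conv1d(dim, dim_k, 1, bias=False)
        self.to_v = nn.Conv1d(dim, dim, 1, bias=False)

    def forward(self, x):                  # x: (batch, dim, n) flattened positions
        q = self.to_q(x)                   # (b, k, n) queries
        k = self.to_k(x).softmax(dim=-1)   # keys, normalized over positions
        v = self.to_v(x)                   # (b, d, n) values
        lam = torch.einsum('bkn,bdn->bkd', k, v)     # content lambda (b, k, d)
        return torch.einsum('bkn,bkd->bdn', q, lam)  # apply lambda to each query

layer = ContentLambdaLayer(dim=32)
x = torch.randn(2, 32, 64 * 64)            # e.g. a flattened 64x64 feature map
print(layer(x).shape)                      # torch.Size([2, 32, 4096])
```

Note the two einsums never materialize an n-by-n attention map, which is where the speed claim comes from.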
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
https://arxiv.org/pdf/1909.05840.pdf
6-7x reduction in memory compared to the BERT baseline with a 4/8-bit (weights/embeddings) group-wise quantization scheme, with minimal drop in performance
Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian based mixed-precision method to compress the model further. We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We can achieve comparable performance to baseline with at most 2.3% performance degradation, even with ultra-low precision quantization down to 2 bits, corresponding to up to 13× compression of the model parameters, and up to 4× compression of the embedding table as well as activations. Among all tasks, we observed the highest performance loss for BERT fine-tuned on SQuAD. By probing into the Hessian based analysis as well as visualization, we show that this is related to the fact that the current training/fine-tuning strategy of BERT does not converge for SQuAD.
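To make the group-wise idea concrete, here is a toy sketch of per-group symmetric quantization (illustrative only; the paper additionally assigns bit-widths per layer using Hessian spectra, which this ignores):

```python
import torch

def groupwise_quantize(W, bits=4, num_groups=8):
    # Each group of rows gets its own scale, so a single outlier row does
    # not stretch the quantization range for the whole matrix.
    qmax = 2 ** (bits - 1) - 1
    out = []
    for g in W.chunk(num_groups, dim=0):
        scale = g.abs().max().clamp(min=1e-8) / qmax    # per-group scale
        q = (g / scale).round().clamp(-qmax - 1, qmax)  # integer codes
        out.append(q * scale)                           # dequantize to simulate
    return torch.cat(out, dim=0)

W = torch.randn(768, 768)   # e.g. one attention weight matrix
err = (W - groupwise_quantize(W)).abs().mean().item()
print(f"mean abs quantization error: {err:.4f}")
```

Finer groups mean more scales to store but a tighter range per group, which is the trade-off the paper tunes.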
Impact of Accuracy on Model Interpretations
Model interpretations are often used in practice to extract real world insights from machine learning models. These interpretations have a wide range of applications; they can be presented as business recommendations or used to evaluate model bias. It is vital for a data scientist to choose trustworthy interpretations to drive real world impact. Doing so requires an understanding of how the accuracy of a model impacts the quality of standard interpretation tools. In this paper, we will explore how a model's predictive accuracy affects interpretation quality. We propose two metrics to quantify the quality of an interpretation and design an experiment to test how these metrics vary with model accuracy. We find that for datasets that can be modeled accurately by a variety of methods, simpler methods yield higher quality interpretations. We also identify which interpretation method works the best for lower levels of model accuracy.
More interesting work on self-supervised learning:
GradAug: A New Regularization Method for Deep Neural Networks
We propose a new regularization method to alleviate over-fitting in deep neural networks. The key idea is utilizing randomly transformed training samples to regularize a set of sub-networks, which originate from sampling the width of the original network, in the training process. As such, the proposed method introduces self-guided disturbances to the raw gradients of the network and is therefore termed Gradient Augmentation (GradAug). We demonstrate that GradAug can help the network learn well-generalized and more diverse representations. Moreover, it is easy to implement and can be applied to various structures and applications. GradAug improves ResNet-50 to 78.79% on ImageNet classification, which is a new state-of-the-art accuracy. By combining with CutMix, it further boosts the performance to 79.67%, which outperforms an ensemble of advanced training tricks. The generalization ability is evaluated on COCO object detection and instance segmentation, where GradAug significantly surpasses other state-of-the-art methods. GradAug is also robust to image distortions and FGSM adversarial attacks and is highly effective in low data regimes. Code is available at https://github.com/taoyang1122/GradAug
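If I read the method right, each step trains the full network on the raw sample and several width-sampled sub-networks on transformed samples, with the full network's output as a soft target. A rough sketch; `model.set_width(w)` is a hypothetical channel-slimming hook (stock torchvision models don't have one):

```python
import torch
import torch.nn.functional as F

def gradaug_step(model, x, y, optimizer, widths=(0.9, 0.8, 0.7), transform=None):
    # One GradAug-style update. `model.set_width(w)` is assumed to slim the
    # network to a fraction `w` of its channels (hypothetical hook).
    optimizer.zero_grad()
    model.set_width(1.0)
    logits = model(x)                       # full network on the raw sample
    loss = F.cross_entropy(logits, y)
    soft = logits.detach().softmax(dim=-1)  # full network acts as teacher
    for w in widths:                        # width-sampled sub-networks
        model.set_width(w)
        xt = transform(x) if transform is not None else x  # transformed sample
        sub = model(xt)
        loss = loss + F.kl_div(sub.log_softmax(dim=-1), soft,
                               reduction='batchmean')
    loss.backward()
    optimizer.step()
    model.set_width(1.0)                    # restore full width
    return loss.item()
```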
This looks super interesting. They focus on images, but I don't see why this couldn't be used for any data type, right?
I found this blogpost with a short description of the paper really interesting and relevant to all DL fields: https://mohammadpz.github.io/GradientStarvation.html#
Sharpness-Aware Minimization for Efficiently Improving Generalization
New ImageNet SOTA with improved generalization capabilities and robustness to label noise.
Abstract:
In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by the connection between geometry of the loss landscape and generalization, including a generalization bound that we prove here, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels.
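The procedure itself is compact: take an ascent step of radius rho to the (approximately) worst nearby weights, compute the gradient there, then apply that gradient from the original weights. A minimal sketch (ignores batch-norm and distributed-training subtleties):

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    # (1) ascend to w + eps, (2) get the gradient there, (3) step from w.
    loss_fn(model(x), y).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
    with torch.no_grad():
        eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
        for p, e in zip(params, eps):
            p.add_(e)                       # w -> w + eps(w)
    model.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                         # sharpness-aware gradient
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                       # back to w
    base_optimizer.step()                   # descend with the SAM gradient
    base_optimizer.zero_grad()
    return loss.item()
```

Each step costs two forward-backward passes, which is the price paid for the flatter minima.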
Super interesting, thank you! The mathy bits go a bit over my head, but the approach makes intuitive sense and the results look impressive. Efficiency and generalization are things we can always use more of
The idea of searching for "flat" minima can be traced back to Hochreiter and Schmidhuber (1995).
Why does this not surprise me
On the topic of generalization
This article talks about a huge paper, also from Google, where they investigate why models often generalize so badly to real life. I can highly recommend reading it!
Canonical Capsules: Unsupervised Capsules in Canonical Pose
Weiwei Sun (1,4,*), Andrea Tagliasacchi (2,3,*), Boyang Deng (3), Sara Sabour (2,3), Soroosh Yazdani (3), Geoffrey Hinton (2,3), Kwang Moo Yi (1,4)
(1) University of British Columbia, (2) University of Toronto, (3) Google Research, (4) University of Victoria; (*) equal contributions
Sharpness-Aware Minimization for Efficiently Improving Generalization
It might be interesting too, and a timm implementation is also available.
Seems to be giving good results.
Thanks!