A nice writeup on TensorSensor. Thanks to Terence Parr.
LambdaNetworks: Modeling long-range Interactions without Attention
From the paper: "we introduce LambdaResNets, a family of architectures that considerably improve the speed-accuracy tradeoff of image classification models. LambdaResNets reach state-of-the-art accuracies on ImageNet while being ~4.5x faster than the popular EfficientNets on modern machine learning accelerators."
PyTorch code: https://github.com/lucidrains/lambda-networks
Now not just yours ^^ (thx)
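The key trick in lambda layers is that each position attends to a fixed-size summary (a "lambda") of the context instead of to every other position, so the cost is linear rather than quadratic in the number of positions. Here is a minimal content-only sketch in plain PyTorch (my own toy reduction, not the repo's code; the real layer adds positional lambdas and multi-query heads, see the repo above):

```python
import torch
import torch.nn as nn

class ContentLambdaLayer(nn.Module):
    """Toy content-only lambda layer: single head, no positional lambdas."""
    def __init__(self, dim, dim_k=16):
        super().__init__()
        self.to_q = nn.Conv1d(dim, dim_k, 1, bias=False)
        self.to_k = nn.Conv1d(dim, dim_k, 1, bias=False)
        self.to_v = nn.Conv1d(dim, dim, 1, bias=False)

    def forward(self, x):                  # x: (batch, dim, n) flattened positions
        q = self.to_q(x)                   # (b, k, n) queries
        k = self.to_k(x).softmax(dim=-1)   # keys, normalized over positions
        v = self.to_v(x)                   # (b, d, n) values
        lam = torch.einsum('bkn,bdn->bkd', k, v)     # content lambda (b, k, d)
        return torch.einsum('bkn,bkd->bdn', q, lam)  # apply lambda to each query

layer = ContentLambdaLayer(dim=32)
x = torch.randn(2, 32, 64 * 64)            # e.g. a flattened 64x64 feature map
print(layer(x).shape)                      # torch.Size([2, 32, 4096])
```

Note the two einsums never materialize an n-by-n attention map, which is where the speed claim comes from.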
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
https://arxiv.org/pdf/1909.05840.pdf
6-7x reduction in memory compared to the BERT baseline with a 4/8-bit (weights/embeddings) group-wise quantization scheme, with minimal drop in performance
Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian based mixed-precision method to compress the model further. We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We can achieve comparable performance to baseline with at most 2.3% performance degradation, even with ultra-low precision quantization down to 2 bits, corresponding to up to 13× compression of the model parameters, and up to 4× compression of the embedding table as well as activations. Among all tasks, we observed the highest performance loss for BERT fine-tuned on SQuAD. By probing into the Hessian based analysis as well as visualization, we show that this is related to the fact that the current training/fine-tuning strategy of BERT does not converge for SQuAD.
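To make the group-wise idea concrete, here is a toy sketch of per-group symmetric quantization (illustrative only; the paper additionally assigns bit-widths per layer using Hessian spectra, which this ignores):

```python
import torch

def groupwise_quantize(W, bits=4, num_groups=8):
    # Each group of rows gets its own scale, so a single outlier row does
    # not stretch the quantization range for the whole matrix.
    qmax = 2 ** (bits - 1) - 1
    out = []
    for g in W.chunk(num_groups, dim=0):
        scale = g.abs().max().clamp(min=1e-8) / qmax    # per-group scale
        q = (g / scale).round().clamp(-qmax - 1, qmax)  # integer codes
        out.append(q * scale)                           # dequantize to simulate
    return torch.cat(out, dim=0)

W = torch.randn(768, 768)   # e.g. one attention weight matrix
err = (W - groupwise_quantize(W)).abs().mean().item()
print(f"mean abs quantization error: {err:.4f}")
```

Finer groups mean more scales to store but a tighter range per group, which is the trade-off the paper tunes.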
Impact of Accuracy on Model Interpretations
Model interpretations are often used in practice to extract real world insights from machine learning models. These interpretations have a wide range of applications; they can be presented as business recommendations or used to evaluate model bias. It is vital for a data scientist to choose trustworthy interpretations to drive real world impact. Doing so requires an understanding of how the accuracy of a model impacts the quality of standard interpretation tools. In this paper, we will explore how a model's predictive accuracy affects interpretation quality. We propose two metrics to quantify the quality of an interpretation and design an experiment to test how these metrics vary with model accuracy. We find that for datasets that can be modeled accurately by a variety of methods, simpler methods yield higher quality interpretations. We also identify which interpretation method works the best for lower levels of model accuracy.
More interesting work on self-supervised learning:
GradAug: A New Regularization Method for Deep Neural Networks
We propose a new regularization method to alleviate over-fitting in deep neural networks. The key idea is utilizing randomly transformed training samples to regularize a set of sub-networks, which originate from sampling the width of the original network, in the training process. As such, the proposed method introduces self-guided disturbances to the raw gradients of the network and is therefore termed Gradient Augmentation (GradAug). We demonstrate that GradAug can help the network learn well-generalized and more diverse representations. Moreover, it is easy to implement and can be applied to various structures and applications. GradAug improves ResNet-50 to 78.79% on ImageNet classification, which is a new state-of-the-art accuracy. By combining with CutMix, it further boosts the performance to 79.67%, which outperforms an ensemble of advanced training tricks. The generalization ability is evaluated on COCO object detection and instance segmentation, where GradAug significantly surpasses other state-of-the-art methods. GradAug is also robust to image distortions and FGSM adversarial attacks and is highly effective in low data regimes. Code is available at https://github.com/taoyang1122/GradAug
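If I read the method right, each step trains the full network on the raw sample and several width-sampled sub-networks on transformed samples, with the full network's output as a soft target. A rough sketch; `model.set_width(w)` is a hypothetical channel-slimming hook (stock torchvision models don't have one):

```python
import torch
import torch.nn.functional as F

def gradaug_step(model, x, y, optimizer, widths=(0.9, 0.8, 0.7), transform=None):
    # One GradAug-style update. `model.set_width(w)` is assumed to slim the
    # network to a fraction `w` of its channels (hypothetical hook).
    optimizer.zero_grad()
    model.set_width(1.0)
    logits = model(x)                       # full network on the raw sample
    loss = F.cross_entropy(logits, y)
    soft = logits.detach().softmax(dim=-1)  # full network acts as teacher
    for w in widths:                        # width-sampled sub-networks
        model.set_width(w)
        xt = transform(x) if transform is not None else x  # transformed sample
        sub = model(xt)
        loss = loss + F.kl_div(sub.log_softmax(dim=-1), soft,
                               reduction='batchmean')
    loss.backward()
    optimizer.step()
    model.set_width(1.0)                    # restore full width
    return loss.item()
```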
This looks super interesting. They focus on images, but I don't see why this couldn't be used for any data type, right?
I found this blogpost with a short description of the paper really interesting and relevant to all DL fields: https://mohammadpz.github.io/GradientStarvation.html#
Sharpness-Aware Minimization for Efficiently Improving Generalization
New ImageNet SOTA with improved generalization capabilities and robustness to label noise.
Abstract:
In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by the connection between geometry of the loss landscape and generalization, including a generalization bound that we prove here, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels.
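The procedure itself is compact: take an ascent step of radius rho to the (approximately) worst nearby weights, compute the gradient there, then apply that gradient from the original weights. A minimal sketch (ignores batch-norm and distributed-training subtleties):

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    # (1) ascend to w + eps, (2) get the gradient there, (3) step from w.
    loss_fn(model(x), y).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
    with torch.no_grad():
        eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
        for p, e in zip(params, eps):
            p.add_(e)                       # w -> w + eps(w)
    model.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                         # sharpness-aware gradient
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                       # back to w
    base_optimizer.step()                   # descend with the SAM gradient
    base_optimizer.zero_grad()
    return loss.item()
```

Each step costs two forward-backward passes, which is the price paid for the flatter minima.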
Super interesting, thank you! The mathy bits go a bit over my head, but the approach makes intuitive sense and the results look impressive. Efficiency and generalization are things we can always use more of
The idea of searching for "flat" minima can be traced back to Hochreiter and Schmidhuber (1995).
Why does this not surprise me
On the topic of generalization
This article talks about a huge paper, also from Google, where they investigate why models often generalize so badly to real life. I can highly recommend reading it!
Canonical Capsules: Unsupervised Capsules in Canonical Pose
Weiwei Sun (1,4,*), Andrea Tagliasacchi (2,3,*), Boyang Deng (3), Sara Sabour (2,3), Soroosh Yazdani (3), Geoffrey Hinton (2,3), Kwang Moo Yi (1,4)
(1) University of British Columbia, (2) University of Toronto, (3) Google Research, (4) University of Victoria; (*) equal contributions
Sharpness-Aware Minimization for Efficiently Improving Generalization
It might be interesting too, and a timm implementation is also available.
Seems to be giving good results.
Thanks!