Research Paper Recommendations

I found this paper on “Curriculum dropout”: https://arxiv.org/abs/1703.06229

Skimming it (SKIMMING!), it looks like they increase dropout but never decrease it.

Has anyone tried gradually reducing BPTT and increasing bs for LM-backed classifiers? It seemed to work quite well for me, but I have only tried it on the Quora dataset so far.
Initially, I dismissed this, thinking that with a smaller bptt there are simply fewer pad tokens for a dataset like Quora, where item length is approx. 20 (mean + std dev).
So I tried training with a fixed, small bptt of around 20 and compared:

  1. Steadily increasing bs
  2. A fixed large bs

Neither worked as well as gradually reducing bptt while increasing bs. The batch matrix has to go from short and stout to tall and lean: the LM was trained at bptt 20, and the classifier started at bptt 50 and came down to 20 towards the end of training. A rough sketch of that kind of schedule is shown below.
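
Concretely, the schedule looked something like the sketch below. The exact start/end values and the `build_classifier_loader` helper are placeholders for illustration, not the code I actually ran; the point is just to rebuild the batches with a new (bptt, bs) each epoch.

```python
import numpy as np

def bptt_bs_schedule(n_epochs, bptt_start=50, bptt_end=20, bs_start=32, bs_end=128):
    """Linearly shrink bptt while growing the batch size over training (illustrative values)."""
    bptts = np.linspace(bptt_start, bptt_end, n_epochs).round().astype(int)
    bss = np.linspace(bs_start, bs_end, n_epochs).round().astype(int)
    return list(zip(bptts, bss))

for epoch, (bptt, bs) in enumerate(bptt_bs_schedule(n_epochs=10)):
    # Hypothetical helpers: rebuild the classifier's data loader with the scheduled
    # shapes, then train for one epoch, e.g.:
    #   train_loader = build_classifier_loader(texts, bptt=bptt, bs=bs)
    #   train_one_epoch(model, train_loader)
    print(f"epoch {epoch}: bptt={bptt}, bs={bs}")
```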

Leslie Smith tried cycling weight decay and found that it didn't work as well as a constant weight decay. We've recently seen learning rate, momentum, image size, and data augmentation all work well when cycled, so it seems reasonable to think that dropout would also do well in a cycle; I wonder why cycling weight decay didn't give an advantage over a fixed value.

Great thread. Thanks for starting this.

I have been tracking Jeremy's Mendeley library to find papers to read for this course.

3 Likes

Dmitry Ulyanov presented his work on Deep Image Prior at the London Machine Learning Meetup last night. The idea is extremely intriguing: perform de-noising, super-resolution, inpainting, etc. using an untrained CNN. It would be interesting to try this out using the fast.ai library.
My understanding of the process is the following.

  1. You start with a conventional deep CNN, such as ResNet or UNet, with completely randomised weights.
  2. The input image is a single fixed set of random pixels.
  3. The target output is the noisy/corrupted image.
  4. Run gradient descent to adjust the weights, with the objective of matching the target.
  5. Use early stopping to obtain the desired result, before over-fitting occurs.
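
If I have understood the steps correctly, the whole loop fits in a few lines of PyTorch. The tiny conv net, image size, and iteration counts below are made up for the sake of a runnable sketch; the paper uses much larger UNet/ResNet-style architectures:

```python
import torch
import torch.nn as nn

# Stand-in for the deep CNN of step 1 (the paper uses much bigger UNet/ResNet-style nets).
net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)

z = torch.randn(1, 3, 64, 64)            # step 2: a single fixed random input
corrupted = torch.rand(1, 3, 64, 64)     # step 3: stand-in for the noisy/corrupted target

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(2000):                 # step 4: fit the weights so net(z) matches the target
    opt.zero_grad()
    loss = loss_fn(net(z), corrupted)
    loss.backward()
    opt.step()
    if step == 1000:                     # step 5: stop early, before the net starts fitting the noise
        break                            # (the stopping point is chosen heuristically)

restored = net(z).detach()               # the de-noised / restored image
```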


I am sorry for the confusion about the 1cycle policy. It is one cycle, but I let the cycle end a little before the end of training (and keep the learning rate constant at its smallest value) to allow the weights to settle into the local minimum.
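
In other words (my reading of it, with made-up numbers), the learning-rate schedule looks roughly like this:

```python
import numpy as np

def one_cycle_with_flat_end(n_iters, lr_min=1e-5, lr_max=1e-3, cycle_frac=0.9):
    """One triangular LR cycle over the first cycle_frac of training, then hold the LR
    constant at lr_min for the remaining iterations (illustrative values only)."""
    cycle_len = int(n_iters * cycle_frac)
    half = cycle_len // 2
    up = np.linspace(lr_min, lr_max, half)
    down = np.linspace(lr_max, lr_min, cycle_len - half)
    flat = np.full(n_iters - cycle_len, lr_min)
    return np.concatenate([up, down, flat])

lrs = one_cycle_with_flat_end(1000)   # lrs[i] is the learning rate at iteration i
```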

9 Likes

It reminds me of the optimization done for style transfer. I'm experimenting with it for image restoration. What I found is that when I train on a pair of dark/light images, the network learns the structures in that image, so when you run a new, unseen dark image through it, it has trouble correcting the colors on any structure it hasn't seen before. It is quite a cool paper and I'm still playing with it.

You are right that deep image prior (DIP) shares with style transfer the idea of creating a loss function that matches characteristics across two different images. However, style transfer makes use of a pre-trained network, whereas DIP does not. You should certainly not expect DIP to generalise from one image to another: you have to optimise separately for each example, which is a disadvantage given how long it seems to take (as acknowledged by the authors). I've contacted Dmitry to ask whether DIP still works using smaller CNNs. I also suggested that an entropy-based measure of the output image could be useful in deciding when to stop training, to avoid fitting the noise.

I’d like to point out a paper. Take a look at:

Hestness, Joel, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. "Deep Learning Scaling is Predictable, Empirically." arXiv preprint arXiv:1712.00409 (2017).

This is a nice paper by a team from Baidu Research. Section 5 is particularly practical.

3 Likes

This looks like a nice project for anyone with time to enhance the fast.ai framework: Google's AutoAugment paper https://arxiv.org/abs/1805.09501

In short, the idea is an algorithm that learns which augmentations work and which do not for a given dataset; e.g. shearing and inversion work particularly well on street/house-number images.

I routinely use a lot more augmentation than shown in the lectures (n=24 is my starting point), but I always feel a little guilty about how much of it is blind dart-throwing. So it is nice to see some research on what works where, and why.
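
For anyone who wants to play with the idea before a proper implementation lands in the library, a toy version is just sampling a small pipeline from a pool of candidate ops. The ops below are hand-picked placeholders, not the learned policies from the paper, and there is no controller actually searching over them:

```python
import random
from PIL import ImageOps
from torchvision import transforms

# Hand-written candidate ops standing in for the paper's search space.
CANDIDATE_OPS = [
    transforms.RandomAffine(degrees=0, shear=15),           # shearing
    transforms.Lambda(ImageOps.invert),                     # inversion
    transforms.ColorJitter(brightness=0.4, contrast=0.4),   # photometric jitter
    transforms.RandomRotation(15),
]

def random_policy(n_ops=2):
    """Sample a small augmentation pipeline (random here; AutoAugment learns which
    combinations of op, probability and magnitude actually help validation accuracy)."""
    return transforms.Compose(random.sample(CANDIDATE_OPS, n_ops))

aug = random_policy()   # apply `aug` to a PIL image before converting it to a tensor
```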

This one looked interesting and it seems quite close to what is shown in part 2: https://arxiv.org/abs/1805.07932

I started to try to use it in fastai, inspired by Jeremy's point that we can pass any PyTorch model to the learner, but I have to admit I got lost in the code at some point. Either it is quite difficult to understand or their code is a bit too involved. In any case, it might be an interesting model for an interesting task.
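
For what it's worth, the "pass any PyTorch model to the learner" part is the easy bit; the model code is where I got lost. A minimal sketch, assuming the fastai v1 `Learner` API, a placeholder model, and a hypothetical image folder (the real model from the paper would replace `MyModel`):

```python
import torch.nn as nn
from fastai.vision import ImageDataBunch
from fastai.basic_train import Learner

class MyModel(nn.Module):
    """Placeholder for the paper's model, written as a plain PyTorch module."""
    def __init__(self, n_classes):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.head(self.body(x))

# 'data/my_images' is a hypothetical folder with train/valid subfolders arranged by class.
data = ImageDataBunch.from_folder('data/my_images', size=224)
learn = Learner(data, MyModel(n_classes=data.c), loss_func=nn.CrossEntropyLoss())
learn.fit(5, 1e-3)
```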

Regards,
Theodore.

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

https://arxiv.org/abs/1711.03953

1 Like

Where does Jeremy show this?

The Lottery Ticket Hypothesis: Finding Small, Trainable Neural Networks
https://arxiv.org/abs/1803.03635

The paper proposes a way to prune unnecessary weights of a network such that the resulting network is 20%-50% of the original size, while converging 7x faster and improving accuracy. It seems too good to be true and I am very interested in trying it with fast.ai/PyTorch.
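
From the abstract, my understanding of the recipe is: train, prune the smallest-magnitude weights, rewind the surviving weights to their original initialization, and retrain with the pruning mask kept fixed. A rough PyTorch sketch of one pruning round (training steps omitted, pruning fraction made up):

```python
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
init_state = copy.deepcopy(model.state_dict())    # remember the original initialization

# ... train `model` normally here ...

# Prune: build masks that keep only the largest-magnitude weights.
prune_frac = 0.8                                   # drop 80% of each weight matrix (illustrative)
masks = {}
for name, p in model.named_parameters():
    if p.dim() > 1:                                # prune weight matrices, leave biases alone
        k = int(p.numel() * prune_frac)
        threshold = p.abs().flatten().kthvalue(k).values
        masks[name] = (p.abs() > threshold).float()

# Rewind: reset the surviving weights to their *initial* values, zero out the rest.
model.load_state_dict(init_state)
with torch.no_grad():
    for name, p in model.named_parameters():
        if name in masks:
            p.mul_(masks[name])
# ... retrain, re-applying the same masks after each optimizer step so pruned weights stay zero ...
```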

1 Like

There is a new, thought-provoking paper on the current sloppiness of ML research:
Troubling Trends in Machine Learning Scholarship
https://arxiv.org/abs/1807.03341

2 Likes

Thanks

Quick question: I think @jeremy mentioned that after the ULMFiT paper, DeepMind utilized ULMFiT and stacked some additional algorithm on top of it? Did I hear that right, and what is that paper?

This looks pretty interesting: https://arxiv.org/abs/1810.12890
They apply dropout to regions rather than single activations in CNNs. They argue that dropping only single activations is not sufficient as the activations are spatially correlated and the information still “flows through”.
I have not read it in detail yet, but it looks pretty neat, also including comparisons to other dropout papers.

I wonder if it makes sense to apply drops mainly around the objects in question (in tasks where the object locations are known) :thinking:
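
I haven't checked this against their exact algorithm, but the gist of zeroing out contiguous regions of a feature map (instead of single activations) could look roughly like the toy sketch below; the block size, drop rate, and rescaling are my guesses rather than the paper's settings:

```python
import torch
import torch.nn.functional as F

def drop_region(x, drop_prob=0.1, block_size=7):
    """Toy region dropout: zero out block_size x block_size patches of a feature map
    rather than individual activations (rough sketch, not the paper's exact method)."""
    if drop_prob == 0.0:
        return x
    gamma = drop_prob / (block_size ** 2)            # rough per-position seed rate
    seeds = (torch.rand_like(x) < gamma).float()     # sample block centres
    block_mask = F.max_pool2d(seeds, kernel_size=block_size,
                              stride=1, padding=block_size // 2)
    keep_mask = 1.0 - block_mask                     # 0 inside dropped blocks, 1 elsewhere
    # Rescale so the expected activation magnitude stays roughly constant.
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)

feats = torch.randn(8, 64, 32, 32)                   # e.g. a conv feature map
dropped = drop_region(feats)
```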

Here is a presentation of CornerNet, which is supposed to be the new state of the art instead of SSD:

I found his ideas really great.

1 Like