Leslie Smith tried cycling weight decay and that didn’t work as well as a constant weight decay. We’ve recently seen learning rate, momentum, image size, and data augmentation work well in cycles, and it seems reasonable to think that dropout would also do well when cycled. I wonder why cycling weight decay didn’t give an advantage over fixed weight decay.
Great thread. Thanks for starting this.
I have been tracking Jeremy’s Mendeley library to decide which papers to read for this course.
Dmitry Ulyanov presented his work on Deep Image Prior at the London Machine Learning Meetup last night. The idea is extremely intriguing: to perform denoising, super-resolution, inpainting etc. using an untrained CNN. It would be interesting to try this out using the fast.ai library.
My understanding of the process is the following.
- You start with a conventional deep CNN, such as ResNet or UNet, with completely randomised weights.
- The input image is a single fixed set of random pixels.
- The target output is the noisy/corrupted image.
- Run gradient descent to adjust the weights, with the objective of matching the target.
- Use early stopping to obtain desired result, before over-fitting occurs.
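The steps above can be sketched in PyTorch roughly as follows. This is only a minimal illustration of my understanding, not the authors' code: the tiny three-layer network, the 8-channel random input, the Adam optimiser, and the names `deep_image_prior`, `num_steps` and `lr` are all assumptions made for the example. The paper uses a much deeper network, and here early stopping is simply a fixed, small number of steps.

```python
import torch
import torch.nn as nn

def deep_image_prior(target, num_steps=200, lr=0.01, seed=0):
    """Fit an untrained CNN to one corrupted image of shape (1, C, H, W)."""
    torch.manual_seed(seed)
    channels, height, width = target.shape[1:]
    # Small stand-in network with random weights (assumption: the paper
    # uses a far deeper encoder-decoder such as a UNet).
    net = nn.Sequential(
        nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, channels, 3, padding=1), nn.Sigmoid(),
    )
    # The input is a single fixed tensor of random values; it is never updated.
    z = torch.rand(1, 8, height, width)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    losses = []
    # Early stopping: num_steps is kept small so the net does not fit the noise.
    for _ in range(num_steps):
        opt.zero_grad()
        loss = ((net(z) - target) ** 2).mean()  # match the corrupted target
        loss.backward()
        opt.step()
        losses.append(loss.item())
    with torch.no_grad():
        return net(z), losses
```

Calling `restored, losses = deep_image_prior(noisy)` on a `(1, 3, H, W)` tensor with values in [0, 1] returns the network output after the last step; in practice you would inspect intermediate outputs to pick the stopping point.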
I am sorry for the confusion on the 1cycle policy. It is one cycle, but I let the cycle end a little before the end of training (and keep the learning rate constant at the smallest value) to allow the weights to settle into a local minimum.
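To make the shape of that schedule concrete, here is a rough sketch in plain Python. The function name `one_cycle_lr`, the linear ramp, the default rates, and the `cycle_frac=0.9` cutoff are assumptions for illustration; the post above only says that the cycle ends a little before training does and that the rate then stays at its smallest value.

```python
def one_cycle_lr(step, total_steps, lr_min=1e-4, lr_max=1e-3, cycle_frac=0.9):
    """Learning rate at a 0-indexed step for one triangular cycle plus a flat tail."""
    cycle_steps = int(total_steps * cycle_frac)
    if step >= cycle_steps:
        # Tail of training: hold the rate at its smallest value.
        return lr_min
    half = cycle_steps / 2
    if step <= half:
        frac = step / half                  # ramp up from lr_min to lr_max
    else:
        frac = (cycle_steps - step) / half  # ramp back down to lr_min
    return lr_min + frac * (lr_max - lr_min)
```

With `total_steps=100` the rate rises over steps 0 to 45, falls back over steps 45 to 90, and stays at `lr_min` for the last 10 steps.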
It reminds me of the optimization done for style transfer. I’m experimenting with it for image restoration. What I found is that when I train on a pair of dark/light images, the network learns the structures in that image. Hence when you run a new, unseen dark image through the network, it has trouble correcting the colors on any new structure in the image. It is quite a cool paper and I’m still playing with it.
You are right that deep image prior (DIP) shares with style transfer the idea of creating a loss function that matches characteristics across two different images. However, style transfer makes use of a pre-trained network, whereas DIP does not. You should certainly not expect DIP to generalise from one image to another. You have to optimise separately for each example, which is a disadvantage, given how long it seems to take (as acknowledged by the authors). I’ve contacted Dmitry, asking him if DIP still works using smaller CNNs. I also suggested that an entropy-based measure of the output image could be useful in deciding when to stop training and to avoid fitting noise.
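As one possible reading of "entropy-based measure", here is a small NumPy sketch that computes the Shannon entropy of an image's intensity histogram. The choice of histogram entropy, the function name `image_entropy`, and the 256 bins are assumptions for illustration; how such a number would be turned into a stopping rule is left open.

```python
import numpy as np

def image_entropy(img, bins=256):
    """Shannon entropy in bits of the intensity histogram of an image in [0, 1]."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()   # normalise counts to probabilities
    p = p[p > 0]            # drop empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())
```

A flat image scores zero and pure uniform noise scores close to `log2(bins)`, so tracking this value over training steps might show when the output starts to pick up noise.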
I’d like to point out a paper. Take a look at:
Hestness, Joel, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. “Deep Learning Scaling is Predictable, Empirically.” arXiv preprint arXiv:1712.00409 (2017).
This is a nice paper by a team from Baidu Research. Section 5 is particularly practical.