I implemented Semantic Image Synthesis with Spatially-Adaptive Normalization (SPADE) by Nvidia, which achieved state-of-the-art results in image-to-image translation. It takes a segmentation mask and produces a colored image for that mask.
It's the first paper I've implemented completely from scratch, and I got promising results.
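The core building block is the SPADE layer itself: the features are normalized with a parameter-free norm, then modulated per pixel with a gamma and beta predicted from the (resized) segmentation map. Roughly, it looks like this (a simplified sketch with my own parameter names, not the exact code from my repo):

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    "Sketch of a SPADE block: normalize features, then modulate with per-pixel gamma/beta from the segmap."
    def __init__(self, norm_channels, label_channels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(norm_channels, affine=False)        # parameter-free normalization
        self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, norm_channels, 3, padding=1)    # per-pixel scale
        self.beta = nn.Conv2d(hidden, norm_channels, 3, padding=1)     # per-pixel shift

    def forward(self, x, segmap):
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')  # match feature-map size
        actv = self.shared(segmap)
        return self.norm(x) * (1 + self.gamma(actv)) + self.beta(actv)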
I’ve applied ULMFiT to several genomic datasets and shown improved performance over other published results. I’m currently working on a longer-form writeup.
I’ve been working on Kaggle’s Jigsaw Unintended Bias challenge and trained my model using the techniques from lessons 9 and 10. Here is my solution kernel. I tokenized with Keras because I’m not yet experienced with NLP in PyTorch. I’ll update the kernel as time goes on.
Today I was thinking about how you might go about discovering a better learning rate schedule.
The first step in my experimentation was exploring what’s going on with the relationship between the learning rate and the loss function over the course of training.
I took what we learned about callbacks this week and used it to run lr_find after each batch and record the loss landscape. Here’s the output from training Imagenette on a ResNet18 for 2 epochs (1 frozen, 1 unfrozen) with the default learning rate. The red line is the 1cycle learning rate at that batch.
And (via learn.recorder) the learning rate schedule, and loss for epoch 1 (frozen) and 2 (unfrozen):
I’m not quite sure what to make of it yet. Maybe if the learning rate schedule could dynamically update to stay just behind the point where the loss explodes, that would be helpful (in the unfrozen epoch my LR was a bit too high, so it clipped that upward slope and made things worse).
Unfortunately it’s pretty slow to run lr_find after each batch. Possible improvements would be running just a “smart” subset to find where the loss explodes, or only running it every n batches.
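For reference, the rough mechanics look something like this (a simplified sketch, not my actual callback, which triggers this every batch instead of once per epoch):

```python
from fastai.vision import *  # assumes fastai v1

def record_loss_landscape(learn, n_snapshots=10, lr=1e-3):
    "Alternate stretches of training with lr_find, collecting each LR-vs-loss curve for plotting later."
    snapshots = []
    for i in range(n_snapshots):
        learn.fit(1, lr=lr)      # simplified: one epoch per snapshot instead of a few batches
        learn.lr_find()          # lr_find should restore the weights it started from when it finishes
        snapshots.append((list(learn.recorder.lrs), list(learn.recorder.losses)))
    return snapshots
```

If I remember right, lr_find already stops early once the loss diverges (stop_div), and lowering its num_it would make each snapshot cheaper, which is roughly the “smart subset” idea above.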
Edit: one weird thing I found is that learn.opt.lr can return a value higher than the maximum learning rate (1e-3 in this case) – I’m not sure why, since learn.recorder.plot_lr doesn’t show the same thing happening.
Great work! Note that freezing doesn’t make sense for Imagenette – you shouldn’t use a pretrained ImageNet model, since the data is a subset of ImageNet, and it doesn’t make much sense to freeze a non-pretrained model.
Here’s an updated animation showing 10 epochs with no pre-training (one snapshot of lr_find every 10 batches).
It stayed pretty much in the zone! So maybe there’s not actually that much room to improve the LR schedule.
It looks like 1e-3 (which is what the lr was set to) would have been good, but it overshoots a bit according to learn.opt.lr – I’m not sure if this is an issue with opt.lr or learn.recorder, because they still don’t seem to match up.
Note: in your for loops, you write y = a @ x. I think you should write y = x @ a (the input x is multiplied by the weight matrix a to give the output y).
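For example, with the usual row-per-example layout:

```python
import torch

x = torch.randn(64, 784)   # inputs: (batch_size, in_features), one example per row
a = torch.randn(784, 10)   # weights: (in_features, out_features)

y = x @ a                  # (64, 10): each input row is multiplied by the weight matrix
# a @ x would fail here, since (784, 10) @ (64, 784) shapes don't line up
```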
In short, the authors introduce a new normalization technique for cases where we only have 1-2 images per GPU, since BN does not perform well in those cases. They use it together with Group Norm. Weight Standardization normalizes the weights of the conv layers rather than the activations. They tested it on various computer vision tasks and achieved better results than before, but all their experiments use a constant learning rate with annealing after some iterations. The main argument is that WS smooths the loss surface and normalizes the gradients in the backward pass.
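As a rough sketch (not the paper’s reference code), the layer just standardizes its weight tensor per output filter before every forward pass:

```python
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    "Conv2d that standardizes its weights (zero mean, unit std per output filter) on each forward pass."
    def forward(self, x):
        w = self.weight
        flat = w.view(w.size(0), -1)                   # one row per output filter
        mean = flat.mean(dim=1, keepdim=True)
        std = flat.std(dim=1, keepdim=True) + 1e-5     # epsilon to avoid division by zero
        w = ((flat - mean) / std).view_as(w)
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```

As I understand the paper, their setup is essentially this swapped in for every Conv2d, with Group Norm used in place of BN.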
So I tested out Weight Standardization with cyclic learning. In the blog post, I present comparisons with and without Weight Standardization for a ResNet18 model on the CIFAR-10 dataset.
But after experimenting for a day, I was not able to get better results using WS. When I use lr_find it does suggest that I can use a larger learning rate, but when I actually train the models the results are quite similar. I think the added cost of WS does not justify the performance, and that it is not a good choice for cyclic learning.
Also, I’d like feedback on my new blog style: I first introduce the paper, and then show the graphs for the results. Any comments on this approach would be appreciated.
For anyone with experience on Medium, I need some help. When I go to publish my post, it gives me an option saying, “Allow curators to recommend my story to interested readers. Recommended stories are part of Medium’s metered paywall.” I just want to keep my blog posts free, so should I use this option?
Here is a small Medium post summarizing the paper on BERT training with LAMB that was introduced in Lecture 11.
As always, corrections and comments improving the style and content are welcome.
Ever since getting into deep learning and making my first PR to PyTorch last year, I’ve been interested in digging into what’s behind the Python wrappers we use and understanding more about what’s going on at the GPU level.
The result was my talk “CUDA in your Python: Effective Parallel Programming on the GPU”, which I had the chance to present at the PyTexas conference this past weekend.
I would love any feedback on the talk, as I’m giving it again at PyCon in ~3 weeks.