Thread for Blogs (Just created one for ResNet)

pierreguillou · November 15, 2017, 3:58pm

Just one comment: in my understanding, you must use cycle_save_name in learn.fit() if you want to save the weights after each cycle (of length cycle_len) and get the average of the weights at the end.

jamesrequa · November 15, 2017, 8:42pm

Is that the way it works tho, does cycle_save_name give you the average of the weights? Or does it save the one that has the minimum validation loss (i.e. like the Keras model checkpoint callback)? cc @jeremy

jeremy · November 15, 2017, 9:34pm

It saves after every cycle. It’s up to you to load them up and average them. See planet_cv.ipynb for an example.

jamesrequa · November 15, 2017, 10:15pm

So is this similar to the effect of snapshot ensembling then? That is, you use cycle_save_name and then take the average of the preds generated from the weights saved at each cycle, with the idea that each cycle perhaps found its own local minimum and thus extracted some unique information. It would follow that this could give you a better result than just choosing the one set of saved weights that had the minimum validation loss?

jeremy · November 15, 2017, 11:08pm

Yes it’s exactly that

apil.tamang · November 16, 2017, 2:22am

Beg to differ, and even sorrier I haven’t actually tried this out, but wanted to chime in…

I wonder if taking the average of the weights would be a good idea when ensembling predictions. It makes sense to take the final predictions and average them. However, taking the average of the weights… umm… that’s a little counterintuitive, at least to me.

I feel like for any model trained to a point, the weights are optimized in relation to neurons within the neural network. I strongly feel like taking the averages of these weights wouldn’t necessarily translate in a linear way. I.e. the final performance of the network with the average weights wouldn’t be the final performance of the averaged predictions (that is: using ensemble in the traditional way).
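
To make the distinction concrete, here is a minimal NumPy sketch (not fastai code; the tiny network, the random "snapshots", and all names are made up for illustration). It compares averaging the predictions of two weight sets with predicting once from the averaged weights.

```python
import numpy as np

def relu_net(x, w1, w2):
    """Tiny two-layer network: linear -> ReLU -> linear."""
    return np.maximum(x @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))  # 5 made-up inputs with 4 features each

# Two hypothetical "snapshots" of the same architecture.
w1_a, w2_a = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))
w1_b, w2_b = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))

# Traditional ensemble: predict with each snapshot, then average.
avg_of_preds = (relu_net(x, w1_a, w2_a) + relu_net(x, w1_b, w2_b)) / 2

# Weight averaging: average the parameters, then predict once.
preds_of_avg = relu_net(x, (w1_a + w1_b) / 2, (w2_a + w2_b) / 2)

# True only if both procedures give the same outputs.
same = np.allclose(avg_of_preds, preds_of_avg)
```

For randomly drawn weights like these the two results differ, so the two procedures are not interchangeable in general. The sketch says nothing about which one performs better on a trained model.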

But I could be wrong. I didn’t even know this wasn’t used by default. Just my 2 cents

jamesrequa · November 16, 2017, 2:33am

I think “average of weights” in this case means loading each set of saved weights, predicting with each one individually, and then taking the average of those predictions, not taking an average of the actual weights themselves. Yeah, I agree that would be kinda strange lol
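
A minimal sketch of that reading, assuming each saved cycle has already been loaded and used to produce class probabilities. The probabilities below are made up, and `ensemble_predictions` is a hypothetical helper, not part of fastai.

```python
import numpy as np

def ensemble_predictions(snapshot_probs):
    """Average class probabilities from several snapshots.

    snapshot_probs: list of arrays, each shaped (n_samples, n_classes),
    one array per saved cycle.
    """
    stacked = np.stack(snapshot_probs)  # (n_snapshots, n_samples, n_classes)
    return stacked.mean(axis=0)

# Made-up probabilities from three hypothetical cycles (2 samples, 2 classes).
cycle_probs = [
    np.array([[0.9, 0.1], [0.4, 0.6]]),
    np.array([[0.7, 0.3], [0.2, 0.8]]),
    np.array([[0.8, 0.2], [0.6, 0.4]]),
]

avg = ensemble_predictions(cycle_probs)  # averaged probabilities
labels = avg.argmax(axis=1)              # final class per sample
```

Each snapshot gets an equal vote here; weighting the snapshots differently would be a separate design choice.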

Btw here is a paper about snapshot ensembling which explains this concept in a lot more detail. Basically the point is that we can implement this technique with fastai by using cycle_save_name. That was the real “aha” moment for me and I’m excited to test it out.

"We show in a series of experiments that our approach is compatible with diverse network architectures and learning tasks. It consistently yields lower error rates than state-of-the-art single models at no additional training cost, and compares favorably with traditional network ensembles"
https://arxiv.org/abs/1704.00109

radek · November 16, 2017, 12:49pm

I think that taking an average of weights is also a valid approach, even though we have nonlinearities. Think about dropout for example: it is exactly what it relies on. Yeah, it gives you less dependency between activations (one nice effect), while it also effectively trains exponentially many models that are averaged at runtime.
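
A small NumPy sketch of the dropout point, under loud assumptions: a single linear layer with no nonlinearity, a keep probability of 0.5, and made-up inputs and weights. In that special case, averaging the outputs of every dropout sub-model gives exactly the same result as keeping all units and scaling by the keep probability. With nonlinearities the scaling is only an approximation, and these sub-models share weights, unlike separately saved snapshots.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)       # one made-up input with 4 features
w = rng.normal(size=(4, 3))  # a single linear layer, no nonlinearity
keep_prob = 0.5              # with 0.5, every mask is equally likely

# Enumerate all 2**4 dropout masks and average the sub-model outputs.
masks = np.array(list(itertools.product([0.0, 1.0], repeat=4)))
avg_over_masks = np.mean([(m * x) @ w for m in masks], axis=0)

# Test-time dropout: keep every unit and scale by keep_prob.
scaled = keep_prob * (x @ w)

# True when the explicit average matches the scaled single pass.
exact_for_linear = np.allclose(avg_over_masks, scaled)
```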

Definitely looking forward to further voices in this discussion, and will gladly stand corrected if wrong. Interesting conversation!

jeremy · November 16, 2017, 5:45pm

I’d be very surprised if that worked, but I can’t say I’ve tried it.

bushaev · November 16, 2017, 8:06pm

I know I’m kinda late to the party but I also wrote a post about cyclic learning rates

ramesh · November 16, 2017, 8:11pm

That’s an awesome blog post with tons of references. Thanks for sharing.

mark-hoffmann · November 16, 2017, 10:17pm

I also just wrote my first deep learning blog post. I’d appreciate any feedback and improvements people may have as I hope to start making this a regular habit!

jeremy · November 16, 2017, 11:40pm

Wow that was absolutely worth the wait!

jeremy · November 16, 2017, 11:41pm

I hope you do too

thunderingtyphoons · November 17, 2017, 12:36am

That was a great post! Keep them coming

manikanta_s · November 19, 2017, 7:07am

Hi all,

I wanted to break my initial inertia in writing blogs, so I came up with my first simple blog post. I am attaching the draft here. Please review and let me know what you think. And my fellow bloggers, please let me know which tools you use for technical blogs (to draw diagrams etc.).

Transfer Learning

miguel_perez · November 19, 2017, 11:10am

Very clear, nice to read, informative post!

The only thing that maybe, only maybe, could be a bit more precise: when explaining the criteria for deciding whether to “unfreeze” only the last layers, the middle layers, or all layers, equal weight is put on “size of dataset” and “how similar this dataset is to the originally trained one”. I’ve thought a bit about this… and I think the second aspect, similarity, is the more important one. (Even if you have a very small dataset, if it is completely different from the original one you will need to unfreeze.) But maybe this precision is unnecessary when first explaining the concept.

Anyway, congrats on a very good post!

manikanta_s · November 19, 2017, 5:29pm

Thanks a lot for the review, @miguel_perez.

I think if we have a small dataset that is not similar to the original dataset, training all the layers may not yield a good result, because the data will not be sufficient for the whole architecture to learn ideal weights. This is my current understanding; please correct me if I am wrong.

anandsaha · November 19, 2017, 5:54pm

Here is the guidance from cs231n:
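
For reference, the four cases in the CS231n transfer learning notes can be paraphrased as a small lookup. This is a sketch paraphrased from those notes, not a quote, and `transfer_strategy` and its arguments are hypothetical names.

```python
def transfer_strategy(small_dataset: bool, similar_to_original: bool) -> str:
    """Return a rough fine-tuning suggestion for the four dataset cases."""
    if small_dataset and similar_to_original:
        # Little data, so avoid fine-tuning; top-layer features are relevant.
        return "train a linear classifier on top-layer features"
    if not small_dataset and similar_to_original:
        # Enough data to fine-tune without much risk of overfitting.
        return "fine-tune through the full network"
    if small_dataset and not similar_to_original:
        # Top layers are too specific to the original data; use earlier ones.
        return "train a linear classifier on features from earlier layers"
    # Large and different: plenty of data, pretrained init still tends to help.
    return "fine-tune the entire network starting from the pretrained weights"

suggestion = transfer_strategy(small_dataset=True, similar_to_original=False)
```

Note that in the small-and-different case this paraphrase suggests reusing earlier layers rather than retraining everything.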

miguel_perez · November 19, 2017, 6:20pm

Well, I am quite a noob so don’t trust my opinion 100%…

About the question of unfreezing: the way I see it, it makes the model more powerful and therefore more prone to overfitting, which is bad if you have a small dataset. So it is true that a small dataset is a reason not to unfreeze.

But my point is, well, imagine you have a very small dataset that is so different from the one used for the pretrained model that the pretrained weights are worthless. You will have to unfreeze for that dataset, no matter how small it is.