DL Optimizers


After reading this great blog post http://ruder.io/optimizing-gradient-descent/ and seeing fascinating new optimizer implementations in Fast.ai, I wanted to give it a try and started implementing them from scratch in PyTorch. I used the MNIST dataset to benchmark the different optimizers. So far I've finished SGD, SGD Momentum, and Nesterov, but Nesterov doesn't converge faster than SGD Momentum as I expected. There might be an issue with my code, but I couldn't find anything. I hope someone can help me out with this.
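For context, here is a rough sketch of the two updates I implemented, written as the standard textbook versions (Sutskever-style formulation on a toy 1-D quadratic loss, not the exact code from my repo):

```python
# Toy comparison of classic momentum vs. Nesterov momentum on
# L(w) = 0.5 * w**2, whose gradient is simply w. This is the
# standard textbook formulation, not my repo's exact code.

def grad(w):
    # gradient of 0.5 * w**2
    return w

def sgd_momentum(w, steps=50, lr=0.1, mu=0.9):
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(w)   # velocity from gradient at w
        w = w + v
    return w

def nesterov(w, steps=50, lr=0.1, mu=0.9):
    v = 0.0
    for _ in range(steps):
        # gradient is evaluated at the "look-ahead" point w + mu * v
        v = mu * v - lr * grad(w + mu * v)
        w = w + v
    return w

print(sgd_momentum(5.0), nesterov(5.0))
```

The only difference is where the gradient is evaluated: Nesterov looks ahead along the velocity before computing it, which usually damps oscillations on curved losses.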

Here I am sharing the github page:

Thanks in advance!


What a great idea, and thanks for sharing. FYI, I don't particularly find Nesterov faster; it depends on the dataset. One test of your code would be to try the PyTorch optim implementation on your dataset with and without nesterov, and see if your performance is the same.

Also, I like your clear approach of showing a basic Dataset and a basic fully connected net. Really well done. Here's a slight refactoring of your forward() BTW (untested):

def forward(self, x):
    for lin in self.linears:
        lin_x = lin(x)
        x = F.relu(lin_x)
    # log_softmax over the last layer's pre-ReLU output
    return F.log_softmax(lin_x, dim=-1)

(Also - I forgot to mention, in the code you showed me yesterday I think you forgot to include F.relu.)



Thanks for the feedback. I should definitely cross-check my results with PyTorch; that way I can be sure. But I assumed a smarter ball would always be faster than a regular ball with momentum. Nevertheless, I shouldn't be assuming, as you've been telling us :slight_smile:

For the second part, the autoencoder project, I didn't include any non-linearity, since I read this in a Kaggle discussion:

Question: Cool, thank you! You mentioned you used a linear activation function, i.e. just input and weight multiplication. Did you use it for all layers (any specific reason?) or only for the bottleneck layer?

And these layer activations, did you just concat them into one huge dataset of 1-10k dimensions, as you mentioned?

Michael: I recommend a linear activation in the middle layer of a bottleneck setup, because relu truncates values < 0. Yes, just concat them into a long feature vector. For example, for a deep stacked DAE 221-1500-1500-1500-221, you get a new dataset with 4500 features.
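The truncation point is easy to see with a toy example (my own illustration, not from the Kaggle thread): a ReLU bottleneck zeroes out every negative code, while a linear (identity) bottleneck passes everything through.

```python
# A ReLU bottleneck discards negative activations; a linear (identity)
# bottleneck preserves them. Toy values, not from any real model.

def relu(xs):
    return [max(0.0, x) for x in xs]

def identity(xs):
    return list(xs)

codes = [-1.5, -0.2, 0.0, 0.7, 2.3]   # hypothetical bottleneck pre-activations

print(relu(codes))      # → [0.0, 0.0, 0.0, 0.7, 2.3]  (negatives lost)
print(identity(codes))  # → [-1.5, -0.2, 0.0, 0.7, 2.3] (all preserved)
```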

But I can definitely try both with and without activation as an ablation step. What would you say to that?

Thanks so much!

Just the middle layer, apparently?


It's very interesting: before normalizing the data, Nesterov was behaving very awkwardly, but after normalizing it seems to be right. It probably depends heavily on the loss curve and what kind of function we are trying to optimize. I actually wouldn't have noticed that I'd forgotten to normalize my data if it weren't for Adagrad's way of saying 'Hey, something is wrong with your setup' :slight_smile:
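For reference, the Adagrad update I mean is the standard one, sketched below as a simplified scalar version (not necessarily line-for-line what's in my notebook). The accumulated squared gradients in `cache` shrink the effective step size per parameter, which is what makes it so sensitive to badly scaled inputs:

```python
import math

# Simplified scalar Adagrad step (standard algorithm, hedged sketch).
# Each parameter accumulates its own squared-gradient history, and the
# step is divided by the square root of that history.

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    cache = cache + g * g                      # accumulate squared gradients
    w = w - lr * g / (math.sqrt(cache) + eps)  # per-parameter scaled step
    return w, cache
```

Parameters tied to large-magnitude (unnormalized) inputs build up a huge `cache` almost immediately, so their effective learning rate collapses while other parameters keep moving, and the imbalance shows up quickly in training curves.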

Nevertheless, it's also very cool to see how each optimizer has its own unique nature, and how cumbersome it would have been to find good ways of optimizing our loss function if it weren't for the SOTA techniques we readily use in Fast.ai. Again, appreciated :slight_smile:

Note: I've updated the notebook in my github repo for those who would like to check it out, recommend anything, or just have a glimpse at it :wink: More will come after exams; I need to study some TS and ML too…

Update: SGD, SGD Momentum, Nesterov, Adagrad, RMSProp, Adam, and Adamax are all done and cross-checked with PyTorch, plus Nadam (currently not available in the optim module). They seem solid, so you can check the implementations without any hesitation. Thanks!
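For anyone skimming, the core of a single Adam step looks roughly like this (the standard Kingma & Ba update with bias correction, written as a scalar sketch; my notebook's version may differ in minor details):

```python
import math

# Simplified scalar Adam step (standard update with bias correction).
# m and v are the running first and second moment estimates; t is the
# 1-based step count used to correct their zero initialization.

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g * g      # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)          # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

Nadam then swaps the momentum term for a Nesterov-style look-ahead, which is the main extra piece on top of this.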


This makes for excellent learning material, thank you!

No, thank you, because you are one of the people who inspired this work :slight_smile:


Wow, it's amazing! I'll definitely use it to study. :smiley:

Thanks @kcturgutlu

Great stuff @kcturgutlu!

Just finished writing lesson 7. Some of it will look quite familiar…


@kcturgutlu Are you using some custom functions?
I get the error below:

Hey Ravi, that code is not intended to run but rather to show the algorithm, so you can ignore the parts with the ### Algorithm header. The raw notebook was not visually good on GitHub, so I left it as a code chunk.

Okay. I was seeing if I could try other optimizers.


Hi Kerem

Looks interesting. Can you change your link to what I assume is now the correct one at https://github.com/KeremTurgutlu/deeplearning/blob/master/study/Exploring%20Optimizers.ipynb

I'd love to know if you ever find a way to use the built-in PyTorch MNIST data, rather than having to convert it to CSV, for loading into fastai :slight_smile:

Here is the new link:

I haven't looked into MNIST in PyTorch, but the data is available on Kaggle.


Thanks :smile:


I am working out how to use this - thanks very much - it is such a nice example of clear thinking and understanding!

Can you please give me a reference or explanation for why you are using 256 to normalise the data instead of 255?

It might be a typo :slight_smile: pixel values range from 0 to 255.
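A quick way to see the difference (a toy example of my own): dividing by 255, the maximum pixel value, maps uint8 pixels exactly onto [0, 1], while dividing by 256 never quite reaches 1.

```python
# Dividing by 255 (the true max) maps pixels onto [0, 1] exactly;
# dividing by 256 leaves the maximum pixel slightly below 1.

pixels = [0, 128, 255]
scaled_255 = [p / 255 for p in pixels]
scaled_256 = [p / 256 for p in pixels]

print(scaled_255[-1])  # → 1.0
print(scaled_256[-1])  # → 0.99609375
```

In practice the difference is tiny, but dividing by 255 is the conventional choice.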

Oh, OK - I won’t be puzzled any more :wink: