DL Optimizers


After reading this great blog post http://ruder.io/optimizing-gradient-descent/ and seeing fascinating new optimizer implementations in Fast.ai, I wanted to give it a try and started implementing them from scratch in PyTorch. I used the MNIST dataset to benchmark the different optimizers. So far I've finished SGD, SGD Momentum, and Nesterov, but Nesterov doesn't converge faster than SGD Momentum as I expected. There might be an issue with my code, but I couldn't find anything. I hope someone can help me out with this.
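For context, here is a rough sketch of the two updates I implemented, written as the standard textbook versions (Sutskever-style formulation on a toy 1-D quadratic loss, not the exact code from my repo):

```python
# Toy comparison of classic momentum vs. Nesterov momentum on
# L(w) = 0.5 * w**2, whose gradient is simply w. This is the
# standard textbook formulation, not my repo's exact code.

def grad(w):
    # gradient of 0.5 * w**2
    return w

def sgd_momentum(w, steps=50, lr=0.1, mu=0.9):
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(w)   # velocity from gradient at w
        w = w + v
    return w

def nesterov(w, steps=50, lr=0.1, mu=0.9):
    v = 0.0
    for _ in range(steps):
        # gradient is evaluated at the "look-ahead" point w + mu * v
        v = mu * v - lr * grad(w + mu * v)
        w = w + v
    return w

print(sgd_momentum(5.0), nesterov(5.0))
```

The only difference is where the gradient is evaluated: Nesterov looks ahead along the velocity before computing it, which usually damps oscillations on curved losses.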

Here I am sharing the github page:

Thanks in advance!


What a great idea, and thanks for sharing. FYI, I don't particularly find Nesterov faster; it depends on the dataset. One test of your code would be to try the PyTorch optim implementation on your dataset with and without nesterov, and see if your performance is the same.

Also, I like your clear approach of showing a basic Dataset and a basic fully connected net. Really well done. Here's a slight refactoring of your forward() BTW (untested):

def forward(self, x):
    for lin in self.linears:
        lin_x = lin(x)
        x = F.relu(lin_x)
    # log_softmax over the last layer's pre-ReLU output
    return F.log_softmax(lin_x, dim=-1)

(Also - I forgot to mention, in the code you showed me yesterday I think you forgot to include F.relu.)



Thanks for the feedback. I should definitely cross-check my results with PyTorch; that way I can be sure. But I assumed a smarter ball would always be faster than a regular ball with momentum. Nevertheless, I shouldn't be assuming, as you've been telling us :slight_smile:

For the second part, the autoencoder project, I didn't include any non-linearity, since I read this in a Kaggle discussion:

Question: Cool, thank you! You mentioned you used a linear activation function, i.e. just input and weight multiplication. Did you use it for all layers (any specific reason?) or only for the bottleneck layer?

And these layer activations, did you just concat them into one huge dataset of 1-10k dimensions, as you mentioned?

Michael: I recommend a linear activation in the middle layer of a bottleneck setup, because relu truncates values < 0. Yes, just concat them into a long feature vector. For example, for a deep stacked DAE 221-1500-1500-1500-221, you get a new dataset with 4500 features.
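The truncation point is easy to see with a toy example (my own illustration, not from the Kaggle thread): a ReLU bottleneck zeroes out every negative code, while a linear (identity) bottleneck passes everything through.

```python
# A ReLU bottleneck discards negative activations; a linear (identity)
# bottleneck preserves them. Toy values, not from any real model.

def relu(xs):
    return [max(0.0, x) for x in xs]

def identity(xs):
    return list(xs)

codes = [-1.5, -0.2, 0.0, 0.7, 2.3]   # hypothetical bottleneck pre-activations

print(relu(codes))      # → [0.0, 0.0, 0.0, 0.7, 2.3]  (negatives lost)
print(identity(codes))  # → [-1.5, -0.2, 0.0, 0.7, 2.3] (all preserved)
```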

But I can definitely try both with and without activation as an ablation step. What would you say to that?

Thanks so much!

Just the middle layer, apparently?


It's very interesting: before normalizing the data, Nesterov was behaving very awkwardly, but after normalizing it seems to be right. It probably depends heavily on the loss curve and what kind of function we are trying to optimize. I actually wouldn't have noticed that I'd forgotten to normalize my data if it weren't for Adagrad's way of saying 'Hey, something is wrong with your setup' :slight_smile:
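For reference, the Adagrad update I mean is the standard one, sketched below as a simplified scalar version (not necessarily line-for-line what's in my notebook). The accumulated squared gradients in `cache` shrink the effective step size per parameter, which is what makes it so sensitive to badly scaled inputs:

```python
import math

# Simplified scalar Adagrad step (standard algorithm, hedged sketch).
# Each parameter accumulates its own squared-gradient history, and the
# step is divided by the square root of that history.

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    cache = cache + g * g                      # accumulate squared gradients
    w = w - lr * g / (math.sqrt(cache) + eps)  # per-parameter scaled step
    return w, cache
```

Parameters tied to large-magnitude (unnormalized) inputs build up a huge `cache` almost immediately, so their effective learning rate collapses while other parameters keep moving, and the imbalance shows up quickly in training curves.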

Nevertheless, it's also very cool to see how each optimizer has its own unique nature, and how cumbersome it would have been to find good ways of optimizing our loss function if it weren't for the SOTA techniques we readily use in Fast.ai. Again, appreciated :slight_smile:

Note: I've updated the notebook in my github repo for those who would like to check it out, recommend anything, or just have a glimpse at it :wink: More will come after exams; I need to study some TS and ML too…

Update: SGD, SGD Momentum, Nesterov, Adagrad, RMSProp, Adam, and Adamax are all done and cross-checked with PyTorch, plus Nadam (currently not available in the optim module). They seem solid, so you can check the implementations without any hesitation. Thanks!
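For anyone skimming, the core of a single Adam step looks roughly like this (the standard Kingma & Ba update with bias correction, written as a scalar sketch; my notebook's version may differ in minor details):

```python
import math

# Simplified scalar Adam step (standard update with bias correction).
# m and v are the running first and second moment estimates; t is the
# 1-based step count used to correct their zero initialization.

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g * g      # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)          # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

Nadam then swaps the momentum term for a Nesterov-style look-ahead, which is the main extra piece on top of this.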


This makes for excellent learning material, thank you!

No, thank you, because you are one of the people who inspired this work :slight_smile:


Wow, it's amazing! I'll definitely use it to study. :smiley:

Thanks @kcturgutlu

Great stuff @kcturgutlu!

Just finished writing lesson 7. Some of it will look quite familiar…


@kcturgutlu Are you using some custom functions?
I get the error below:

Hey Ravi, that code is not intended to run but rather to show the algorithm, so you can ignore the parts with the ### Algorithm header. The raw notebook was not visually good on GitHub, so I left it as a code chunk.

Okay. I was seeing if I could try other optimizers.


Hi Kerem

Looks interesting. Can you change your link to what I assume is now the correct one at https://github.com/KeremTurgutlu/deeplearning/blob/master/study/Exploring%20Optimizers.ipynb

I'd love to know if you ever find a way to use the built-in PyTorch MNIST data, rather than having to convert it to CSV, for loading into fastai :slight_smile:

Here is the new link:

I haven't looked into MNIST in PyTorch, but the data is available on Kaggle.


Thanks :smile:


I am working out how to use this - thanks very much - it is such a nice example of clear thinking and understanding!

Can you please give me a reference or explanation for why you are using 256 to normalise the data instead of 255?

It might be a typo :slight_smile: pixel values range from 0 to 255.
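A quick way to see the difference (a toy example of my own): dividing by 255, the maximum pixel value, maps uint8 pixels exactly onto [0, 1], while dividing by 256 never quite reaches 1.

```python
# Dividing by 255 (the true max) maps pixels onto [0, 1] exactly;
# dividing by 256 leaves the maximum pixel slightly below 1.

pixels = [0, 128, 255]
scaled_255 = [p / 255 for p in pixels]
scaled_256 = [p / 256 for p in pixels]

print(scaled_255[-1])  # → 1.0
print(scaled_256[-1])  # → 0.99609375
```

In practice the difference is tiny, but dividing by 255 is the conventional choice.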

Oh, OK - I won’t be puzzled any more :wink: