Chapter 4: Full MNIST from Scratch - help

I have been banging my head against a wall on this for a few days now. I finished chapter 5 and decided to go back and have a crack at the full MNIST from scratch problem, implementing softmax and NLL manually to help my understanding; however, I've hit a few snags.

First of all, when I initialize the parameters with an std of 1.0 using this code:

def init_params(size, std=1.0): return (torch.randn(size)*std).requires_grad_()
w1 = init_params((28*28,30))
b1 = init_params(30)
w2 = init_params((30,10))
b2 = init_params(10)
params = w1,b1,w2,b2

and then run this:

def train_epoch(model, lr, params):
    for xb,yb in train_dl:
        calc_grad(xb,yb, model)
        for p in params:
            print(p)
            opt.step()
            opt.zero_grad()

The second of the params, which is b1, prints as:

tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], requires_grad=True)

and I get:

<ipython-input-79-20ce620bb67a> in step(self, *args, **kwargs)
     22     def step(self, *args, **kwargs):
     23         print()
---> 24         for p in self.params: p.data -= p.grad.data * self.lr
     25 
     26     def zero_grad(self, *args, **kwargs):

AttributeError: 'NoneType' object has no attribute 'data'

If I change std to 0.5, I get slightly different behavior:

def init_params(size, std=.5): return (torch.randn(size)*std).requires_grad_()

w1 = init_params((28*28,30))
b1 = init_params(30)
w2 = init_params((30,10))
b2 = init_params(10)
params = w1,b1,w2,b2


When I call train_epoch with the lower std value, the b1 tensor appears OK:

tensor([-0.7091, -0.2209, -0.4982, -0.1451, -0.0528, -0.0409, -0.3094,  0.3262,  0.0605,  1.4114,  0.5296, -0.1523, -0.1857, -0.2357, -0.0586, -0.0773, -0.0062, -0.1788,  0.5182, -0.0185, -0.4281,
        -0.1036,  0.2513,  0.3971, -0.2471, -0.0673,  0.4250,  0.0252, -0.9616, -0.1530], requires_grad=True)

but I still get the same error afterward:

AttributeError: 'NoneType' object has no attribute 'data'

I think perhaps I've made a dog's breakfast out of my code, but I've been banging my head against the wall for so long that my head is spinning. Here's my code:

Hi, it sounds to me like you put the opt.step() inside the for p in params loop, but step() already does that iteration itself.
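
Something like this should work instead (just a rough sketch, assuming opt, train_dl and calc_grad are the ones from your notebook, and that your optimizer's step() and zero_grad() loop over self.params as in your traceback):

def train_epoch(model, lr, params):
    for xb,yb in train_dl:
        calc_grad(xb, yb, model)
        opt.step()       # step() already loops over every param internally
        opt.zero_grad()  # and so does zero_grad()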

Yes! Well spotted mate, thank you.

Edit: actually this hasn't fixed my issue.

With std in init_params at 1.0 my loss is still NaN; strangely, if I set it to 0.5 my loss returns a numeric value, but my model doesn't improve. I am at a loss (haha) as to where I've gone wrong.

The problem seems related to your softmax: if you get the preds and look at torch.exp(preds), you'll see that some values are infinite.
Maybe this is the reason why torch's cross-entropy doesn't take the softmax and then the log separately, but instead calculates the softmax and the log together for numerical stability (I think).
Try using some of the torch functions (e.g. nn.Parameter, or log_softmax()) to see exactly what makes your model fail.
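
For example, a quick check like this (just a sketch, reusing the model and xb from your training loop) shows where it blows up:

preds = model(xb)                       # raw activations (logits)
print(torch.exp(preds).isinf().any())   # True here means a naive softmax will overflow
print(torch.log_softmax(preds, dim=1))  # stays finite even for large activations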

I think you are onto something. I notice that this:

def NLL(preds, targets):
    preds = SOFTMAX(preds)                      # convert activations to probabilities
    idx = range(len(targets))
    truth = preds[idx, targets]                 # probability assigned to the correct class
    print(truth[~torch.log(truth).isfinite()])  # show the ones whose log blows up

prints a bunch of tensor(0.) values, which makes sense because log(0) is -inf.

The problem appears to be that my SOFTMAX function returns 0 when the input values get moderately big or small (below -100 or above +100).

torch.softmax appears to behave the same, though:

torch.softmax(tensor([[-100.,90.]]),dim=1)
#returns: tensor([[0., 1.]])

torch.softmax(tensor([[-2.,3.]]),dim=1)
#returns: tensor([[0.0067, 0.9933]])

I’m not sure how to refactor to ensure the softmax never returns 0.

I suspect I have a fundamental misunderstanding about how this is supposed to work :thinking:

Yes, look at how the stability changes using different functions:

torch.exp(tensor([[-100.,90.]]))/torch.exp(tensor([[-100.,90.]])).sum(dim=1,keepdim=True) = tensor([[0., nan]])

torch.softmax(tensor([[-100.,90.]]),dim=1) = tensor([[0., 1.]])

torch.softmax(tensor([[-100.,90.]]),dim=1).log() = tensor([[-inf, 0.]])

torch.log_softmax(tensor([[-100.,90.]]), dim=1) = tensor([[-190., 0.]])

so I would say the last function is what I would use.

And yes, the problem is related to how "big" your final activations are, which comes down to how big your parameters are.
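
If you want to keep writing it yourself, the usual fix (just a sketch of the idea, using my own names) is to subtract the per-row max before exponentiating, i.e. the log-sum-exp trick that log_softmax does internally:

def log_softmax(x):
    # subtracting the row-wise max doesn't change the result,
    # but keeps exp() from overflowing (the log-sum-exp trick)
    x = x - x.max(dim=1, keepdim=True).values
    return x - x.exp().sum(dim=1, keepdim=True).log()

def nll(log_probs, targets):
    # mean negative log-probability of the correct class
    return -log_probs[range(len(targets)), targets].mean()

Then the loss is nll(log_softmax(model(xb)), yb), and you can sanity-check it against torch.nn.functional.cross_entropy.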

Thanks mate, that has definitely helped. I found this today, which I tried to use in my softmax function, but I couldn't get it to work. I think I will just stick to the functions provided by PyTorch.