Full MNIST - Chapter 4 Further Research - Without softmax

Hey there!
While trying to replicate fastbook chapter 4 on the complete MNIST dataset, do we need to use softmax or cross-entropy loss? I'm trying to do it without them, and I'm getting around ~86% classification accuracy using a non-linearity.

I'm not sure if the chain of functions is differentiable (for SGD). Am I missing a trick here? Would love to get an opinion! I've tried to use broadcasting and PyTorch functions as much as possible to avoid slow Python loops.

Here is the link to the colab file:
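For quick reference, here is also a minimal sketch of the kind of setup I mean (illustrative layer sizes and a fake batch, not the exact notebook code):

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(28 * 28, 30),  # flattened 28x28 image in
    torch.nn.ReLU(),               # the non-linearity
    torch.nn.Linear(30, 10),       # one raw score per digit out
)

def mnist_loss(preds, yb):
    preds = preds.sigmoid()  # squash each score into (0, 1)
    # penalise 1 - pred at the target class, pred at the other nine
    return preds.where(torch.arange(10) != yb, 1 - preds).mean()

# one SGD step on a fake batch, all broadcasting, no Python loops
xb = torch.rand(256, 28 * 28)
yb = torch.randint(0, 10, (256, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = mnist_loss(model(xb), yb)
loss.backward()
opt.step()
opt.zero_grad()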

Hi, I'm trying to do the same thing, without softmax or cross-entropy loss. Going through your code, the mnist_loss function is confusing me:

def mnist_loss(preds, yb):
    preds = preds.sigmoid()
    # keep preds where the class index differs from the target;
    # use 1 - preds at the target class, then average everything
    return preds.where(torch.arange(10) != yb, 1 - preds).mean()

The shape of yb will be torch.Size([256, 1]), so the result of the where call will be torch.Size([256, 10]). Is this right? If so, the mean is taken over all 256 × 10 entries, so each example's single target-class term gets averaged together with nine non-target terms. Won't that alter the loss by a large amount, or am I wrong here?
Can you guide me, please?
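To make my question concrete, here is a quick check of the shapes with random stand-ins for the real batch (the numbers are just illustrative):

import torch

yb = torch.randint(0, 10, (256, 1))  # targets, shape [256, 1]
mask = torch.arange(10) != yb        # [10] vs [256, 1] broadcasts
print(mask.shape)                    # torch.Size([256, 10])

preds = torch.rand(256, 10)          # stand-in for sigmoid outputs
# the mean runs over all 256 * 10 entries, not only the target column
print(preds.where(mask, 1 - preds).mean())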