Full MNIST - Chapter 4 Further Research - Without softmax

Hey there!
While trying to replicate fastbook chapter 4 on the complete MNIST dataset, do we need to use softmax or cross-entropy loss? I’m trying to do it without them, and I’m getting around 86% classification accuracy using a non-linearity.

I’m not sure if the chain of functions is differentiable (for SGD). Am I missing a trick here? Would love to get an opinion! I’ve tried to use broadcasting and PyTorch functions as much as possible to avoid slow Python loops.
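
To make the differentiability question concrete, here is a minimal sketch of how it could be checked with autograd. It is only an illustration under assumptions, not the actual notebook code: the data is random stand-in data instead of MNIST, the model is a single linear layer, and `mnist_loss_full` is a made-up name for a 10-class version of the chapter's sigmoid-based `mnist_loss`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in data (assumption): 64 fake flattened 28x28 images and random digit labels.
x = torch.rand(64, 28 * 28)
y = torch.randint(0, 10, (64,))

# Hypothetical single linear layer: 784 inputs -> 10 class scores.
w = (torch.randn(28 * 28, 10) * 0.01).requires_grad_()
b = torch.zeros(10, requires_grad=True)

def mnist_loss_full(preds, targets):
    """Sigmoid per class, then distance to the one-hot target (no softmax)."""
    probs = preds.sigmoid()
    one_hot = F.one_hot(targets, num_classes=10).float()
    # For the correct class we want probs -> 1, for every other class probs -> 0.
    return torch.where(one_hot == 1, 1 - probs, probs).mean()

preds = x @ w + b                # broadcasting adds b to every row
loss = mnist_loss_full(preds, y)
loss.backward()                  # fails or leaves .grad empty if the chain is broken

# If every step is differentiable, the weights receive a non-zero gradient.
grads_ok = w.grad is not None and bool(w.grad.abs().sum() > 0)

# By contrast, argmax returns integer indices, so accuracy carries no gradient
# and can only be used as a metric, not as a loss.
acc = (preds.argmax(dim=1) == y).float().mean()
acc_differentiable = acc.requires_grad

print(f"loss has grad_fn: {loss.grad_fn is not None}")
print(f"weights received gradient: {grads_ok}")
print(f"accuracy is differentiable: {acc_differentiable}")
```

If the loss has a `grad_fn` and the parameters end up with non-zero `.grad`, the chain is differentiable as far as autograd is concerned. A step like `argmax`, rounding, or a comparison anywhere before the loss would cut the gradient, which is the usual thing to look for.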

