I’m trying to implement the cross-entropy module from scratch (without batches, for simplicity), but I’m a bit lost with the gradient calculation (you can find some notes about this here: https://www.ics.uci.edu/~pjsadows/notes.pdf, for instance).
With the MNIST data and the same layers as in Part 2, Lessons 8 and 9 [ Linear([784, 50]), ReLU() and Linear([50, 10]) ], I calculate the loss with CrossEntropy and call backward(). But when backward() calls CrossEntropy::bwd, input.grad must be a [50000, 10] tensor, and I don’t know how to compute the derivative given:
- output: just a number
- input: a [50000, 10] tensor
- target: a tensor
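For reference, if I’m reading the linked notes correctly, the derivative of the averaged cross-entropy loss with respect to the logits should be `(softmax(input) - one_hot(target)) / N`, i.e. `grad[i, j] = (softmax(input)[i, j] - (1 if target[i] == j else 0)) / N`, with N the number of samples. What I don’t see is how to translate that into the bwd method.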
Here is my CrossEntropy module source code:
```python
def logsumexp(x):
    m = x.max(-1)[0]   # keep only the max values (x.max(-1) returns (values, indices))
    return m + (x - m[:, None]).exp().sum(-1).log()

def log_softmax(x):
    return x - x.logsumexp(-1, keepdim=True)   # uses PyTorch's built-in logsumexp

def nll(input, target):
    # pick the log-probability of the correct class for every sample and average
    return -input[range(target.shape[0]), target].mean()

class CrossEntropy(Module):
    def forward(self, input, target):
        return nll(log_softmax(input), target)

    def bwd(self, output, input, target):
        aux = torch.argmax(input, dim=1) - target
        input.grad = ???   # <-- this is the part I can't figure out
```
In aux I have the difference between the predicted class (the argmax of the output) and the real target as a tensor, but I suspect that is not what the gradient actually needs.
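Based on the formula above, my best guess for bwd (untested, and exactly what I’d like someone to check) would be something like this, recovering the softmax as `log_softmax(input).exp()` and building the one-hot target with indexing:

```python
# Untested guess, assuming the gradient really is (softmax - one_hot) / N;
# this would replace the bwd method of the CrossEntropy class above.
def bwd(self, output, input, target):
    n = input.shape[0]                      # number of samples (here 50000)
    sm = log_softmax(input).exp()           # softmax probabilities, shape [n, 10]
    one_hot = torch.zeros_like(input)
    one_hot[range(n), target] = 1.          # one-hot encoding of the targets
    input.grad = (sm - one_hot) / n         # gradient of the mean NLL w.r.t. the logits
```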
Here is the fit method (without batches) of the Model class:
```python
def fit(self, x, target, epochs, lr=0.1):
    for epoch in range(epochs):
        out = self(x, target)              # forward pass through all the layers
        loss = self.loss(out, target)      # CrossEntropy forward
        self.backward()                    # backward pass, should fill p.grad everywhere
        with torch.no_grad():
            for layer in self.layers:
                if hasattr(layer, '_parameters'):
                    for p in layer._parameters.values():
                        p -= p.grad * lr   # plain SGD update
                        p.grad.zero_()
```
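For completeness, this is roughly how I drive it (the Model constructor arguments here are just illustrative, not the exact signature):

```python
# Rough usage sketch -- constructor arguments are illustrative only.
layers = [Linear([784, 50]), ReLU(), Linear([50, 10])]
model = Model(layers, CrossEntropy())
model.fit(x_train, y_train, epochs=10, lr=0.1)   # x_train: [50000, 784] floats, y_train: integer class labels
```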
Could someone help me with the CrossEntropy::bwd implementation?