Difficulties training a neural network on the complete MNIST dataset

Hello, I am currently going through this series of lectures along with the textbook. I was working on Q2 of the “Further Research” section in chapter 4 of the textbook (training a neural network on the full MNIST dataset). I am in a situation where the loss on both the training set and the validation set decreases each epoch, but the accuracy stays roughly flat at a very low value, around 0.1.

I have consulted other solutions posted by members of this forum, but I'm not sure where my solution goes wrong. I have posted my code along with comments here. I would appreciate it if anyone could point out what the problem is and why it is occurring. Thanks!

It looks like the version of the code I posted does not include train_loss and valid_loss (I was previously using the built-in Learner class, which showed those two values).

Here is the result I get when I run my code with the built-in Learner:

epoch	train_loss	valid_loss	batch_accuracy	time
0	0.515939	0.512917	0.108500	00:00
1	0.507944	0.503949	0.103000	00:00
2	0.498174	0.493175	0.103200	00:00
3	0.486648	0.480698	0.103900	00:00
4	0.473775	0.466921	0.105300	00:00
5	0.459645	0.451923	0.107300	00:00
6	0.444157	0.435620	0.115000	00:00
7	0.427215	0.417961	0.122200	00:00
8	0.408822	0.399008	0.132600	00:00
9	0.389109	0.378947	0.138200	00:00
10	0.368345	0.358084	0.139700	00:00
11	0.346908	0.336823	0.137100	00:00
12	0.325258	0.315624	0.131600	00:00
13	0.303888	0.294958	0.127600	00:00
14	0.283283	0.275261	0.125100	00:00
15	0.263865	0.256887	0.124300	00:00
16	0.245953	0.240079	0.122100	00:00
17	0.229740	0.224962	0.121200	00:00
18	0.215295	0.211551	0.121400	00:00
19	0.202581	0.199772	0.121700	00:00
20	0.191487	0.189498	0.121800	00:00
21	0.181859	0.180570	0.123500	00:00
22	0.173525	0.172825	0.124800	00:00
23	0.166315	0.166103	0.126400	00:00
24	0.160072	0.160257	0.126600	00:00
25	0.154652	0.155161	0.127100	00:00
26	0.149933	0.150701	0.128400	00:00
27	0.145808	0.146785	0.130100	00:00
28	0.142190	0.143331	0.130800	00:00
29	0.139001	0.140273	0.130600	00:00

The issue was my choice of loss function (L1 norm on sigmoid outputs). Changing the loss function to nn.CrossEntropyLoss() solved the issue.
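For anyone hitting the same wall, here is a minimal sketch of the change (the model and batch shapes are placeholders, not my actual code). The key detail is that nn.CrossEntropyLoss expects raw logits and integer class labels, so no sigmoid/softmax should be applied before the loss:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a full-MNIST model: 28x28 inputs, 10 classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# CrossEntropyLoss takes raw logits (no sigmoid!) and integer labels 0-9,
# not one-hot targets.
loss_func = nn.CrossEntropyLoss()

# Fake batch standing in for a DataLoader batch
xb = torch.randn(64, 1, 28, 28)   # images
yb = torch.randint(0, 10, (64,))  # integer class labels

logits = model(xb)                # shape (64, 10), raw scores
loss = loss_func(logits, yb)
loss.backward()                   # gradients flow through the softmax inside the loss
```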

Based on my intuition, I did expect the model to train with the L1 norm + sigmoid, albeit not as efficiently as with cross-entropy loss. However, I noticed the accuracy does not increase at all… if someone knows the intuition behind this, I would be happy to learn about it.
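One hedged guess at the intuition, sketched in code: with one-hot targets, 9 of the 10 entries are 0, so an L1 loss on sigmoid outputs can fall just by shrinking *all* outputs toward 0. That lowers the loss without changing the argmax ranking at all, so accuracy stays near chance. (The numbers below are illustrative, not my actual model's outputs.)

```python
import torch

torch.manual_seed(0)
targets = torch.eye(10)[torch.randint(0, 10, (1000,))]  # one-hot labels

# Untrained model: sigmoid of random logits hovers around 0.5 everywhere.
random_out = torch.sigmoid(torch.randn(1000, 10))

# "Trained" model that merely pushed every output toward 0, with no
# per-class preference -- its predictions carry no ranking information.
shrunk_out = torch.full((1000, 10), 0.05)

l1 = lambda pred, t: (pred - t).abs().mean()
print(l1(random_out, targets))  # roughly 0.5, like epoch 0 in my table
print(l1(shrunk_out, targets))  # noticeably lower, yet accuracy is still ~10%
```

This would also explain why my table starts near 0.51: an L1 loss between sigmoid outputs near 0.5 and mostly-zero one-hot targets is about 0.5 by construction.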