Help! Chapter 3: Full MNIST: single layer performs better than 3 layers

Hi!

I am building a Learner for the full MNIST dataset from scratch. The problem I am facing is that a single-layer model performs much better than a multi-layer model: with a single layer I was able to reach 0.63 accuracy after 500 epochs, using the L1 norm as the loss function and learning rates between 0.1 and 0.01, while the multi-layer model gets at most 0.2 accuracy even after thousands of epochs. I tried varying the learning rate and the size of the second linear layer, but the results didn't improve.
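
To be concrete, this is what I mean by the L1 norm as a loss function (a minimal sketch; I'm assuming here that the targets are one-hot encoded floats, so both arguments have shape (batch, 10)):

def l1_loss(preds, targets):
  # mean absolute difference between predictions and one-hot targets
  return (preds - targets).abs().mean()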

import torch
from torch import tensor

class LinearModel:
  def __init__(self, w, b):
    # w: shape of the weight matrix, b: length of the bias vector
    self.w = torch.rand(w).requires_grad_()
    self.b = torch.rand(b).requires_grad_()

  def params(self):
    return (self.w, self.b)

  def predict(self, xb):
    return xb@self.w + self.b


class RectifiedLinearUnit:
  def predict(self, xb):
    # element-wise max(x, 0)
    return xb.max(tensor(0.0))

  def params(self):
    return []


class NeuralNetwork:
  def __init__(self):
    self.layers = [
        LinearModel((28 * 28, 100), 100),  # bias size matches the 100 hidden units
        RectifiedLinearUnit(),
        LinearModel((100, 10), 10),
    ]

  def params(self):
    # flatten the per-layer parameter lists into one list
    l = [layer.params() for layer in self.layers]
    return [p for layer in l for p in layer]

  def predict(self, xb):
    res = xb
    for layer in self.layers:
        res = layer.predict(res)
    return res
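
For reference, this is roughly the training loop I'm running (a simplified sketch: the real code works in mini-batches, and train_x / train_y are placeholder names for the flattened images, shape (N, 28*28), and the one-hot targets, shape (N, 10)):

model = NeuralNetwork()
lr = 0.1

for epoch in range(500):
  preds = model.predict(train_x)
  loss = l1_loss(preds, train_y)
  loss.backward()
  with torch.no_grad():
    for p in model.params():
      p -= p.grad * lr  # plain SGD step
      p.grad.zero_()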

I thought something was wrong with my model, so I also tried the built-in versions:

torch.nn.Sequential(
    torch.nn.Linear(28 * 28, 10)
)

and

torch.nn.Sequential(
    torch.nn.Linear(28 * 28, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 10)
)

But I got the same results.
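
Both built-in versions were trained the same way, roughly like this (again a sketch, where model is one of the Sequential models above and train_x / train_y are the same placeholder names; I'm using torch.optim.SGD here just for brevity):

opt = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(500):
  loss = l1_loss(model(train_x), train_y)
  loss.backward()
  opt.step()
  opt.zero_grad()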

As I understand it, a multi-layer model should perform better and require less training, but for me it is exactly the opposite. Could someone please help me understand why this happens?