I have been playing around with an MNIST classifier for a few weeks, and noticed in Lecture 7 that Jeremy did a resnet-like MNIST model that is similar to what I have been doing. So I thought I’d share it here.
In my model, I reliably get an accuracy on the test set of 99.82 percent. This is with a committee of 35 classifiers, where I average their softmax outputs to come up with a predicted probability distribution.
The individual nets average 99.73 percent accuracy, and the committee is formed by taking n of them with no selection criteria whatsoever.
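The averaging itself is nothing fancy; here is a minimal sketch of what I mean by averaging the softmax outputs (the names models and x are just placeholders for the list of trained nets and a batch of test images):

    import torch
    import torch.nn.functional as F

    def committee_predict(models, x):
        # Average the softmax outputs of every net, then take the argmax.
        # Assumes each model is already in eval mode and on the same device as x.
        with torch.no_grad():
            probs = torch.stack([F.softmax(m(x), dim=1) for m in models]).mean(dim=0)
        return probs.argmax(dim=1)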
So on the 10,000-image test set, a 99.82 percent accurate classifier makes 18 misclassifications. Here is a typical run:
[Image: misclassifications from a typical run]

Interestingly, I think all or nearly all of these are genuine misclassifications. In other words, I agree with ground truth here. Most of the characters are marginal and I can certainly see why my classifier chose what it did, but the ground truth labels are, in my opinion, subtly perceptive and correct. So there is still room for improvement!
I’ve posted the business end of my model below, which you can plug into Jeremy’s notebook. I use one-cycle training with ten epochs total for a model, with a slight variation on the momentum schedule.
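For concreteness, the training recipe is the standard one-cycle setup; here is a rough sketch in plain PyTorch (the max_lr value is only illustrative, and my small tweak to the momentum schedule is not reproduced here):

    import torch

    def train_one_net(model, train_loader, epochs=10, max_lr=1e-3, device="cuda"):
        # One-cycle training with Adam. OneCycleLR also cycles Adam's beta1,
        # which plays the role of momentum; my variation on that schedule is omitted.
        model.to(device)
        opt = torch.optim.Adam(model.parameters())
        sched = torch.optim.lr_scheduler.OneCycleLR(
            opt, max_lr=max_lr, epochs=epochs, steps_per_epoch=len(train_loader))
        loss_fn = torch.nn.CrossEntropyLoss()  # applies log-softmax internally
        for _ in range(epochs):
            for xb, yb in train_loader:
                xb, yb = xb.to(device), yb.to(device)
                opt.zero_grad()
                loss_fn(model(xb), yb).backward()
                opt.step()
                sched.step()
        return model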
My data augmentation is different, and I think that's what pushes this classifier beyond any published result I've seen. (The best I've seen published is 99.79 percent; I'm interested to hear if anybody has seen better than that.)
For augmentation, I use a small amount of random elastic distortion, a small amount of random rotation (within plus or minus 10 degrees), and a random crop of the 28x28 images down to 25x25, resized back up to 28x28 (this translates the image a little and also thickens the strokes as a side effect).
At inference time I apply the same augmentation, except that instead of a random crop I take a single 25x25 center crop (again resized back to 28x28).
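In torchvision terms the pipeline looks roughly like this. This is a sketch rather than my exact code: the elastic distortion strengths are placeholders, and ElasticTransform needs a reasonably recent torchvision.

    from torchvision import transforms

    # Training-time augmentation: mild elastic distortion, small rotation,
    # random 25x25 crop resized back to 28x28 (translation plus slight thickening).
    train_tfms = transforms.Compose([
        transforms.ToTensor(),
        transforms.ElasticTransform(alpha=8.0, sigma=3.0),  # placeholder strengths
        transforms.RandomRotation(10),
        transforms.RandomCrop(25),
        transforms.Resize(28),
    ])

    # Inference-time: same distortions, but a single deterministic center crop.
    test_tfms = transforms.Compose([
        transforms.ToTensor(),
        transforms.ElasticTransform(alpha=8.0, sigma=3.0),
        transforms.RandomRotation(10),
        transforms.CenterCrop(25),
        transforms.Resize(28),
    ])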
The model itself is the same idea as Jeremy's: inspired by ResNet, but with fewer layers, and the skip connections wrap a single conv layer. I err on the side of too many batchnorms. There is a small dense layer at the end (about 5,000 parameters). I use no dropout, no weight decay, and the Adam optimizer with a one-cycle schedule.
My model has about 100x more parameters than Jeremy's, even though the layer structure is similar; my convolutions just output many more channels. Still, each net trains in about 2 minutes on my 1080 Ti.
import torch.nn as nn
import torch.nn.functional as F

# BatchNorm keyword arguments; the original notebook supplies its own values here.
# An empty dict just uses the PyTorch defaults.
bn_params = {}

def mnist_model():
    return nn.Sequential(
        nn.Conv2d(in_channels=1, out_channels=128, kernel_size=5, padding=2),
        nn.ReLU(),
        Residual(128),
        nn.MaxPool2d(2),
        Residual(128),
        nn.BatchNorm2d(128),
        nn.Conv2d(128, 256, 3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        Residual(256),
        nn.BatchNorm2d(256),
        nn.Conv2d(256, 512, 3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2, ceil_mode=True),
        Residual(512),
        nn.BatchNorm2d(512),
        nn.AvgPool2d(kernel_size=4),
        Flatten(),
        nn.Linear(512, 10),
        # Softmax is provided by the loss function during training.
    )
class Residual(nn.Module):
    "A skip connection around a single 3x3 conv layer, preceded by batchnorm."
    def __init__(self, d):
        super().__init__()
        self.bn = nn.BatchNorm2d(d, **bn_params)
        self.conv3x3 = nn.Conv2d(in_channels=d, out_channels=d, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.bn(x)
        return x + F.relu(self.conv3x3(x))
class Flatten(nn.Module):
    "Flattens each batch element to a single vector."
    def forward(self, x):
        return x.view(x.size(0), -1)
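And a quick sanity check that the pieces fit together; the output should be one logit per class for each image in the batch:

    import torch

    model = mnist_model()
    x = torch.randn(8, 1, 28, 28)   # a dummy batch of 8 MNIST-sized images
    print(model(x).shape)           # torch.Size([8, 10])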