How can I approach the creation of a digit classifier?


I’m doing one of the challenges in chapter 4 of the fastbook in which a digit classifier has to be made using the MNIST dataset.

I’m a bit confused about how I should go about creating this classifier. There will obviously be 10 different labels (the digits 0 through 9), but I’m not quite sure how I should go about assigning the labels. Below are some ideas of mine.

The first idea I have is essentially k-nearest neighbors. I calculate the linear combination for all images in the training set, calculate the linear combination for the input image, and then use the majority label among the k training samples whose linear combinations are closest to that of the input image. The problem I have with this is that I don’t quite see how the system could improve itself over time with this approach, or how I could add nonlinearity to it. However, I suppose this approach could be useful as a baseline.

The second idea I have is to use the softmax function. I’ve never used it, but know what it is: it’s an extension of the sigmoid function for multiclass classification. Where the sigmoid function can classify two labels (is an input larger than or less than zero), the softmax function can classify to multiple labels (is a value in between two set values).

However, I’m not quite sure how to set the intervals for which to classify a digit (e.g., using arbitrary values: a value between 0.0 and 0.1 is classified as a 0, a value between 0.1 and 0.2 is classified as a 1, etc.). One idea I have is that I could calculate the average linear combination of all digits respectively in the training set, then pass it through the sigmoid function:

{0: tensor([0.4247]),
1: tensor([0.4462]),
2: tensor([0.4960]),
3: tensor([0.5454]),
4: tensor([0.4937]),
5: tensor([0.5575]),
6: tensor([0.5382]),
7: tensor([0.4850]),
8: tensor([0.4669]),
9: tensor([0.5285])}

Then I could set the intervals with these values (e.g., a 0 label is given to any combination between 0.4247 and 0.4462; a 9 label is given to any combination between 0.5285 and 0.5382; etc.). Though I don’t think this would work, because these are only average combinations, and there would definitely be, for example, an image of a 9 whose combination, after being passed through the sigmoid, falls below 0.5285.

I don’t have any other ideas at the moment for how to approach classifying the inputs.

I would really appreciate any pointers, tips, and other ideas to approach this! I am kind of confused/lost.

Classifiers generally produce a probability for each class, so the output would be a 10-element tensor. Each element represents the probability of that class, so 0 = prob 0.1, 1 = prob 0.05, … 9 = prob 0.99.
Obviously it is the digit 9. You need to adjust the probabilities so one stands out more than the others. Please see page 169 for the mnist_loss function.
Regards Conwyn
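
To make the "10-element tensor of probabilities" idea concrete, here is a minimal sketch in plain Python. It assumes the model has already produced one raw score per digit; the score values below are made up for illustration. Softmax turns the scores into probabilities that sum to 1, and the prediction is simply the index of the largest one, so no intervals are needed.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw model outputs, one score per digit 0-9.
scores = [1.2, 0.3, -0.5, 0.0, 0.8, -1.0, 0.1, 0.4, -0.2, 3.5]

probs = softmax(scores)
# The predicted digit is the position of the largest probability.
predicted_digit = max(range(10), key=lambda d: probs[d])
print(predicted_digit)
```

In PyTorch the same two steps would typically be done with a softmax over the class dimension followed by an argmax.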

Thank you for the response!

How do I figure out what interval corresponds to what digit, though?

In the mnist chapter, 3s and 7s are classified, and 0.5 is arbitrarily chosen as the interval with no explanation. Any linear combination above 0.5 is a 3 and any below is a 7, without calculating any probability of sorts.

This is what I’m getting stuck over: how do we decide the intervals and the probabilities that decide which interval a digit will be assigned?

If I asked you for a weather prediction for today (snow, rain, cloudy or sunshine), you would look out of your window and conclude 1/10 for snow, 2/10 for rain, 3/10 for cloudy and 4/10 for sunshine, and you would probably say it will be sunny. Conveniently, 1/10 + 2/10 + 3/10 + 4/10 = 10/10 = 1.
So for your digit probabilities you might add them up and adjust them so they sum to 1.
You could then pick the largest value.
Regards Conwyn
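
The weather example can be written out directly. This is only a sketch of the "make them sum to 1, then pick the largest" step; the vote counts are the ones from the example, and plain division is used because they are all positive (raw model scores can be negative, which is why softmax is normally used for those instead).

```python
# Unnormalised "votes" from the weather example.
votes = {"snow": 1, "rain": 2, "cloudy": 3, "sunshine": 4}

# Divide each vote by the total so the values sum to 1.
total = sum(votes.values())
probs = {label: v / total for label, v in votes.items()}

# The prediction is the label with the largest probability.
prediction = max(probs, key=probs.get)
print(probs, prediction)
```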

Hmm, I do see what you’re getting at.

But how would I conclude, for the digits case, that the probability of an input being a 0 is, say, 2/10, the probability of the input being a 1 is 4/10, the probability of it being 2 is 1/10, etc. Any pointers in regard to this?

I read online a bit more about the sigmoid function, and gathered that perhaps the following is how this problem would be approached?

The linear combination of the input image is input to the sigmoid/softmax function. Probabilities are then output for each class, and the class label with the highest probability is assigned to the input. Let’s say that this prediction was false. Then the parameters of the linear model would be updated so that when the new linear combination of the same image is input, the softmax function will assign a lower probability to that class label, so the same class label isn’t assigned to the input image.

Is this right?

Chapter 4: let us return to the basics, ignoring the 1943 McCulloch/Pitts work on page 5 and Shannon's (1948) information theory. We would then proceed like this. Find examples of all 10 digits and, assuming black/white pictures, count the dark squares in each and record the counts in a frequency histogram, so digit X would have mean M and standard deviation S. Repeat for the other 9 digits. Now enter our single test digit. We would count its dark squares, look for the histogram among the 10 whose mean is nearest, and select that digit as our guess. Mark this as correct/incorrect and repeat for more test digits.

You could then divide each image in the initial data into four squares and build 4 * 10 histograms. Take our test digit, divide it into four and find the best matching digit for each of the four pieces. Pick the most frequent answer. If there is no clear winner, divide the 4 pieces into 16 pieces and repeat.
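
A rough sketch of the first version of this baseline (one mean dark-pixel count per digit, nearest mean wins) might look like the following. The 3x3 "images" and the helper names `ink`, `fit_means` and `predict` are made up for illustration; real MNIST images are 28x28 and would need thresholding to black/white first.

```python
from statistics import mean

def ink(image):
    """Count the dark pixels in a binary image given as a flat list of 0/1."""
    return sum(image)

def fit_means(images, labels):
    """Compute the average dark-pixel count for each label."""
    by_label = {}
    for img, lab in zip(images, labels):
        by_label.setdefault(lab, []).append(ink(img))
    return {lab: mean(counts) for lab, counts in by_label.items()}

def predict(image, means):
    """Pick the label whose mean count is closest to this image's count."""
    return min(means, key=lambda lab: abs(means[lab] - ink(image)))

# Toy training data: two thin "1"s and two fuller "8"s.
train_images = [
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
]
train_labels = [1, 1, 8, 8]

means = fit_means(train_images, train_labels)
guess = predict([0, 1, 0, 0, 1, 0, 1, 1, 1], means)
print(means, guess)
```

A single count per image throws away almost all of the shape information, which is why the description goes on to split the image into smaller pieces.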

Luckily we have the deep learning model where we assume we can have A*input +B = digit. We adjust A and B so that it gets the digit correct as much as possible. The correct / incorrect ratio influences how we adjust A and B.
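
As a minimal sketch of "adjust A and B", here is a tiny softmax classifier trained by gradient descent in plain Python. The sizes (3 classes, 2 input features), the learning rate and the helper names are assumptions chosen to keep the example small. Note that the update is driven by a loss (cross-entropy here) rather than by the correct/incorrect ratio itself, since, as the book explains in chapter 4, accuracy barely changes for small changes in the parameters and so gives no useful gradient.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(A, B, x):
    """One score per class (A[c] . x + B[c]), passed through softmax."""
    scores = [sum(a * xi for a, xi in zip(row, x)) + b for row, b in zip(A, B)]
    return softmax(scores)

def sgd_step(A, B, x, target, lr=0.5):
    """One gradient-descent step on the cross-entropy loss for one example."""
    probs = forward(A, B, x)
    for c in range(len(B)):
        # Gradient of the loss w.r.t. score c: prob - 1 for the true class,
        # prob for every other class.
        grad = probs[c] - (1.0 if c == target else 0.0)
        for j in range(len(x)):
            A[c][j] -= lr * grad * x[j]
        B[c] -= lr * grad

# Toy setup: all parameters start at zero, so every class starts at 1/3.
A = [[0.0, 0.0] for _ in range(3)]
B = [0.0, 0.0, 0.0]
x, target = [1.0, 2.0], 2

before = forward(A, B, x)[target]
for _ in range(20):
    sgd_step(A, B, x, target)
after = forward(A, B, x)[target]
print(before, after)
```

Each step raises the probability of the true class and lowers the others, which is the "parameters are updated so the wrong label gets a lower probability" behaviour asked about earlier in the thread.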

It might help (or not) but look at page 493 chapter 17 and you can see what is actually happening.
Regards Conwyn

Ohhh, I see.

This exchange has helped me gain an idea of how to approach these sorts of tasks. It never occurred to me that I could adjust the intervals/probabilities themselves by adjusting the parameters that some value, such as the number of black/white pixels in an image or the linear combination of an image, is multiplied by; or that I could even multiply an arbitrary value with a parameter to adjust the intervals.

Thank you for your input. :slightly_smiling_face:

Another possible solution would be OneHotEncoding.
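
For context, one-hot encoding is a way of representing the label rather than a classifier in itself: the digit 3 becomes a 10-element vector with a 1 in position 3 and 0 elsewhere, which can then be compared with the 10 predicted probabilities. A small sketch, with a made-up prediction and a hypothetical `one_hot` helper:

```python
import math

def one_hot(label, num_classes=10):
    """Return a vector with 1.0 at the label's index and 0.0 elsewhere."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

target = one_hot(3)

# Hypothetical predicted probabilities for the same image (they sum to 1).
probs = [0.02, 0.03, 0.05, 0.70, 0.02, 0.03, 0.05, 0.04, 0.03, 0.03]

# Cross-entropy loss: only the true class's probability contributes.
loss = -sum(t * math.log(p) for t, p in zip(target, probs))
print(target, loss)
```

The loss shrinks towards 0 as the probability assigned to the true digit approaches 1.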