Lesson 4 - Official Topic

I believe this is the implementation of the ReLU activation function.

I assume you’re talking about the line corrects = (preds>0.0).float() == train_y.
Yes, I think the answer is roughly what you’ve guessed.
With the initially random parameters of the linear1 model, one would expect each of the two outputs of the binary classifier to be equally likely. If you checked against a positive or negative number instead, one output would be more likely than the other at the start, which wouldn’t make sense. Therefore 0 makes the most sense to compare with. That’s more or less how I would explain it to myself, I guess.

I doubt that this has anything to do with the ReLU activation function, since this is the linear1 model, which is xb@weights + bias. At this point in the notebook we still haven’t introduced any activation function. But I’m no expert :slight_smile:
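To see the “both outputs equally likely at initialization” point concretely, here is a minimal sketch (not from the notebook; the batch is just random stand-in data with the same shape as the flattened 28×28 images):

    import torch

    # Parameter init as in the notebook: random, zero-mean
    def init_params(size, std=1.0): return (torch.randn(size) * std).requires_grad_()

    weights = init_params((28*28, 1))
    bias = init_params(1)

    def linear1(xb): return xb @ weights + bias

    xb = torch.randn(256, 28*28)                 # stand-in for a normalized batch of images
    preds = linear1(xb)
    print(preds.mean().item())                   # close to 0
    print((preds > 0.0).float().mean().item())   # roughly 0.5, i.e. half "3", half "7"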

2 Likes

Tom, the course work is missing again today. Thankfully I saved my project in the storage folder so I have access to that. But all the lessons and book chapters are now missing again as of this morning. My Jupyter notebook is “nix4urqz”

Hi Jason. That is surprising. I’ve recovered your files for you so that you may continue working. Meanwhile, the team and I will debug this specific case further and update you with our progress. Again, please don’t hesitate to reach out should you need further assistance.

1 Like

Yes, it’s all back. Thank you.

The threshold value and the initialization have (almost) nothing to do
with one another (see clarification below).

You can choose any threshold value thresh (0.0 as in the notebook,
but you can pick -1. or 200.):

  • pred greater than thresh classifies the sample as a 3;
  • pred less than or equal to thresh classifies the sample as a 7.

(or maybe it is the other way around).

Now about initialization.

The data is normalized (with zero mean) and the parameters at
initialization also have zero mean. Because of its architecture, the
output of the model will initially have zero mean as well, with maybe half
>0.0 and half <=0.0.

If thresh is set to 0.0, then the model will adjust its parameters
so that (hopefully) most of the time it will output a value > 0.0 for
a 3 and <=0.0 for a 7.

If you pick thresh far away from 0.0, then (heuristically
speaking), the model will first need to “drag” the values of its
parameters so that its output has mean equal to thresh, and then it
will adjust its parameters so that (hopefully) most of the time it
will output a value > thresh for a 3 and <= thresh for a 7.
But the end result will be the same. (Again, this is a heuristic; there aren’t necessarily two very distinct phases, but hopefully it clarifies the idea.)

In short: you can pick any thresh you want, but it is more efficient
to pick 0.0, for training purposes.
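A small illustration of the efficiency point, using a hypothetical helper accuracy_with_thresh (not in the notebook) and random stand-in data:

    import torch

    # Any fixed thresh defines a valid classifier, but at initialization only a
    # thresh near 0.0 splits the (roughly zero-mean) outputs evenly.
    def accuracy_with_thresh(preds, targets, thresh=0.0):
        return ((preds > thresh).float() == targets).float().mean()

    preds = torch.randn(1000)                      # stand-in for linear1's initial outputs
    targets = torch.randint(0, 2, (1000,)).float()
    for t in (0.0, -1.0, 200.0):
        print(t, accuracy_with_thresh(preds, targets, t).item())
    # With thresh=200.0 everything is predicted as a "7" at the start; training
    # would first have to drag the outputs up toward 200 before it could
    # separate the two classes.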

3 Likes

I agree with the last comment about ReLU.

I address your first remark in this answer.

Thanks. I believe my comment is indeed consistent with your answer.

If we want to recognize numbers with more than one digit, alphanumeric strings, or English words, how do we go about it?

Do we have to train on images of those words, or is there a pre-trained model that can be used?

How about longer numbers like 1213123123, or alphanumeric strings like ANHRDS1021820?

Regards
Ganesh Bhat

I think what you are looking for is real OCR, which is, as far as I understand, a mix of object detection (finding the regions in an image that contain text) and character recognition (like our MNIST example). You’ll find more information here:

Videos from the NLP course about Regex

video 6 (Rachel)

video 7 (Rachel)

video 9 (Jeremy)

Notebook: https://github.com/fastai/course-nlp/blob/master/4-regex.ipynb

1 Like

Thanks @florianl.

I was trying to take the MNIST example to the next level, just to understand whether digits 0 to 9 and letters A to Z or a to z can be recognized in a similar manner when they form words in images. Once an object is detected (say a text block), can we use the above logic to do it? Is it less efficient than OCR, or are there other challenges?

1 Like

I was reading the statement below, where the augmentations applied to the train and validation sets are different. Can someone please help me understand what is different?

Random crop and augment: This is in batch_tfms, so it’s applied to a batch all at once on the GPU, which means it’s fast. On the validation set, only the resize to the final size needed for the model is done here. On the training set, the random crop and any other augmentation is done first.

    pets = DataBlock(blocks=(ImageBlock, CategoryBlock),
                     get_items=get_image_files,
                     splitter=RandomSplitter(seed=42),
                     get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
                     item_tfms=Resize(460),
                     batch_tfms=aug_transforms(size=224, min_scale=0.75))
    dls = pets.dataloaders(path/"images")

@ganesh.bhat to understand how it’s applied differently to train vs valid you should look at each transform’s source code. Each one contains a split_idx. If it’s 0 it’s applied to the training set; if it’s 1 it’s applied to the validation set. (And if there’s none it’s applied to both I think? @sgugger is that right)
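As a rough sketch of that mechanism (an assumed example, not from the lesson; check the fastai source for the authoritative behaviour), a custom transform restricted to the training set could look like this:

    from fastai.vision.all import *

    class TrainOnlyNoise(Transform):
        # split_idx = 0 -> applied to the training set only,
        # split_idx = 1 -> validation set only, None (default) -> both
        split_idx = 0
        def encodes(self, x: TensorImage):
            x = x.float()
            return x + 0.05 * torch.randn_like(x)

    # Added to batch_tfms in the DataBlock above, this would only touch training
    # batches; validation batches would pass through unchanged.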

Yes, that’s right.

1 Like

I understand the split_idx and the train/valid mapping thanks to the session by @arora_aman.

Let me rephrase my question: on the validation set only a resize to the final size (i.e. 224) is applied, whereas on the training set all the augmentations are applied. Is my understanding correct? If yes, why is it so?

Do we do it because we treat it like a test set and try to predict?

If I was trying to look at the distribution of the dataset, what would be the easiest way to do it?
I couldn’t find any method to show the occurrence of each class in a dataset…

You can take the output of the regex function, which has the list of all the classes, and convert it into a pandas DataFrame. value_counts will give you the occurrences.

Also, dls.train.vocab or dls.valid.vocab has the occurrence, if I am not wrong.
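For a concrete sketch of the pandas suggestion (assuming path points at the pets images as in the DataBlock above; dls.vocab itself only holds the class names, as far as I know, not their counts):

    import pandas as pd
    from fastai.vision.all import *

    files = get_image_files(path/"images")
    labels = [RegexLabeller(r'(.+)_\d+.jpg$')(f.name) for f in files]
    print(pd.Series(labels).value_counts())   # number of images per class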

In the case that we pick a different value for the threshold, do we need to adjust our value in the below function too (the accuracy test against the sigmoid of the predictions)? So if we set our thresh to say 1.0, then does it follow that:

correct = (preds>0.7311) == yb for example, where 0.7311 is sigmoid(1).

def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds>0.5) == yb
    return correct.float().mean()
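For what it’s worth, the premise checks out numerically: because sigmoid is monotonic, thresholding the raw output at 1.0 is equivalent to thresholding its sigmoid at sigmoid(1.0) ≈ 0.7311 (a quick check, not from the notebook):

    import torch

    thresh = torch.tensor(1.0)
    x = torch.linspace(-5, 5, 101)   # stand-in for raw model outputs
    print(torch.sigmoid(thresh))     # ~0.7311
    print(torch.equal(x > thresh, torch.sigmoid(x) > torch.sigmoid(thresh)))   # True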

There are in fact two thresholds, and my answer in the discussion did not help to dispel the ambiguity. I will answer your question in two points.

  1. The threshold that I mentioned in my answer refers to the first training approach described in the lesson, somewhere between
    def linear1(xb): return xb@weights + bias
    
    and the chapter Sigmoid. In this approach (which is not implemented, by the way, as it is not practical, as Jeremy explains), the output of linear1 is compared against a threshold to predict a category for the sample, then these predictions are compared against the targets to calculate the accuracy of the model, and the idea is to modify the model in order to improve the accuracy. For emphasis, the sigmoid function does not enter the picture at this point, and the threshold plays no role in training the model (again, in this hypothetical approach).
  2. The second point is that the batch_accuracy which you refer to
    concerns the second training approach. In this approach, the accuracy plays no role in the training of the model, but only in its evaluation. In turn, the threshold (here set to 0.5) entering the definition of batch_accuracy only affects the accuracy, and does not enter the training of the model.

In principle this should answer your question. On the other hand I think it might not be a waste of time to revisit the narrative of the lesson.

I am looking at notebook 04_mnist_basics.ipynb and will paraphrase quite a bit of it.

As a machine learning practitioner, you are given a sample and asked to determine whether it is a “7” or a “3”. For your model, you pick weights and bias at random and calculate the output pred of your model for your sample x using the function linear1. If pred > 0.0, then you declare that the sample is a “3”, and otherwise a “7”. Of course, there is no reason that this first iteration would give a good accuracy (which compares the predictions preds of all samples against the targets, i.e. their known labels). The idea is to find a better choice for weights and bias so that the predictions of the samples with this new model will give a better accuracy. For emphasis, the predictions must again be compared with the same threshold value of 0.0. The strategy is that iterating this process for sufficiently long will produce a model with optimal accuracy.
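Compressed into code, that first approach looks roughly like this (a sketch with random stand-in data, not a copy of the notebook):

    import torch

    train_x = torch.randn(1000, 28*28)                  # stand-in for flattened images
    train_y = torch.randint(0, 2, (1000, 1)).float()    # "7" encoded as 0, "3" as 1

    weights = torch.randn(28*28, 1)
    bias = torch.randn(1)
    def linear1(xb): return xb @ weights + bias

    preds = linear1(train_x)
    corrects = (preds > 0.0).float() == train_y         # compare against thresh = 0.0
    print(corrects.float().mean().item())               # accuracy of the random model
    # The missing step: a procedure that picks better weights and bias so that
    # this number increases.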

In this process, one could have chosen a different value for the threshold thresh, declaring that a sample is a “3” if pred > thresh, whatever thresh is, as long as this value is fixed along the iterative process. Therefore, I won’t specify this threshold in what follows; it just needs to be fixed throughout training.

Note that the output pred of linear1 is not quite the predicted category; the predicted category is rather (pred > thresh).float(). That is, a “7” is encoded with a 0 and a “3” with a 1.

To summarize, the first training algorithm searches for weights and bias such that the function linear1 maps (hopefully) most samples that are a “3” to values pred > thresh and the rest to values pred <= thresh.

This thresh is the threshold that I was referring to in my initial comment to manavk’s question. Note that at this point there is no sigmoid function.

The problem with the training algorithm described above is that there is no obvious procedure to determine how to change weights and bias in order to improve accuracy. Indeed, since targets as well as (preds > thresh).float() are either 0. or 1., except in very special circumstances, wiggling weights and bias will not produce any change in the predictions, and therefore in the accuracy. We need a new approach.
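A tiny demonstration of this point (the numbers are arbitrary, just for illustration):

    import torch

    preds = torch.tensor([0.3, -1.2, 2.5], requires_grad=True)
    targets = torch.tensor([1., 0., 0.])

    acc = ((preds > 0.0).float() == targets).float().mean()
    print(acc)   # tensor(0.6667)
    # The hard comparison makes acc piecewise constant in preds: small changes to
    # preds (and hence to weights and bias) leave it unchanged, so its gradient is
    # zero almost everywhere; in fact the comparison breaks the autograd graph,
    # so acc.backward() would raise an error here.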

The second training approach brings two important changes. First, instead of computing our prediction in the form of (pred > thresh).float(), giving either 0. or 1. and telling us whether the sample is a “7” or a “3” respectively, we are going to calculate (a number which can be interpreted as) our level of confidence that the sample is a “3”. This number is calculated as sigmoid(pred), where pred is, as before, the output of the function linear1. Here, sigmoid(pred) is a number between 0. and 1. Because the sigmoid function varies continuously (as opposed to jumping from 0. to 1.), a small change in weights and bias will produce a small change in sigmoid(pred).

But this number sigmoid(pred) does not answer the question “is the sample a “3” or a “7”?” (This, I would say, really is our prediction.) I will intentionally leave this point for later, to clearly emphasize that we don’t need to make predictions (in the form of either 0. or 1.) in order to train the model.

The other important change in the second training approach is that accuracy itself is not used in the training of the model. It is only used to evaluate the model.

Instead, we are going to calculate an error between our output sigmoid(pred) and the target. This is in essence the loss function mnist_loss. The precise definition is not too important for our discussion; the point is that there is no threshold involved.
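For reference, mnist_loss is roughly the following (paraphrased from memory of 04_mnist_basics.ipynb; see the notebook for the exact cell):

    import torch

    def mnist_loss(predictions, targets):
        predictions = predictions.sigmoid()
        # distance between the confidence sigmoid(pred) and the 0/1 target:
        # no threshold anywhere, and it varies smoothly with weights and bias
        return torch.where(targets == 1, 1 - predictions, predictions).mean()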

The training in this second approach is essentially contained in the functions calc_grad and train_epoch. Again, no threshold there.

Finally, a prediction needs to be either 0. or 1., which we obtain by comparing sigmoid(pred) against another threshold, say thresh2. In other words, our prediction is (sigmoid(pred) > thresh2).float(). This threshold is chosen to be .5 in batch_accuracy, but it plays a role completely different from thresh in the first training approach. Changing it to a value other than .5 will only change the accuracy of the model, but, to say it one last time, this threshold is not used in training the model.

5 Likes