Np man. It can get quite confusing indeed.
If your output is of shape 118, 2 that means you have 118 images and 2 classes.
The 2 outputs per image are the probabilities of the 2 classes. Softmax (or sigmoid) has already been applied to the output of the last NN layer.
You can check if this is the case. They should sum up to 1, image-wise. Do they?
E.g. by definition preds should sum up to 118, given each row sums up to 1.
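A minimal sketch of that check, assuming preds is a PyTorch tensor of shape (118, 2). The random tensor below is only a stand-in for real model output, not something from the original notebook.

```python
import torch

# Stand-in for real model output: 118 images, 2 classes (assumed shape).
torch.manual_seed(0)
preds = torch.softmax(torch.randn(118, 2), dim=1)

# Each row should sum to 1 if softmax has already been applied.
row_sums = preds.sum(dim=1)
rows_ok = torch.allclose(row_sums, torch.ones(118), atol=1e-5)

# Consequently the whole tensor should sum to the number of images.
total = preds.sum().item()
print(rows_ok, round(total, 3))
```

If rows_ok comes out False on your real preds, the values are probably raw activations rather than probabilities.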
Having said that, what is the actual prediction of the model?
Well, that’s the class with the highest probability per-image.
So, if for image1 your preds are (0.91, 0.09) then class 0 is the prediction.
To do that programmatically, you need to apply argmax to the preds row-wise, e.g. preds.argmax(dim=0).
The output will be a tensor of integers (either 0 or 1, the predicted classes) of shape (118,), one entry per image.
Good, now you have predictions.
How do you measure accuracy?
To do that, you need the ground truths.
Ideally you’ll have somewhere the labels for the test set.
Either an encoded tensor of integers (0 or 1) of shape (118,) or a list of length 118 holding the actual labels as strings, e.g. [cat, dog, dog, dog, cat, …].
If it is the latter you need to encode them into integers.
But how do you know which class is 0 or 1? learn.dls.vocab to the rescue.
You’ll get something like ['cat', 'dog'], which is telling you that cat=0 and dog=1.
Let’s call the tensor of encoded ground truths gts.
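The encoding step could look roughly like this. The vocab and labels lists are made-up examples; in a real session the vocab would come from learn.dls.vocab and the labels from wherever your test set annotations live.

```python
# Hypothetical values: in fastai the vocab would come from learn.dls.vocab.
vocab = ['cat', 'dog']
labels = ['cat', 'dog', 'dog', 'dog', 'cat']

# Map each class name to its position in the vocab.
label_to_idx = {name: i for i, name in enumerate(vocab)}

# Encode every string label as its integer class index.
gts = [label_to_idx[name] for name in labels]
print(gts)  # [0, 1, 1, 1, 0]
```

Wrapping the result in torch.tensor(gts) gives a tensor you can compare against the predicted classes.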
Now the last step is: accuracy = (gts == preds.argmax(dim=0)).average()
Something like that.
I didn’t check the syntax and I am sure it is wrong but you hopefully get the point.
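For reference, here is a small version that does run, using made-up probabilities for 4 images instead of 118. Note two assumptions: the argmax is taken over dim=1 (across the classes, so you get one class index per image), and gts is a 1-D tensor of the same length as the predictions.

```python
import torch

# Made-up probabilities for 4 images and 2 classes.
preds = torch.tensor([[0.91, 0.09],
                      [0.20, 0.80],
                      [0.35, 0.65],
                      [0.70, 0.30]])

# Made-up encoded ground truths, one integer per image.
gts = torch.tensor([0, 1, 0, 0])

# dim=1 reduces across the classes, giving one predicted class per image.
pred_classes = preds.argmax(dim=1)

# Compare element-wise, turn the booleans into floats, and average them.
accuracy = (pred_classes == gts).float().mean().item()
print(pred_classes.tolist(), accuracy)
```

Three of the four made-up predictions match the ground truths here, so the accuracy is 0.75.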
So, the parameter valid_pct=0.2 means that fastai will hold out 20% of the input data and not use it for training. So in essence, this means that fastai assumes that the dataset we pass it, will be the full dataset every time.
What if we already have separate datasets for training and testing? In that case we would want to train on the full set we pass in (the training set), and fastai should not default valid_pct to 0.2. How do we get around this?
Try this same approach on the validation set.
You should get the accuracy number you got on the last epoch of training.
E.g. run learn.validate, then replicate the above procedure to the validation set and check if you get the same number.
Each time I did that and the numbers didn’t match, it meant I had accidentally shuffled the images, so the ground truths and predictions didn’t align anymore.
In here I was working with a df_fold pandas dataframe where the column named which contained the training/validation split. I had created the split myself, as a random split was not good in that case.
As you can see it gets as flexible as you want.
Thank you. I think the only thing that I had to meddle with a bit was the dim=0 argument passed into argmax, since that should be dim=1; otherwise we don’t get one prediction per image. I’m writing up how I worked this out in a blog post, though I think and hope I’ll get some more practice manipulating tensors in upcoming lessons, since it felt a bit counterintuitive to get my head round how that worked. (I get ~92% accuracy on my held-out test set, which is pretty nice to see. I guess my cat is unique after all.)
It’s the same for us all. More to the point, a few years ago I read a post on Terence Tao’s blog in which he stated more or less the same thing. So, you are in good company.
Now for my own dataset, I used images of zucchini and cucumber to train a classification model. It predicted the class correctly.
I then took another dataset, alligator vs crocodile, to train a classification model. I downloaded a dataset of alligator and crocodile images. When training the model and printing the dataloaders (dls), I am getting different images.
In the sidebar of kaggle.com it shows my GPU usage. It’s not clear to me whether I’ve used:
(a) 2 minutes 12 seconds, or
(b) 2 hours 12 minutes.
I failed to find the answer in 10 minutes of googling, so I’m asking.
If the answer is (b), then I’d feed back to the Kaggle team that “30 hrs” should instead be written as “30:00 hrs”. Then I would not have needed to ask.
Now the code is working fine. However, I am not sure what is happening, because I am getting accuracy between 57% and 64% (in different runs).
Also I have made the kernel public now.
Maybe things like this (from your notebook) are why it’s finding it hard to train. Also I wonder whether crocodile / alligator is a case where people upload or publish pictures and label them as ‘crocodile’ when in reality they show an alligator (and vice versa), i.e. the problem is really in the data. It’s a nice example of how problems in your ground truth data can cause downstream issues.
As far as I can see, you would either choose an example where it is less likely that randomly downloaded images will be wrong (i.e. like cat vs dog etc), or you find a dataset where you are sure that the labels are correct. Perhaps there was some scientist online who studies crocodiles and you can be sure that those images are really crocodiles. I wouldn’t know where to go to find those images, however…
Image search is a very interesting but often deeply flawed source of images. I tried searching for things like “woman in a blue t-shirt” and 60% of the results lack either a t-shirt or a woman, or the t-shirts are the wrong color. Going deeper into these results clearly shows that, despite the fact that Google has the best computer vision models, the image search results are still mostly based on associating the images with the surrounding text on a webpage.
The interesting thing is that all the biggest models nowadays (CLIP, DALL·E, etc.) are trained on images and text scraped from the web but seem to work despite having 50% (my guess) noise in the ground truth.
TL;DR: Always take a long and careful look at the training data you are using.