Wiki: Lesson 1

An issue I had when building my own sun-vs-moon dataset to try the classifier was that the learn=... line would fail with a deep stack trace that eventually boiled down to "Can’t find im.shape because im is None". I traced it back and discovered that some of my images weren’t being read correctly, so None was being passed instead of an array all the way through. In my case I’d used a Google Images downloader and hadn’t reviewed the data. There was an .svg file it couldn’t read, and even trickier than that, a file with a .jpg extension that was actually HTML! It wasn’t obvious which files it was having an issue with, so if anyone else has this issue, here’s how to find them:

Edit the fastai/fastai/dataset.py file: Find this line: https://github.com/fastai/fastai/blob/e8433d4a76eaf40c29a74a18ee043b23f6c35dbe/fastai/dataset.py#L145
def get_x(self, i): return open_image(os.path.join(self.path, self.fnames[i]))

If that method returns None, which it will if it can’t open your file as an image, then you’ll eventually get an error when it tries to read its shape. Change it to look like this:

def get_x(self, i):
    fname = os.path.join(self.path, self.fnames[i])
    im = open_image(fname)
    if im is None:
        # open_image returns None when the file can't be decoded as an image
        raise OSError('Failed to read as image: {}'.format(fname))
    return im

Now you’ll get an immediate exception telling you exactly which file couldn’t be read.

I’ll make a similar PR to fastai tomorrow with a sensible exception, but for now I thought this might help anyone who gets stuck in the same place! 🙂

I’ve made a PR to add better error handling to the open_image call: https://github.com/fastai/fastai/pull/127

5 Likes

Many thanks for the thoughtful analysis and PR. I’ve merged it, since it’s certainly better than what we have now - but I think better still would be to skip over failing files, printing out their names, instead of raising an error. That way a single bad file doesn’t break the whole process.

1 Like

Have you solved this? I have the same issue and I don’t really understand the error.

This looks like you’re asking numpy to take 4 random items without replacement from a list of <4 items. If I had a bag of 3 marbles of different colours and asked you to take 4 without replacement, you can’t do it. If I ask you to take 4, one at a time, putting your marble back in the bag after each one, then you can. I’m not sure why you’ve changed this though? For me when I had <4 items in one of the categories it just showed all of them without repeats, what are you trying to do?

Here’s a notebook showing what’s happening: https://notebooks.azure.com/Callum-m/libraries/fastai (np-random-choice-test.ipynb)
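For anyone who wants to see the marble analogy directly, here’s a minimal np.random.choice demo (the three-colour list is just a made-up example):

import numpy as np

items = ['red', 'green', 'blue']  # a bag of 3 marbles

# Without replacement: drawing 4 from 3 is impossible, so numpy raises ValueError
try:
    np.random.choice(items, 4, replace=False)
except ValueError as e:
    print(e)  # "Cannot take a larger sample than population when 'replace=False'"

# With replacement: each marble goes back in the bag, so 4 draws work fine
print(np.random.choice(items, 4, replace=True))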

Can we use AWS for the 2018 version instead of Paperspace?

Yes, AWS setup instructions are here.

1 Like

Just to be clear, the CNN does not give the probability of the image being a dog. Rather it shows the relative probability of it being a dog vs. being a cat, within those two categories.

I suspect that at some hidden layer, the CNN “knows” that the image of the PetSafe logo is neither a dog nor a cat.

Before studying the lesson details, it seemed reasonable to assume that the CNN would return independent probabilities for dog and cat. For example, if you input a picture of a spruce tree, it would return low probabilities for both dog and cat. But of course the structure of the CNN and the loss function do not allow this. Softmax in the final layer by design returns two probabilities that sum to one, no matter how small its input activations.

This brings up the question of whether we are throwing away information that would be useful in the real-world applications I can imagine. What if we instead mapped the two activations to some reasonable probability measure, for example one based on the means and variances of all the dog and cat activations? Then you could assess dog, cat, neither, or both. I would try to code this myself, but it’s beyond me at Lesson 1 while still learning Python.

With such a change, you might see a far more meaningful interpretation for those improperly categorized images. The PetSafe logo might show low probabilities for both dog and cat; the miscategorized image of both a dog and a cat might show high probabilities for both. Wouldn’t you want a self-driving car to assess that the costumed kid that looks most like a green light is rather something never recognized before and not run it over?
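For what it’s worth, one standard way to get independent per-class probabilities is to replace the final softmax with per-class sigmoids (the multi-label setup, which would also mean training with binary cross-entropy rather than cross-entropy loss). A tiny sketch with made-up activations:

import torch
import torch.nn.functional as F

# Hypothetical final-layer activations (logits) for [cat, dog],
# e.g. the PetSafe logo: weak evidence for both classes
logits = torch.tensor([-3.2, -2.9])

# Softmax forces the outputs to sum to 1, so weak evidence still looks like a coin flip
print(F.softmax(logits, dim=0))   # tensor([0.4256, 0.5744])

# Independent sigmoids keep each probability separate, so
# "neither" (both low) and "both" (both high) become expressible
print(torch.sigmoid(logits))      # tensor([0.0392, 0.0522])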

2 Likes

Hi! I just started with the videos last week and now I’m testing the Lesson 1 notebook. I’ve seen that the accuracy and confusion matrix results come from the validation set. If I want to test my model against a labeled test set (I created one with the same structure as the train and val datasets), because I want to compute the confusion matrix and accuracy using images that were not used in training, how can I do this?

I started loading my test set:
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), test_name='test')

but I’m getting this error:


FileNotFoundError                         Traceback (most recent call last)
<ipython-input-...> in <module>()
      1 arch=resnet34
----> 2 data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), test_name='test')
      3 learn = ConvLearner.pretrained(arch, data, precompute=True)
      4 learn.fit(0.01, 5)

~/aacevedo/fastai/courses/dl1/fastai/dataset.py in from_paths(cls, path, bs, tfms, trn_name, val_name, test_name, num_workers)
    340         """
    341         trn,val = [folder_source(path, o) for o in (trn_name, val_name)]
--> 342         test_fnames = read_dir(path, test_name) if test_name else None
    343         datasets = cls.get_ds(FilesIndexArrayDataset, trn, val, tfms, path=path, test=test_fnames)
    344         return cls(path, datasets, bs, num_workers, classes=trn[2])

~/aacevedo/fastai/courses/dl1/fastai/dataset.py in read_dir(path, folder)
     36         return [os.path.relpath(f,path) for f in fnames]
     37     else:
---> 38         raise FileNotFoundError("{} folder doesn't exist or is empty".format(folder))
     39 
     40 def read_dirs(path, folder):

FileNotFoundError: test folder doesn't exist or is empty

I think this is because the test directory has the same structure as the training and val directories.
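Judging from the traceback, read_dir seems to glob for files directly inside the test folder, so a test folder that only contains class subdirectories looks empty to it. A rough reconstruction of the relevant logic (not the exact fastai source):

from glob import glob
import os

def read_dir(path, folder):
    # look for files (not directories) directly under path/folder
    fnames = glob(os.path.join(path, folder, '*.*'))
    if any(fnames):
        return [os.path.relpath(f, path) for f in fnames]
    else:
        # with test/ holding only class1dir/, class2dir/, ... the glob
        # matches nothing, so this raises even though images exist one level deeper
        raise FileNotFoundError("{} folder doesn't exist or is empty".format(folder))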

How can I do this? I’ve done it before with Keras but I don’t know how to do it using PyTorch and the fastai API.

Thanks! And sorry in advance if this topic has already been solved here and I didn’t find it.

@aacevedo

The problem is with your directory structure;
it seems that you haven’t created a test set.

A test set holds the images that haven’t been seen by the model in either training or validation…

@ecdrid
My directory structure is:
PATH/
    train/
        class1dir/
        class2dir/
        class3dir/
    val/
        class1dir/
        class2dir/
        class3dir/
    test/
        class1dir/
        class2dir/
        class3dir/

I’ve already verified that PATH is correct and that there are images in the test set that were not used in the training or validation datasets. In fact, the script runs fine predicting on the validation set.

Thanks for your help.

@aacevedo
Why separate the test set into classes?
We want our model to predict the classes for us, don’t we?

@ecdrid
Yes, of course, but at the end of the prediction I want to know immediately which images were or weren’t correctly classified, since I’m not sending these results to a Kaggle competition (I’m not using a Kaggle dataset).

In such a case we’d need to write something that maps the files to their actual classes in a CSV file, for us to refer to after the predictions are made?

Or let’s play it smart: move all your test images to the validation set, move the existing validation images to train, and make the test folder hold a single image of each class?

Or we can use the model to predict on single images…?

That isn’t correct though…

I think this could be a solution, but not the ideal one.

I don’t think there’s a minimum number of images required in the test set, so I could test with one image per class, but that wouldn’t really be a solution either.

I have been checking the ImageClassifierData class and I don’t see a way to do what I need, which is quite odd because this is possible in Keras.

It’s possible here too, but we need to hand-code it (not sure whether it’s already there).
Something like checking all the files in a particular class’s directory against the class our model predicted for those images… (using os, glob, etc.). A rough sketch is below.
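Something along these lines, perhaps — a sketch assuming the fastai 0.7-era names from the lesson notebook (learn, data, PATH) and that each test file’s true class is its parent directory:

import os
import numpy as np

log_preds = learn.predict(is_test=True)      # log-probabilities, shape (n_images, n_classes)
pred_idx = np.argmax(log_preds, axis=1)      # predicted class index per image
pred_classes = [data.classes[i] for i in pred_idx]

# ground truth recovered from the directory name,
# e.g. 'test/class1dir/img001.jpg' -> 'class1dir'
true_classes = [os.path.basename(os.path.dirname(f)) for f in data.test_ds.fnames]

correct = sum(p == t for p, t in zip(pred_classes, true_classes))
print('Test accuracy: {:.3f}'.format(correct / len(true_classes)))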

Yes, I think it’s required to hard-code it, unless it’s possible to do this with arrays or a CSV file…

Thanks for replying.

Following this discussion with interest, because I too want to put funky images in the test folder and see how the model performs, and I realized that the current workflow does not probe the test dataset.

Found a few threads discussing similar issues, which look like they solve part of the problem [links not preserved].

Here’s my status:
I am able to pass in the test data folder as [code screenshot not preserved]

and direct the prediction to the test folder by [code screenshot not preserved]


236 is the number of images in my test folder. What is the ‘2’?

I am able to look at the predictions by doing: [code screenshot not preserved]

After this, how do I make it show the image and the predicted class?
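One possible way (again assuming the lesson-notebook names learn, data, PATH, and preds from np.argmax as above) would be something like:

import os
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

log_preds = learn.predict(is_test=True)
preds = np.argmax(log_preds, axis=1)

i = 0                                        # index of the test image to show
fname = data.test_ds.fnames[i]
plt.imshow(Image.open(os.path.join(PATH, fname)))
plt.title(data.classes[preds[i]])            # predicted class as the title
plt.axis('off')
plt.show()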

This might make it a bit easier if we want to do it ourselves…

Thanks! This helped me a lot. I used your code as if I were sending a submission, and at least it maps the name of each file to the assigned label. With this I can now check whether testing is doing fine; I just have to code a few more things to make it more ‘automatic’.