How to use your own Dataset for Lesson 1

fortydegrees · March 29, 2018, 6:23pm

I am a complete novice with machine learning and have just finished using my own dataset for the first half of lesson 1.
I couldn’t find any step-by-step instructions, and considering it’s the first bit of self-learning required, thought it would be useful if I posted what I did.
Any feedback on the method (inaccuracies made by me) would be appreciated.
The uploading steps (SFTP onwards) only apply if you are using Paperspace.

These resources were helpful to me:

Q3 and Q4 from Reshama’s FAQ in working out how to structure the data (found from this thread)
Beecoder’s Cricket or Baseball example

Either way, I hope this is helpful to a beginner in a similar situation.

Collect at least 15 images for each category
Create a folder locally. I used ‘cowhorse’.
Inside this folder, create two subfolders: ‘train’ and ‘valid’
Inside both of these subfolders (train and valid), create another two subfolders for your two categories. In my case ‘cow’ and ‘horse’.
Put around 80% of your images in the ‘train’ subfolders, and the rest in the ‘valid’ subfolders.
Your finished structure should look like this:

[Screenshot of the finished folder structure]
Zip up this folder
SFTP into your Paperspace box
Navigate to /fastai/courses/dl1/data/ and upload your .zip file here
Unzip the file
In your jupyter notebook, change the PATH to the directory you just uploaded
Due to the low sample size, you may want to change the learning rate of the model (I changed mine to 0.2)
Run through all the commands and you should see the model using your new dataset:

[Screenshot of the notebook output using the new dataset]
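
For anyone who would rather not sort the files by hand, the folder steps above can be sketched in a few lines of Python. This is only a minimal sketch, not part of the lesson notebook: the function name `split_dataset`, the 20% validation fraction and the list of image suffixes are my own assumptions. It expects one subfolder per category in the source folder and copies each image into exactly one of train or valid.

```python
import random
import shutil
from pathlib import Path

# Assumption: these suffixes cover the images you collected.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def split_dataset(src, dst, valid_frac=0.2, seed=42):
    """Copy src/<class>/* into dst/train/<class> and dst/valid/<class>."""
    src, dst = Path(src), Path(dst)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    counts = {}
    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        # Keep only image files; this also skips files such as .DS_Store.
        images = sorted(f for f in class_dir.iterdir()
                        if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES)
        rng.shuffle(images)
        n_valid = max(1, round(len(images) * valid_frac))
        splits = {"valid": images[:n_valid], "train": images[n_valid:]}
        for split, files in splits.items():
            out_dir = dst / split / class_dir.name
            out_dir.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, out_dir / f.name)
        # (number of training images, number of validation images)
        counts[class_dir.name] = (len(splits["train"]), len(splits["valid"]))
    return counts
```

With a hypothetical source folder `raw_images` containing `cow` and `horse` subfolders, `split_dataset("raw_images", "cowhorse")` would build the `cowhorse/train` and `cowhorse/valid` layout and return the per-class counts.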

fortydegrees · March 29, 2018, 6:26pm

As an aside, I’m wondering if someone could shed some light on an error I’m getting when it tries to return incorrect labels:

ValueError: Cannot take a larger sample than population when 'replace=False'

Is this simply because there are no incorrectly labeled images? I can’t imagine this to be the case as it doesn’t seem like throwing an error is the way to report this.

ramesh · March 29, 2018, 6:34pm

Your validation set has only 10 images. rand_by_correct tries to return 4 random images that are correctly or incorrectly classified. Because you don’t have 4 incorrectly classified images, it raises an error. Unfortunately, the 4 is hardcoded. There’s also ImageModelResults, which we should probably be using for this task. I am going to take a closer look and get back to you on the best way to fix it.

You could also just change the rand_by_mask function:

# Original: always samples 4 indices, so it fails when fewer than 4 match the mask
def rand_by_mask(mask): return np.random.choice(np.where(mask)[0], 4, replace=False)
# Change to: sample at most as many indices as are available
def rand_by_mask(mask):
    mask_idxs = np.where(mask)[0]
    cnt = min(4, len(mask_idxs))
    return np.random.choice(mask_idxs, cnt, replace=False)
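
To see the difference outside the notebook, here is a small standalone sketch. The 10-element mask is made up to mimic a validation set with only two matching images, and the `n` parameter is my own addition for illustration (the function above hardcodes 4).

```python
import numpy as np

def rand_by_mask(mask, n=4):
    # Sample at most n of the indices where mask is True.
    idxs = np.where(mask)[0]
    return np.random.choice(idxs, min(n, len(idxs)), replace=False)

# Hypothetical mask: only images 1 and 4 match (e.g. the misclassified ones).
mask = np.array([False, True, False, False, True,
                 False, False, False, False, False])

try:
    # The original behaviour: ask for 4 samples from a population of 2.
    np.random.choice(np.where(mask)[0], 4, replace=False)
except ValueError as err:
    print("original version fails:", err)

picked = rand_by_mask(mask)
print(sorted(picked.tolist()))  # [1, 4]
```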

beecoder · March 29, 2018, 6:51pm

Great going, this is pretty much what I did. Now that you are familiar with data collection for your own project, it would be good to explore how the Conv learner works and the effect of different augmentations, learning rates, etc.

jeremy · April 2, 2018, 4:31pm

I believe a recent PR fixed this. Try doing git pull and see if it works then.

AbhigyanBose · June 21, 2018, 8:24am

Is there any way to do this for multiple classes, for example cow, horse and sheep? I searched the forums but couldn’t find anything like that.

sayko · June 21, 2018, 8:36am

This is covered later in the course.
Try searching for “Dog Breed”.

AbhigyanBose · June 21, 2018, 8:59am

I have gone through it, but it works with CSV labels. I wanted something that works with data separated into folders, without CSV labels (the way this dataset is arranged).

sayko · June 21, 2018, 9:25am

Try changing:
ImageClassifierData.from_csv
to
ImageClassifierData.from_paths

AbhigyanBose · June 21, 2018, 9:39am

I didn’t understand how to do that. However, just changing the paths in the Dogs vs Cats notebook seems to be working. I think fastai picks up the number of classes from the folders, because I am getting decent accuracy on my validation set.

Thanks a lot for helping me.
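
For readers wondering how a folder layout can carry the labels for any number of classes, here is a rough illustration of the folder-per-class convention. It is not fastai's actual implementation, just a sketch under my own assumptions: the function names are hypothetical, and I assume the classes are the subfolder names of `train` in alphabetical order.

```python
from pathlib import Path

def classes_from_folders(path, split="train"):
    """Return the class names, i.e. the sorted subfolder names of path/split."""
    split_dir = Path(path) / split
    return sorted(p.name for p in split_dir.iterdir()
                  if p.is_dir() and not p.name.startswith("."))

def labelled_files(path, split="train"):
    """Return (classes, [(file, class_index), ...]) for one split."""
    classes = classes_from_folders(path, split)
    items = []
    for idx, name in enumerate(classes):
        for f in sorted((Path(path) / split / name).iterdir()):
            # Skip hidden files such as .DS_Store.
            if f.is_file() and not f.name.startswith("."):
                items.append((f, idx))
    return classes, items
```

Adding a third folder such as `sheep` next to `cow` and `horse` would simply give three class names and indices 0 to 2, with no CSV file involved.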

schrodinator · June 27, 2018, 6:29pm

Hi David (and other Deep Learners),
Your explanation of how to get your own dataset working with the Lesson 1 notebook is excellent, but here is a note about best practice. I noticed that cow2.jpg, cow3.jpg, etc. appear in both your “train” and “valid” “cow” folders. Your validation set should never contain images that are also in your training set! The idea of a validation set is similar to that of a test set: it should contain images that the model has never seen before, to give an idea of how the model will perform “in real life.”

A more detailed explanation can be found in the top answer to this StackOverflow question.
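
One way to check for this kind of overlap is to compare file contents rather than file names, since different images can share a name and the same image can be saved under two names. The sketch below is only an illustration with hypothetical function names; it assumes the `train`/`valid` layout from the first post and only catches byte-identical copies, not resized or re-encoded versions of the same picture.

```python
import hashlib
from pathlib import Path

def file_hashes(folder):
    """Map the SHA-256 digest of each file's bytes to the files that have it."""
    hashes = {}
    for f in Path(folder).rglob("*"):
        if f.is_file() and not f.name.startswith("."):
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            hashes.setdefault(digest, []).append(f)
    return hashes

def find_leaks(path):
    """Return (train_files, valid_files) pairs that share identical contents."""
    train = file_hashes(Path(path) / "train")
    valid = file_hashes(Path(path) / "valid")
    return [(train[h], valid[h]) for h in train.keys() & valid.keys()]
```

An empty list from `find_leaks("cowhorse")` would mean no byte-identical image appears in both sets.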

ry101 · June 29, 2018, 4:06am

Hi,

I got 3 categories working by creating 3 folders and changing the path (for example, to flowers).

But I’m not sure I understand the results.

Before, numbers close to 1 were dogs and numbers close to 0 were cats, but how does it work with 3 classes?

Also, is it possible to put a picture in a test folder and get its class?
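
On reading results with more than two classes: as far as I understand it, the Lesson 1 notebook gets one row of log-probabilities per image with one column per class, and the single number for dogs vs cats is just the exponential of the second column. With 3 classes you would take the column with the highest value instead. The sketch below uses made-up numbers and hypothetical class names, and assumes the column order matches the alphabetical order of the class folders.

```python
import numpy as np

# Hypothetical class names, in alphabetical folder order.
classes = ["daisy", "rose", "tulip"]

# Made-up log-probabilities for two images (one row per image).
log_preds = np.log(np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.1, 0.8]]))

probs = np.exp(log_preds)                # back to probabilities, rows sum to 1
pred_idx = np.argmax(log_preds, axis=1)  # most likely class index per image
pred_names = [classes[i] for i in pred_idx]
print(pred_names)  # ['daisy', 'tulip']
```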

fortydegrees · June 29, 2018, 8:02am

Thanks for pointing this out. I think they were different images that just happened to have the same name. But yes, very important!

suniljeph · December 14, 2018, 7:14pm

Hi all, I am new to ML/DL. I was trying to do Lesson 1 with my own data to differentiate cricket and baseball pictures. I used 15 pictures in each training set and 5 in each validation set, like @fortydegrees (David Patt) did.
I am using Paperspace and get this error when running the code:
arch=resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.1, 3)

PermissionError                           Traceback (most recent call last)
<ipython-input-23-e95e065c3e44> in <module>
      1 arch=resnet34
      2 data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
----> 3 learn = ConvLearner.pretrained(arch, data, precompute=True)
      4 learn.fit(0.1, 3)

~/fastai/courses/dl1/fastai/conv_learner.py in pretrained(cls, f, data, ps, xtra_fc, xtra_cut, custom_head, precompute, pretrained, **kwargs)
    112         models = ConvnetBuilder(f, data.c, data.is_multi, data.is_reg,
    113             ps=ps, xtra_fc=xtra_fc, xtra_cut=xtra_cut, custom_head=custom_head, pretrained=pretrained)
--> 114         return cls(data, models, precompute, **kwargs)
    115 
    116     @classmethod

~/fastai/courses/dl1/fastai/conv_learner.py in __init__(self, data, models, precompute, **kwargs)
     95     def __init__(self, data, models, precompute=False, **kwargs):
     96         self.precompute = False
---> 97         super().__init__(data, models, **kwargs)
     98         if hasattr(data, 'is_multi') and not data.is_reg and self.metrics is None:
     99             self.metrics = [accuracy_thresh(0.5)] if self.data.is_multi else [accuracy]

~/fastai/courses/dl1/fastai/learner.py in __init__(self, data, models, opt_fn, tmp_name, models_name, metrics, clip, crit)
     34         self.tmp_path = tmp_name if os.path.isabs(tmp_name) else os.path.join(self.data.path, tmp_name)
     35         self.models_path = models_name if os.path.isabs(models_name) else os.path.join(self.data.path, models_name)
---> 36         os.makedirs(self.tmp_path, exist_ok=True)
     37         os.makedirs(self.models_path, exist_ok=True)
     38         self.crit = crit if crit else self._get_crit(data)

~/anaconda3/envs/fastai/lib/python3.6/os.py in makedirs(name, mode, exist_ok)
    218             return
    219     try:
--> 220         mkdir(name, mode)
    221     except OSError:
    222         # Cannot rely on checking for EEXIST, since the operating system

PermissionError: [Errno 13] Permission denied: '../../../data/baseballcricket/tmp'

All/any help will be deeply appreciated.

- Sunil

suniljeph · December 14, 2018, 7:29pm

Also, I keep getting this .DS_Store file in the folder and I can’t delete it.

Thanks again for the help.

marcmuc · December 14, 2018, 7:38pm

No, you won’t be able to delete it; it’s a Mac system file (see here).
But it should be irrelevant, because fastai only looks for files with image suffixes (or the ones you specify). I think it mostly ignores system files starting with a dot, and if need be you can simply remove it from a list of path objects.
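
If you ever build the file list yourself, dropping such files is straightforward. A minimal sketch, where the function name and the suffix list are my own assumptions rather than anything fastai defines:

```python
from pathlib import Path

# Assumption: the image types you care about.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}

def image_files(folder):
    """List image files under folder, skipping hidden files like .DS_Store."""
    return sorted(f for f in Path(folder).rglob("*")
                  if f.is_file()
                  and not f.name.startswith(".")
                  and f.suffix.lower() in IMAGE_SUFFIXES)
```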

marcmuc · December 14, 2018, 7:42pm

If you think the permissions are correct, then as a workaround just manually create the folders it complains about not being able to create (tmp here).
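
To find out which folders are affected before training, something like the following could help. The traceback above shows the learner creating `tmp` and `models` under the data path, so this sketch tries the same thing and reports the outcome; the function name is hypothetical and the data folder itself must already exist. If it reports a permission error, the folder has to be created (or its ownership fixed) from a shell with sufficient rights.

```python
import os
from pathlib import Path

def check_writable(data_path, subdirs=("tmp", "models")):
    """Try to create the subfolders under data_path and report what happened."""
    data_path = Path(data_path)
    report = {"data_dir_writable": os.access(data_path, os.W_OK | os.X_OK)}
    for name in subdirs:
        try:
            (data_path / name).mkdir(exist_ok=True)
            report[name] = "ok"
        except PermissionError as err:
            report[name] = f"permission denied: {err}"
    return report
```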

suniljeph · December 14, 2018, 10:33pm

Thank you! It worked!

parthi2929 · December 15, 2018, 2:57am

What is the equivalent of fastai 0.7’s rand_by_correct in fastai 1.0?