How to use your own Dataset for Lesson 1

I am a complete novice with machine learning and have just finished using my own dataset for the first half of lesson 1.
I couldn’t find any step-by-step instructions, and considering it’s the first bit of self-learning required, thought it would be useful if I posted what I did.
Any feedback on the method (inaccuracies made by me) would be appreciated.
The uploading part obviously only applies to those using Paperspace.

These resources were helpful to me:

Either way, I hope this is helpful to a beginner in a similar situation.

  • Collect at least 15 images for each category
  • Create a folder locally. I used ‘cowhorse’.
  • Inside this folder, create two subfolders: ‘train’ and ‘valid’
  • Inside both of these subfolders (train and valid), create another two subfolders for your two categories. In my case ‘cow’ and ‘horse’.
  • Put around 80% of your images in the ‘train’ subfolders, and the rest in the ‘valid’ subfolders.
  • Your finished structure should look like this:
  • Zip up this folder
  • SFTP into your Paperspace box
  • Navigate to /fastai/courses/dl1/data/ and upload your .zip file here
  • Unzip the file
  • In your Jupyter notebook, change PATH to the directory you just uploaded
  • Due to the low sample size, you may want to change the learning rate of the model (I changed mine to 0.2)
  • Run through all the commands and you should see the model using your new dataset.
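
The folder layout from the steps above could also be built with a short script instead of by hand. This is only a sketch, not part of the course material: it assumes a hypothetical source folder with one subfolder per category (for example raw/cow and raw/horse), and the function name split_dataset and its parameters are made up for illustration. Each image is copied into exactly one of train or valid.

```python
import random
import shutil
from pathlib import Path


def split_dataset(src, dst, valid_frac=0.2, seed=0):
    """Copy src/<class>/* into dst/train/<class>/ and dst/valid/<class>/.

    Roughly valid_frac of each class goes to 'valid', the rest to 'train'.
    """
    src, dst = Path(src), Path(dst)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        # Skip hidden files such as .DS_Store
        images = sorted(p for p in class_dir.iterdir()
                        if p.is_file() and not p.name.startswith('.'))
        rng.shuffle(images)
        n_valid = max(1, round(len(images) * valid_frac))
        for i, img in enumerate(images):
            subset = 'valid' if i < n_valid else 'train'
            target = dst / subset / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy2(img, target / img.name)
```

Calling split_dataset('raw', 'cowhorse') would then produce cowhorse/train/cow, cowhorse/train/horse, cowhorse/valid/cow and cowhorse/valid/horse, ready to zip and upload.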

As an aside, I’m wondering if someone could shed some light on an error I’m getting when it tries to return incorrect labels:

ValueError: Cannot take a larger sample than population when ‘replace=False’

Is this simply because there are no incorrectly labeled images? I can’t imagine this to be the case as it doesn’t seem like throwing an error is the way to report this.

Your validation set has only 10 images. rand_by_correct tries to return 4 random images that are correctly or incorrectly classified. Because you don’t have 4 incorrect images, it’s giving an error. Unfortunately, the 4 is hardcoded. There’s also ImageModelResults, which we should probably be using for this task. I am going to take a closer look and get back to you on the best way to fix it.

You could also just change the rand_by_mask function:

def rand_by_mask(mask): return np.random.choice(np.where(mask)[0], 4, replace=False)
# TO
def rand_by_mask(mask):
    mask_idxs = np.where(mask)[0]
    cnt = min(4, len(mask_idxs))
    return np.random.choice(mask_idxs, cnt, replace=False)
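
To see the effect of that change outside the notebook, here is a small self-contained sketch. The mask values are made up, and the n parameter is an addition for illustration (the notebook version hardcodes 4).

```python
import numpy as np


def rand_by_mask(mask, n=4):
    """Return up to n random indices where mask is True."""
    mask_idxs = np.where(mask)[0]
    cnt = min(n, len(mask_idxs))  # never ask for more samples than exist
    return np.random.choice(mask_idxs, cnt, replace=False)


# Made-up example: 10 validation images, only 2 of them misclassified
is_wrong = np.array([False, True, False, False, False,
                     False, True, False, False, False])

picked = rand_by_mask(is_wrong)
print(sorted(picked))  # the indices of the two misclassified images
```

With the original one-liner, the same mask raises the "Cannot take a larger sample than population" error, because it asks for 4 indices when only 2 are available.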

Great going, this is pretty much what I did. Now that you’re familiar with data collection for your own project, it would be good to explore how the Conv learner works and the effect of different augmentations, learning rates, etc.

I believe a recent PR fixed this. Try doing git pull and see if it works then.

Is there any way to do this for multiple classes, for example cow, horse, or sheep? I searched the forums but couldn’t find anything like that.

This is covered later in the course.
Try searching for “Dog Breed”.

I have gone through it, but it works with CSV labels. I wanted something that works with data separated into folders, without CSV labels (the way this dataset is arranged).

Try changing:

I didn’t understand how to do that. However, just changing the paths in DogsvsCats seems to work. I think fastai picks up how many classes there are, because I am getting decent accuracy on my validation set.

Thanks a lot for helping me.

Hi David (and other Deep Learners),
Your explanation about the method of getting your own data set to work with the Lesson 1 notebook is excellent, but a note about best practices. I noticed that cow2.jpg, cow3.jpg, etc. appear in both your “train” and “valid” “cow” folders. Your validation set should never contain images that are also in your training set! The idea of a validation set is similar to that of a test set – it should contain images that the model has never seen before, to give an idea of how the model will perform “in real life.”

A more detailed explanation can be found in the top answer to this StackOverflow question.
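
One way to check for this kind of leakage is to compare file contents rather than file names, since two different images can share a name and the same image can appear under two names. The following is a rough sketch, not a fastai feature: the function names are hypothetical, and it only catches byte-identical copies (a resized or re-saved copy of the same photo would not be detected).

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's bytes; identical files give identical digests."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_overlap(train_dir, valid_dir):
    """Return (train_path, valid_path) pairs whose contents are identical."""
    train_by_digest = {}
    for p in sorted(Path(train_dir).rglob('*')):
        if p.is_file():
            train_by_digest.setdefault(file_digest(p), p)

    overlap = []
    for p in sorted(Path(valid_dir).rglob('*')):
        if p.is_file():
            match = train_by_digest.get(file_digest(p))
            if match is not None:
                overlap.append((match, p))
    return overlap
```

An empty result from find_overlap('cowhorse/train', 'cowhorse/valid') would mean no validation file is an exact copy of a training file.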

I did 3 categories by adding 3 folders and changing the path, for example to flowers.

But I’m not sure I understand the results.

Before, numbers close to 1 were dogs and numbers close to 0 were cats, but how does it work with 3 classes?

Also, is it possible to put a picture in a test folder and to get its class?
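
On the three-class question: in the Lesson 1 notebook the single number per image is just the second column of the model’s log-predictions passed through exp, so it reads as “probability of dog”. With three classes there is one column per class, and the predicted class is the column with the largest value. The sketch below uses made-up numbers and hypothetical class names, and assumes the model returns one log-probability per class for each image.

```python
import numpy as np

# Hypothetical class names, in the order the columns are assumed to follow
classes = ['daisy', 'rose', 'tulip']

# Made-up log-probabilities for two images (one row per image)
log_preds = np.log(np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.1, 0.8]]))

probs = np.exp(log_preds)                # back to probabilities, rows sum to 1
pred_idx = np.argmax(log_preds, axis=1)  # index of the most likely class
pred_labels = [classes[i] for i in pred_idx]

for label, row in zip(pred_labels, probs):
    print(label, np.round(row, 2))
```

So instead of asking whether one number is close to 0 or 1, you look at which class has the highest probability in each row.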

Thanks for pointing this out. I think that they were different images, just had the same name. But yes, very important!

Hi all, I am new to ML/DL. I was trying to do Lesson 1 with my own data to differentiate cricket and baseball pictures. I used 15 pictures in each training set and 5 in each validation set, like @fortydegress David Patt did.
I am using Paperspace and getting this error when running the code:
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True), 3)

PermissionError                           Traceback (most recent call last)
<ipython-input-23-e95e065c3e44> in <module>
      1 arch=resnet34
      2 data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
----> 3 learn = ConvLearner.pretrained(arch, data, precompute=True)
      4, 3)

~/fastai/courses/dl1/fastai/ in pretrained(cls, f, data, ps, xtra_fc, xtra_cut, custom_head, precompute, pretrained, **kwargs)
    112         models = ConvnetBuilder(f, data.c, data.is_multi, data.is_reg,
    113             ps=ps, xtra_fc=xtra_fc, xtra_cut=xtra_cut, custom_head=custom_head, pretrained=pretrained)
--> 114         return cls(data, models, precompute, **kwargs)
    116     @classmethod

~/fastai/courses/dl1/fastai/ in __init__(self, data, models, precompute, **kwargs)
     95     def __init__(self, data, models, precompute=False, **kwargs):
     96         self.precompute = False
---> 97         super().__init__(data, models, **kwargs)
     98         if hasattr(data, 'is_multi') and not data.is_reg and self.metrics is None:
     99             self.metrics = [accuracy_thresh(0.5)] if else [accuracy]

~/fastai/courses/dl1/fastai/ in __init__(self, data, models, opt_fn, tmp_name, models_name, metrics, clip, crit)
     34         self.tmp_path = tmp_name if os.path.isabs(tmp_name) else os.path.join(, tmp_name)
     35         self.models_path = models_name if os.path.isabs(models_name) else os.path.join(, models_name)
---> 36         os.makedirs(self.tmp_path, exist_ok=True)
     37         os.makedirs(self.models_path, exist_ok=True)
     38         self.crit = crit if crit else self._get_crit(data)

~/anaconda3/envs/fastai/lib/python3.6/ in makedirs(name, mode, exist_ok)
    218             return
    219     try:
--> 220         mkdir(name, mode)
    221     except OSError:
    222         # Cannot rely on checking for EEXIST, since the operating system

PermissionError: [Errno 13] Permission denied: '../../../data/baseballcricket/tmp'

All/any help will be deeply appreciated.

- Sunil

Also, I keep getting this .DS_Store file and I can’t delete it.

Thanks again for the help.

No, you won’t be able to delete it; it’s a Mac system file (see here).
But it should be irrelevant, because fastai only looks for files with image suffixes (or the ones you specify). I think it mostly ignores system files starting with dots, and if need be you can simply remove it from a list of path objects.
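
If you do end up building your own list of files, removing such entries from a list of path objects could look like the sketch below. This is plain pathlib, not fastai code; the helper name and the set of suffixes are assumptions chosen for illustration.

```python
from pathlib import Path

# Assumed set of image suffixes; extend it to match your data
IMAGE_SUFFIXES = {'.jpg', '.jpeg', '.png'}


def image_files(folder):
    """List image files under folder, skipping hidden files like .DS_Store."""
    return sorted(
        p for p in Path(folder).rglob('*')
        if p.is_file()
        and not p.name.startswith('.')          # drop dotfiles
        and p.suffix.lower() in IMAGE_SUFFIXES  # keep only image suffixes
    )
```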

If you think the permissions are correct, then as a workaround just manually create the folders it complains about not being able to create (tmp here).
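
Creating those folders up front could be sketched as follows. The traceback shows the learner calling os.makedirs for a tmp path and a models path; the name 'tmp' comes from the error message, while 'models' as the second folder name is an assumption, and the helper itself is hypothetical. If the notebook’s user really lacks write permission on the data folder, this will fail with the same PermissionError, and the folder’s ownership or permissions would need fixing from a shell first.

```python
import os


def ensure_learner_dirs(data_path, names=('tmp', 'models')):
    """Create the working folders under data_path if they do not exist yet."""
    created = []
    for name in names:
        target = os.path.join(data_path, name)
        os.makedirs(target, exist_ok=True)  # no error if it already exists
        created.append(target)
    return created
```

For the dataset in the traceback, that would be ensure_learner_dirs('../../../data/baseballcricket'), run before creating the learner.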

Thank you! It worked!

What is the equivalent of fastai 0.7’s rand_by_correct in fastai 1.0?