Lesson 1 part 1 v2 custom images

(Radi) #1

At the end of the video Jeremy suggests trying the image classification with our own image set.

How do you go about it?

My understanding is that I need to download a bunch of images of two different objects and put those in different folders, but I am not clear on the folder naming or location.

The current dataset folder structure looks like this:

ls data/dogscats/

Let’s say that I want to classify bears and deer. I assume that I need to create a folder data/bearsdeer/train/ and put a bunch of bear and deer images in it. After that, I put another bunch of bear, deer and other images in data/bearsdeer/valid. Is that correct?

What are the other folders for and do I need to worry about those?

(Reshama Shaikh) #2

I’ve written up answers to your questions, and added them to the FAQs for beginners:
sample directory structure
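
For the bears/deer example, a minimal Python sketch of creating the folders could look like the following. It assumes the same layout as the dogscats dataset from lesson 1, i.e. one subfolder per class inside both train/ and valid/. The function name make_dataset_dirs and the class folder names bears and deer are only illustrative.

```python
from pathlib import Path

def make_dataset_dirs(root, classes, splits=("train", "valid")):
    """Create root/<split>/<class>/ for every split and class; return the paths."""
    created = []
    for split in splits:
        for cls in classes:
            d = Path(root) / split / cls
            # parents=True also creates root and root/<split> if they are missing
            d.mkdir(parents=True, exist_ok=True)
            created.append(d)
    return created

# Example call (assumed names, adjust to your own dataset):
# make_dataset_dirs("data/bearsdeer", ["bears", "deer"])
```

Calling it with "data/bearsdeer" would give data/bearsdeer/train/bears, data/bearsdeer/train/deer, data/bearsdeer/valid/bears and data/bearsdeer/valid/deer.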

(Radi) #3

I am confused by your bullet point 3 and 4.

Let’s say I got 100 images of bears and 100 images of deer.
How many of those should I put in the sample folder?
How many in the test1?
How many in the train?
How many in the valid?

(Aditya) #4

Since the dataset is small, we can skip the sample dir.

train will contain 80 images, out of which 5 for validation, and the remaining 20 for test, might be a reasonable split.

Generally people follow an 80-20 split.

(Radi) #5


out of which 5 for validation

What do you mean by that? Please be specific.

Do you know of a website where I could download a larger data set of animals? I was going to download images from Google image search manually. Is there a faster way?

(Reshama Shaikh) #6

Fortunately, there are browser plug-ins that will download a lot of images at once:
getting image data

In terms of split, you can do:
80% / 15% / 5% (train / validation / test)
75% / 15% / 10% (train / validation / test)
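
As a rough illustration of the first split, here is a small Python sketch that shuffles a list of file names and cuts it into train / validation / test parts. The function name split_files, the fixed seed and the file names are assumptions made for the example. You would run it once per class so each class keeps the same proportions.

```python
import random

def split_files(files, train=0.80, valid=0.15, seed=42):
    """Shuffle file names and return (train, valid, test) lists.

    Whatever is left after the train and valid shares goes to test.
    """
    files = sorted(files)                 # fixed starting order
    random.Random(seed).shuffle(files)    # reproducible shuffle
    n_train = round(len(files) * train)
    n_valid = round(len(files) * valid)
    return (files[:n_train],
            files[n_train:n_train + n_valid],
            files[n_train + n_valid:])

# 100 made-up bear image names, as in the question above
bears = [f"bear{i}.jpg" for i in range(100)]
tr, va, te = split_files(bears)
print(len(tr), len(va), len(te))  # 80 15 5
```

For the 75% / 15% / 10% variant, pass train=0.75 and keep valid=0.15.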

(Radi) #7

@reshama Do you mean test1?

I also found http://www.image-net.org. It seems to have an extensive collection of images.

(Reshama Shaikh) #8

I don’t think it matters if the directory is called test or test1.

(Radi) #9

train/test directory: you create these and separate the data using your own data sample

Do you mean train/valid?

P.S. If the data set has, let’s say, 50% more images of one class, how would that affect the results?

(WG) #10

You can also check out this post: How to scrape the web for images?

There was some discussion of various techniques there. This is an open-source solution I used that worked well for me.

(Sandip) #11

I used the Fatkun batch download extension for Chrome (suggested in this forum) and downloaded the images (*.jpg) to an SD card. What is the preferred method to split the images into the folders? I am using Paperspace from a Chromebook.
I found this script from part 1 v1 to create the data folders but don’t know what to do with it: Kaggle Questions

(Chris Pontarolo-Maag) #12

Could someone help me understand how I get the images I download to be available within Paperspace? The folder structure for separating the images makes sense to me, but I’m not sure how to get the images from my laptop to the Paperspace instance.

(Aditya) #13

You don’t have to get it from your laptop to Paperspace…

Paperspace itself is like a computer for you…

Use the Paperspace terminal to download the dataset to the appropriate directory location…

(Chris Pontarolo-Maag) #14

So, for example, I used the Image Downloader Chrome extension to grab several basketball and soccer photos from Google images. That process saved them to my Downloads folder on my laptop. What command within terminal on Paperspace do I use to grab the dataset from those folders?

If I’m supposed to do it entirely through Paperspace terminal, I’m not sure how to grab several images at once without using a browser. Any tips?



(YJ Park) #15


I had to do a similar task. Suggestions that you might want to try are:

Windows: https://it.cornell.edu/managed-servers/transfer-files-using-putty
Mac/Linux: https://unix.stackexchange.com/questions/188285/how-to-copy-a-file-from-a-remote-server-to-a-local-machine
(for Paperspace, the remote side would be paperspace@xxx.xxx.xxx.xx:/folder/you/want, where xxx.xxx.xxx.xx is your IP address)

(Chris Pontarolo-Maag) #16

Thank you, YJP!

In case anyone else is in a similar position, this is the command that worked for me on a Mac to transfer from my laptop to the Paperspace instance.

scp /path_to_cricket_folder/cricket{1..10}.jpeg paperspace@xxx.xxx.xxx.xx:/home/paperspace/fastai/courses/dl1/data/cricket

where xxx.xxx.xxx.xx is the Paperspace IP address.

The brace notation {1..10} is expanded by the shell into the numbers 1 through 10, so the command transfers cricket1.jpeg through cricket10.jpeg from the cricket folder in one go.
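
If the brace range is unfamiliar, this tiny Python sketch lists the same ten file names that the range 1 through 10 in the scp command stands for. It only builds the names; it does not copy anything.

```python
# File names covered by the 1-through-10 range in the scp command above
names = [f"cricket{i}.jpeg" for i in range(1, 11)]
print(names[0], names[-1], len(names))  # cricket1.jpeg cricket10.jpeg 10
```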


This thread helped a lot!
I tried to classify images of beaches and mountains. I created a small dataset of about 80 images on my own machine and used scp to transfer the whole folder at once to the Paperspace machine.

As mentioned by Reshama, I created a separate folder called projects in the home directory on Paperspace, and that is where I’ll be doing all my experiments. So ‘/home/paperspace/projects/beachesmountains’ would be the PATH in the lesson1.ipynb notebook.
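
Before pointing the notebook at a new PATH, it can help to check that the images landed where expected. The following is a minimal sketch that counts image files per class folder. It assumes the dogscats-style layout of PATH/train/<class>/ and PATH/valid/<class>/. The function name count_images and the list of extensions are illustrative choices.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions

def count_images(path, splits=("train", "valid")):
    """Return {split: {class_name: image_count}} for path/<split>/<class>/ folders."""
    counts = {}
    for split in splits:
        split_dir = Path(path) / split
        if not split_dir.is_dir():
            counts[split] = {}  # split folder is missing
            continue
        counts[split] = {
            d.name: sum(1 for f in d.iterdir() if f.suffix.lower() in IMAGE_EXTS)
            for d in sorted(split_dir.iterdir()) if d.is_dir()
        }
    return counts

# Path from the post above; it only exists on that Paperspace machine
PATH = "/home/paperspace/projects/beachesmountains/"
print(count_images(PATH))
```

An empty dictionary for train or valid means that folder is missing or has no class subfolders yet.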

(Jeremy Howard) #18

Be sure to tell us how you go!


So I tried classifying beaches and mountains with a very small dataset (along the lines of what Nikhil B did here: Cricket or baseball? Lesson 1 with small datasets)

  1. I used https://github.com/hardikvasa/google-images-download to download 40 images each of beaches and mountains. I used about 30% of these images for the validation set.

  2. I tried using the LR finder with a reduced batch size (of 2) to find the optimal learning rate, but since there are very few images, I didn’t get a decent plot. So by trial and error, I settled on 0.01 as the learning rate.

  3. I got 100% accuracy pretty quickly. Probably because the first 40 images of either category downloaded from Google Search are pretty recognizable, and binary classification in such a scenario would not be too difficult.

  4. One thing I could not understand was that the training loss was higher than the validation loss. However both of them kept decreasing even after a number of epochs (after 100% accuracy), indicating that the model was becoming surer of its predictions.

  5. Even the most uncertain predictions are well separated.

  6. Data augmentation and differential learning rates after unfreezing the model did not have a significant impact here.
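
On point 4, the loss can keep falling after accuracy reaches 100% because cross-entropy rewards confidence, not just correctness. A small sketch with made-up probabilities (not the actual model outputs) illustrates this. The helper names mean_log_loss and accuracy are only for this example.

```python
import math

def mean_log_loss(probs_for_true_class):
    """Average cross-entropy, given the probability assigned to the correct class."""
    return -sum(math.log(p) for p in probs_for_true_class) / len(probs_for_true_class)

def accuracy(probs_for_true_class, threshold=0.5):
    """Fraction of samples whose correct class gets more than `threshold` probability."""
    return sum(p > threshold for p in probs_for_true_class) / len(probs_for_true_class)

early = [0.60, 0.70, 0.65, 0.80]  # made-up: every prediction is already correct
later = [0.95, 0.99, 0.97, 0.99]  # made-up: same samples, more confident

for name, probs in [("early", early), ("later", later)]:
    print(name, accuracy(probs), round(mean_log_loss(probs), 3))
```

Both rows show an accuracy of 1.0, but the loss for the later, more confident predictions is much lower.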

(Nikhil B ) #20

Great results!
Looks like you got 100% accuracy even though the training loss is relatively high (which usually means underfitting), and you got there pretty quickly. Maybe next you could try something more difficult (rivers vs oceans?), so that you’ll have to adjust the learning rate, augmentation, etc.

By the way, someone else wrote a utility to download many Google images here. I haven’t gotten around to trying it yet.