Lesson 1 part 1 v2 custom images

radi · January 22, 2018, 1:18pm

At the end of the video, Jeremy suggests trying to run the image classification on our own image set.

How do you go about it?

My understanding is that I need to download a bunch of images of two different objects and put those in different folders, but I am not clear on the folder naming or location.

The current dataset folder structure looks like this:

ls data/dogscats/
  models/
  sample/
  test1/
  tmp/
  train/
  valid/

Let’s say that I want to classify bears and deer. I assume that I need to create a folder in data/bearsdeer/train/ and put a bunch of bear and deer images in that. After that, I put another bunch of bear, deer, and other images in data/bearsdeer/valid. Is that correct?

What are the other folders for and do I need to worry about those?

reshama · January 22, 2018, 2:49pm

@radi
I’ve written up answers to your questions and added them to the FAQs for beginners:
sample directory structure

radi · January 23, 2018, 12:42pm

I am confused by your bullet points 3 and 4.

Let’s say I got 100 images of bears and 100 images of deer.
How many of those should I put in the sample folder?
How many in the test1?
How many in the train?
How many in the valid?

ecdrid · January 23, 2018, 12:46pm

Since the dataset is small enough, we can skip the sample dir.

train will contain 80 images out of which 5 for validation and the remaining 20 for test might be a reasonable split..

Generally, people follow an 80-20 split.

radi · January 23, 2018, 12:54pm

@ecdrid

out of which 5 for validation

What do you mean by that? Please be specific.

Do you know of a website from which I could download a larger data set of animals? I was going to download images from Google image search manually. Is there a faster way?

reshama · January 23, 2018, 2:21pm

Fortunately, there are browser plug-ins that will download a lot of images at once:
getting image data

In terms of the split, you can do:
80% / 15% / 5% (train / validation / test)
75% / 15% / 10% (train / validation / test)
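
To make the arithmetic concrete, here is a minimal Python sketch of how such a split could be computed for one class of images. The function names split_counts and split_files are hypothetical helpers written for illustration; they are not part of the course library, and the rounding rule is just one reasonable choice.

```python
import random

def split_counts(n, fractions=(0.80, 0.15, 0.05)):
    """Return (train, valid, test) counts that always sum to n."""
    n_valid = round(n * fractions[1])
    n_test = round(n * fractions[2])
    # Whatever is left after validation and test goes to training.
    return n - n_valid - n_test, n_valid, n_test

def split_files(files, fractions=(0.80, 0.15, 0.05), seed=42):
    """Shuffle the file names reproducibly, then cut them into three lists."""
    files = sorted(files)
    random.Random(seed).shuffle(files)
    n_train, n_valid, _ = split_counts(len(files), fractions)
    train = files[:n_train]
    valid = files[n_train:n_train + n_valid]
    test = files[n_train + n_valid:]
    return train, valid, test
```

With 100 images of one class, split_counts(100) gives 80 / 15 / 5, and split_counts(100, (0.75, 0.15, 0.10)) gives 75 / 15 / 10. Running the split once per class keeps the classes represented in the same proportions in every folder.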

radi · January 23, 2018, 3:22pm

@reshama Do you mean test1?

I also found http://www.image-net.org. It seems to have an extensive collection of images.

reshama · January 23, 2018, 5:00pm

I don’t think it matters if the directory is called test or test1

radi · January 23, 2018, 7:25pm

train/test directory: you create these and separate the data using your own data sample

Do you mean train/valid?

P.S. If the data set has, let’s say, 50% more images of one class, how would that affect the results?

wgpubs · January 23, 2018, 7:32pm

You can also check out this post: How to scrape the web for images?

There was some discussion on various techniques; this is an open-source solution I used that worked well for me.

sandip · January 24, 2018, 10:08pm

I used the Fatkun batch download extension for Chrome (suggested in this forum) and downloaded the images (*.jpg) to an SD card. What is the preferred method to split the images into the folders? I am using Paperspace from a Chromebook.
I found this script from part 1 v1 to create the data folders but don’t know what to do with it: Kaggle Questions
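
One possible way to do the splitting, sketched in Python: the snippet below assumes the downloaded images are already sorted into one folder per class (for example raw/bears and raw/deer) and copies them into train/<class>/ and valid/<class>/ folders, mirroring the layout of the dogscats dataset. The function name build_dataset, the 20% validation fraction, and the folder names are assumptions for illustration, not the script linked above.

```python
import random
import shutil
from pathlib import Path

def build_dataset(src, dst, valid_frac=0.2, seed=0):
    """Copy src/<class>/*.jpg into dst/train/<class>/ and dst/valid/<class>/.

    Returns a dict mapping class name -> (n_train, n_valid).
    """
    src, dst = Path(src), Path(dst)
    counts = {}
    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        images = sorted(class_dir.glob("*.jpg"))
        # Shuffle reproducibly so the validation set is a random sample.
        random.Random(seed).shuffle(images)
        n_valid = int(len(images) * valid_frac)
        for i, img in enumerate(images):
            split = "valid" if i < n_valid else "train"
            target = dst / split / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy2(img, target / img.name)  # copy, so the originals stay put
        counts[class_dir.name] = (len(images) - n_valid, n_valid)
    return counts
```

A call such as build_dataset("raw", "data/bearsdeer") would then leave the originals untouched and return how many images of each class went into each split.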

chrispmaag · January 25, 2018, 8:36am

Could someone help me understand how I get the images I download to be available within Paperspace? The folder structure for separating the images makes sense to me, but I’m not sure how to get the images from my laptop to the Paperspace instance.

ecdrid · January 25, 2018, 9:01am

You don’t have to get it from your laptop to Paperspace…

Paperspace itself is like a computer for you…

Use the Paperspace terminal to download the dataset into the appropriate directory…

chrispmaag · January 25, 2018, 9:10am

So, for example, I used the Image Downloader Chrome extension to grab several basketball and soccer photos from Google Images. That process saved them to my Downloads folder on my laptop. What command within the terminal on Paperspace do I use to grab the dataset from those folders?

If I’m supposed to do it entirely through the Paperspace terminal, I’m not sure how to grab several images at once without using a browser. Any tips?

YJP · January 25, 2018, 10:21am

Hello,

I had to do a similar task. Suggestions that you might want to try are:

Windows: https://it.cornell.edu/managed-servers/transfer-files-using-putty
Mac/Linux: https://unix.stackexchange.com/questions/188285/how-to-copy-a-file-from-a-remote-server-to-a-local-machine
(For Paperspace, the remote one would be paperspace@xxx.xxx.xxx.xx:/folder/you/want, where xxx.xxx.xxx.xx is your IP address.)

chrispmaag · January 27, 2018, 6:55pm

Thank you, YJP!

In case anyone else is in a similar position, this is the command that worked for me on a Mac to transfer from my laptop to the Paperspace instance.

scp /path_to_cricket_folder/cricket{1..10}.jpeg paperspace@xxx.xxx.xxx.xx:/home/paperspace/fastai/courses/dl1/data/cricket

where xxx.xxx.xxx.xx is the Paperspace IP address.

The brace expansion {1..10} lets me transfer all of the files inside the cricket folder at once, where each file name ends in a number from 1 through 10, e.g. cricket1.jpeg.
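
For anyone who would rather build the same command from a script, here is a small Python sketch of what the shell's brace expansion produces. The helper name scp_command is hypothetical, and the remote string is passed in as-is, so the placeholder IP address still has to be filled in by hand.

```python
def scp_command(local_dir, stem, first, last, remote, ext="jpeg"):
    """Build the argument list the shell produces from stem{first..last}.ext."""
    files = [f"{local_dir}/{stem}{i}.{ext}" for i in range(first, last + 1)]
    # scp accepts several source files followed by a single destination.
    return ["scp", *files, remote]

cmd = scp_command(
    "/path_to_cricket_folder", "cricket", 1, 10,
    "paperspace@xxx.xxx.xxx.xx:/home/paperspace/fastai/courses/dl1/data/cricket",
)
print(" ".join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) from the laptop to start the transfer. If the files are not numbered consecutively, scp -r with the folder path copies the whole folder instead.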

priyal · February 2, 2018, 3:05pm

This thread helped a lot!
I tried to classify images of beaches and mountains. I created a small dataset of about 80 images on my own system and used scp to transfer the whole folder at once to the Paperspace machine.

As mentioned by Reshama, I created a separate folder called projects in the home directory of Paperspace, and that is where I’ll be doing all my experiments. So ‘/home/paperspace/projects/beachesmountains’ would be the PATH in the lesson1.ipynb notebook.
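
Before pointing the notebook at a new PATH, it can help to check that the folders look the way the notebook expects. The following is a minimal Python sketch that counts images per class under train and valid; the function name count_images and the list of file extensions are assumptions for illustration, and the train/<class>/ layout is assumed to mirror the dogscats dataset.

```python
from pathlib import Path

def count_images(path, splits=("train", "valid"),
                 exts=(".jpg", ".jpeg", ".png")):
    """Return {split: {class_name: image_count}}, or None for a missing split."""
    path = Path(path)
    summary = {}
    for split in splits:
        split_dir = path / split
        if not split_dir.is_dir():
            summary[split] = None  # the folder the notebook needs is absent
            continue
        summary[split] = {
            d.name: sum(1 for f in d.iterdir() if f.suffix.lower() in exts)
            for d in sorted(split_dir.iterdir()) if d.is_dir()
        }
    return summary
```

Calling count_images("/home/paperspace/projects/beachesmountains") should list the same class names under both train and valid; a None value or a class with zero images points to a transfer or naming problem.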

jeremy · February 4, 2018, 3:46am

Be sure to tell us how you go!

priyal · February 10, 2018, 4:51pm

So I tried classifying beaches and mountains with a very small dataset (along the lines of what Nikhil B did here: Cricket or baseball? Lesson 1 with small datasets).

I used https://github.com/hardikvasa/google-images-download to download 40 images each of beaches and mountains, and used about 30% of these images for the validation set.
I tried using the LR finder with a reduced batch size (of 2) to find the optimal learning rate, but since the number of images is very small, I didn’t get a decent plot. So, by trial and error, I settled on 0.01 as the learning rate.

I got 100% accuracy pretty quickly, probably because the first 40 images of either category downloaded from Google Search are pretty recognizable, and binary classification in such a scenario would not be too difficult.

One thing I could not understand was that the training loss was higher than the validation loss. However both of them kept decreasing even after a number of epochs (after 100% accuracy), indicating that the model was becoming surer of its predictions.

Even the most uncertain predictions are separated well enough.

Data augmentation and differential learning rates after unfreezing the model did not have a significant impact here.

beecoder · February 10, 2018, 5:11pm

Great results!
Looks like you got 100% accuracy even though the training loss is relatively high (which usually means underfitting), and you got there pretty quickly. Maybe next you could try something more difficult (rivers vs oceans?), so that you’ll have to adjust the learning rate, augmentation, etc.

By the way, someone else wrote a utility to download many Google images here; I haven’t gotten around to trying it yet.