Lesson 1 part 1 v2 custom images

Fortunately, there are browser plug-ins that will download a lot of images at once
getting image data

In terms of the split, you can do either of these:
80% / 15% / 5% (train / validation / test)
75% / 15% / 10% (train / validation / test)
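
If it helps, here is a rough sketch of how a split like that could be done in Python. The function name `split_dataset`, the fixed seed, and the placeholder file names are all made up for illustration; nothing here comes from the course library.

```python
import random

def split_dataset(items, fractions=(0.80, 0.15, 0.05), seed=0):
    """Shuffle items and cut them into train/validation/test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_valid = round(n * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder, so no file is dropped
    return train, valid, test

# Placeholder file names, just to exercise the function.
files = [f"img_{i:03d}.jpg" for i in range(100)]
train, valid, test = split_dataset(files)
print(len(train), len(valid), len(test))
```

Passing `fractions=(0.75, 0.15, 0.10)` gives the second split instead.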

@reshama Do you mean test1?

I also found http://www.image-net.org. It seems to have an extensive collection of images.

I don’t think it matters whether the directory is called test or test1.

train/test directory: you create these and separate the data using your own data sample

Do you mean train/valid?

P.S. If the data set has, let’s say, 50% more images of one class, how would that affect the results?

You can also check out this post: How to scrape the web for images?

There was some discussion of various techniques; this is an open-source solution I used that worked well for me.

I used the Fatkun batch download extension for Chrome (suggested in this forum) and downloaded the images (*.jpg) to an SD card. What is the preferred method to split the images into the folders? I am using Paperspace from a Chromebook.
I found this script from part 1 v1 to create the data folders but don’t know what to do with it: Kaggle Questions

Could someone help me understand how I get the images I download to be available within Paperspace? The folder structure for separating the images makes sense to me, but I’m not sure how to get the images from my laptop to the Paperspace instance.

You don’t have to get it from your laptop to Paperspace…

Paperspace itself is like a computer for you…

Use the Paperspace terminal to download the dataset directly into the appropriate directory…

So, for example, I used the Image Downloader Chrome extension to grab several basketball and soccer photos from Google Images. That process saved them to the Downloads folder on my laptop. What command in the Paperspace terminal do I use to grab the dataset from that folder?

If I’m supposed to do it entirely through Paperspace terminal, I’m not sure how to grab several images at once without using a browser. Any tips?




I had to do a similar task. Suggestions that you might want to try are:

Windows: https://it.cornell.edu/managed-servers/transfer-files-using-putty
Mac/Linux: https://unix.stackexchange.com/questions/188285/how-to-copy-a-file-from-a-remote-server-to-a-local-machine
(For Paperspace, the remote side would be paperspace@xxx.xxx.xxx.xx:/folder/you/want, where xxx.xxx.xxx.xx is your instance’s IP address.)


Thank you, YJP!

In case anyone else is in a similar position, this is the command that worked for me on a Mac to transfer from my laptop to the Paperspace instance.

scp /path_to_cricket_folder/cricket{1..10}.jpeg paperspace@xxx.xxx.xxx.xx:/home/paperspace/fastai/courses/dl1/data/cricket

where xxx.xxx.xxx.xx is the Paperspace IP address.

The brace notation {1..10} (two plain dots, not an ellipsis character) is expanded by the shell into the numbers 1 through 10, so a single command transfers cricket1.jpeg through cricket10.jpeg from the cricket folder.
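
For anyone who would rather script the transfer, here is a small sketch of what that expansion produces and how the same scp call could be assembled in Python. The helper names `numbered_files` and `build_scp_command` are hypothetical, and the paths and IP are the same placeholders as in the command above. The sketch only builds and prints the command; it does not run it.

```python
import shlex

def numbered_files(folder, stem, start, stop, ext):
    """Mimic the shell expansion of stem{start..stop}.ext for a numeric range."""
    return [f"{folder}/{stem}{i}.{ext}" for i in range(start, stop + 1)]

def build_scp_command(sources, remote):
    """Return the argument list for: scp <sources...> <remote>."""
    return ["scp", *sources, remote]

sources = numbered_files("/path_to_cricket_folder", "cricket", 1, 10, "jpeg")
# Placeholder address, exactly as in the post above.
remote = "paperspace@xxx.xxx.xxx.xx:/home/paperspace/fastai/courses/dl1/data/cricket"
cmd = build_scp_command(sources, remote)
print(shlex.join(cmd))  # shows the command without running it
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` once the real path and IP address are filled in.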


This thread helped a lot!
I tried to classify images of beaches and mountains. I created a small dataset of about 80 images on my own machine and used scp to transfer the whole folder to the Paperspace machine at once.

As mentioned by Reshama, I created a separate folder called projects in the home directory of Paperspace, and that is where I’ll be doing all my experiments. So ‘/home/paperspace/projects/beachesmountains’ would be the PATH in the lesson1.ipynb notebook.
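
Here is a rough sketch of one way to lay out such a folder, assuming the train/valid layout with one subfolder per class that the Lesson 1 notebook uses for cats vs dogs. The function name `build_dataset`, the `raw` source folder, the 30% validation fraction, and the 40 empty placeholder files per class are all assumptions for the demo, which runs in a temporary directory rather than on real images.

```python
import random
import shutil
import tempfile
from pathlib import Path

def build_dataset(src_dir, dest_dir, valid_fraction=0.3, seed=0):
    """Copy src_dir/<class>/*.jpg into dest_dir/{train,valid}/<class>/."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    rng = random.Random(seed)
    for class_dir in sorted(p for p in src_dir.iterdir() if p.is_dir()):
        images = sorted(class_dir.glob("*.jpg"))
        rng.shuffle(images)
        n_valid = round(len(images) * valid_fraction)
        for i, image in enumerate(images):
            split = "valid" if i < n_valid else "train"
            target = dest_dir / split / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy2(image, target / image.name)

# Demo on throwaway directories filled with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    for cls in ("beaches", "mountains"):
        (raw / cls).mkdir(parents=True)
        for i in range(40):
            (raw / cls / f"{cls}_{i}.jpg").touch()
    dest = Path(tmp) / "beachesmountains"
    build_dataset(raw, dest)
    counts = {
        f"{split}/{cls}": len(list((dest / split / cls).glob("*.jpg")))
        for split in ("train", "valid")
        for cls in ("beaches", "mountains")
    }
    print(counts)
```

On the real machine, `dest` would be the folder that PATH points to in the notebook.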


Be sure to tell us how you go!


So I tried classifying beaches and mountains with a very small dataset (along the lines of what Nikhil B did in “Cricket or baseball? Lesson 1 with small datasets”).

  1. I used https://github.com/hardikvasa/google-images-download to download 40 images each of beaches and mountains, and used about 30% of these for the validation set.

  2. I tried using the LR finder with a reduced batch size (of 2) to find the optimal learning rate, but since there are so few images, I didn’t get a decent plot. So by trial and error, I settled on 0.01 as the learning rate.

  3. I got 100% accuracy pretty quickly, probably because the first 40 images of either category downloaded from Google Search are pretty recognizable, and binary classification in such a scenario would not be too difficult.

  4. One thing I could not understand was that the training loss was higher than the validation loss. However, both of them kept decreasing for a number of epochs after accuracy reached 100%, indicating that the model was becoming surer of its predictions.

  5. Even the most uncertain predictions are separated well enough.

  6. Data augmentation and differential learning rates after unfreezing the model did not have a significant impact here.
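
On point 4, a tiny sketch of why the loss can keep falling after accuracy hits 100%: accuracy only checks which side of 0.5 a prediction lands on, while cross-entropy loss also rewards confidence. The probabilities below are invented for illustration and are not taken from the actual run.

```python
import math

def accuracy(probs, labels):
    """Fraction of predictions on the correct side of 0.5."""
    return sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)

def log_loss(probs, labels):
    """Mean binary cross-entropy; probs are predicted P(label == 1)."""
    return -sum(
        math.log(p if y == 1 else 1 - p) for p, y in zip(probs, labels)
    ) / len(labels)

labels = [1, 1, 0, 0]
early = [0.70, 0.65, 0.30, 0.40]  # all correct, but not very confident
late = [0.99, 0.98, 0.02, 0.01]   # still all correct, now much more confident

for name, probs in (("early", early), ("late", late)):
    print(name, accuracy(probs, labels), round(log_loss(probs, labels), 3))
```

Both sets of predictions score 100% accuracy, but the later, more confident set has a much lower loss.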


Great results!
Looks like you got 100% accuracy even though the training loss is relatively high (which usually means the model is underfitting), and you got there pretty quickly. Maybe next you could try something more difficult (rivers vs oceans?), so that you’ll have to adjust the learning rate, augmentation, etc.

By the way, someone else wrote a utility to download many Google images here; I haven’t gotten around to trying it yet.


Thanks so much!

@radi http://forums.fast.ai/t/how-to-scrape-the-web-for-images/7446/15 may help for downloading images.

OK, here is my first attempt 🙂 Hammers vs Screwdrivers …
GitHub: Lesson 1 HW Hammer vs Screw Driver

  • train folder = 47 pics of each category
  • valid folder = 11 pics of each category
  • played with the Learning Rate
  1. Learning Rate is just right … good accuracy of 91%:

  2. Learning Rate is too small …accuracy stagnates at 54%:

  3. Learning Rate is too big … accuracy is bad at 41% and the loss just explodes!!!

  4. No idea why “learn.sched.plot()” is linear:

  5. Finally …

  6. Some interesting results:

Here are some of the ancillary things I used/learnt:

  • Images by web scraping: https://github.com/hardikvasa/google-images-download
    ** ran it from the Windows cmd prompt after a pip install; no need to open a Python prompt, and the commands are typed without the leading dollar sign
    ** unless specified otherwise, it downloads the pics to the ‘Downloads’ folder

  • Image folder structure for the Paperspace cloud machine: How to use your own Dataset for Lesson 1

  • Considered using PuTTY to transfer files to the Paperspace cloud machine: https://it.cornell.edu/managed-servers/transfer-files-using-putty
    ** but ended up using Jupyter Notebook in the Chrome browser to upload the zipped folder; could not unzip it inside the browser
    ** went back to the Paperspace console to unzip

  • Change PATH in the Jupyter notebook code from the cats vs dogs example to whatever you chose
    ** there were a couple of other locations where the cats/dogs names had to be renamed to your classes

  • Password for the Paperspace cloud machine could not be copy-pasted, so I have to type it every time 🤨

  • If you cannot access the Jupyter Notebook in the browser, check whether your network provider is blocking port 8888

  • If you frequently stop and start Jupyter Notebook, the default port 8888 can become unavailable, and the Paperspace machine will start opening Jupyter Notebook on subsequent ports such as 8889
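
For the port issues in the last two bullets, a quick way to check from your own machine whether a port is reachable is a plain TCP connection attempt. This is only a sketch: `port_is_open` is a made-up helper name, and the demo below tests against a throwaway local listener instead of a real Paperspace machine.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, etc.
        return False

# Demo against a temporary local listener on a port chosen by the OS.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
local_port = listener.getsockname()[1]
print(port_is_open("127.0.0.1", local_port))  # listener is up
listener.close()
print(port_is_open("127.0.0.1", local_port))  # listener is gone
```

To check the real machine, call `port_is_open` with the Paperspace IP address and 8888 (or 8889 and so on). A False result can mean either that Jupyter is not listening on that port or that something on the network path is blocking it.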

Thanks 👍


Update on the most uncertain screwdriver:
@jeremy may have answered why this last screwdriver has such an ambiguous score of 0.48. According to his Lesson 2 DL 2018 (https://www.youtube.com/watch?v=JNxcznsrRb8), the software crops the test set pictures to a square shape, so the model sees only the middle of this picture, which is not that obvious. It can be overcome by data augmentation, which I did not play with in my first homework. Thanks @jeremy for such an awesome tutorial 🤗
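
To get a feel for how much a square crop can throw away, here is a small sketch of a center crop on a long, thin picture. This is not the library’s actual code, just the basic arithmetic, and the 1200x300 image size is a made-up example.

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side

# Hypothetical wide photo of a screwdriver lying on its side.
width, height = 1200, 300
box = center_crop_box(width, height)
kept = (box[2] - box[0]) / width
print(box, f"{kept:.0%} of the width is kept")
```

With a picture that wide, only the middle quarter survives the crop, so the handle and the tip can both end up outside the square.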