Create an image dataset from scratch

@jeremy
Will BMP formats for the images be OK? Thank you for the feedback.

I created my own cats and dogs validation dataset by scraping dog and cat photos from http://www.catbreedslist.com. The site has high-definition photos of 65 breeds of cats and 369 breeds of dogs.

Although the file names differed from the standard ones, it worked just fine, as Jeremy mentioned above. You do need to maintain the folder structure, though.

It gave me 100% accuracy on the already-trained model. One difficulty I faced was that I couldn’t find where to specify the location of the new validation dataset. I had to rename it “valid” and rename the old “valid” to something else.
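
The renaming workaround described above can be scripted so the original validation set is not lost. This is a minimal sketch using only the standard library; the function name `swap_validation_set` and the backup folder name `valid_original` are hypothetical, chosen for illustration. The course library's `ImageClassifierData.from_paths` appears to accept a `val_name` argument in some versions, which would avoid renaming altogether; treat that as an assumption and check the signature in your installed version.

```python
from pathlib import Path


def swap_validation_set(data_root, new_valid, backup_name="valid_original"):
    """Make `new_valid` the active `valid` folder and keep the old one as a backup.

    All names are relative to `data_root`. The backup name is an arbitrary,
    hypothetical choice.
    """
    root = Path(data_root)
    active = root / "valid"
    backup = root / backup_name
    candidate = root / new_valid

    if not candidate.is_dir():
        raise FileNotFoundError(f"{candidate} does not exist")
    if backup.exists():
        raise FileExistsError(f"{backup} already exists; pick another backup name")

    if active.exists():
        active.rename(backup)   # keep the original validation set
    candidate.rename(active)    # the new set now sits where the notebook looks
    return active
```

Calling `swap_validation_set("data/dogscats", "my_valid")` would leave the scraped images under `valid/` and the course images under `valid_original/`.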

I created a Pinterest scraper a while ago which will download all the images from a Pinterest board or a list of boards. It hasn’t been maintained in over a year so use at your own risk (and as of this writing, only supports Python 2.7 but I plan to update it once I get to that part in this lesson.) You’ll also need to install selenium for web scraping and a webdriver for Chrome.

Or you can create your own scrapers: http://automatetheboringstuff.com/chapter11/
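
For anyone writing their own scraper, the core step is pulling image URLs out of a page. Below is a rough sketch using only Python's standard library; `ImageSrcParser` and `extract_image_urls` are hypothetical names, and the sketch only handles plain `<img src=...>` tags, not images loaded by JavaScript (for those you would need something like selenium, as mentioned above).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a usable src
                self.sources.append(src)


def extract_image_urls(html, base_url):
    """Return absolute image URLs found in `html`, resolved against `base_url`."""
    parser = ImageSrcParser()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]
```

The returned URLs can then be fetched one by one, for example with `urllib.request.urlretrieve`.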

That looks like a rather cool book!

Yep, that was the book I used to teach myself Python… and now I’m ready to learn how to use Deep Learning to further automate the boring stuff.

You can now download images in a specific format using this GitHub repository:

hardikvasa/google-images-download

$ googleimagesdownload -k <keyword> -f jpg

Terrific! Do you have a twitter handle? Would love to share this project.

Thanks Jeremy :slight_smile:

I do not have an active Twitter handle but it would be great if you could share this project.

I am adding new features to this repo every week and would love to hear which features folks on this forum need. That way I can plan and integrate those features into the repo.

Are you open to creating one? It’s the best way I have to credit people’s work. It’s also where nearly all my favorite deep learning practitioners and researchers discuss their work.

(Obviously it’s entirely up to you - just wanted to let you know my thinking.)

Hi @jeremy , sure thing!

re-activated my handle from last year… @hnvasa15 it is :slight_smile:

Thanks!

I’m halfway through creating a Python script to take your downloads from google_images_download and split them by whatever percentages you want.

Before I finish, I just realized :worried: I should make sure what we want is a directory structure like in dogscats/.

dogscats
    |-- train
          |-- cats
                |-- catpic0, catpic1, …
          |-- dogs
                |-- dogpic0, dogpic1, …
    |-- valid
          |-- cats
                |-- catpic0+x, catpic1+x, …
          |-- dogs
                |-- dogpic0+x, dogpic1+x, …
    |-- test
          |-- catpic0+x+y, catpic1+x+y, dogpic0+x+y, dogpic1+x+y, …

Thanks,
Ben
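
A split script along the lines described in the post above might look like the sketch below. It is not Ben's script: the function name `split_downloads`, the default percentages, and the assumption that the downloads folder holds one subfolder per class are all illustrative. Files are copied rather than moved, so the original downloads stay intact.

```python
import random
import shutil
from pathlib import Path


def split_downloads(src_root, dst_root, valid_pct=0.2, test_pct=0.1, seed=42):
    """Copy src_root/<class>/* into dst_root/{train,valid}/<class>/ and dst_root/test/.

    Returns the number of files copied into each split.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    counts = {"train": 0, "valid": 0, "test": 0}

    for class_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(files)
        n_valid = int(len(files) * valid_pct)
        n_test = int(len(files) * test_pct)
        parts = {
            "valid": files[:n_valid],
            "test": files[n_valid:n_valid + n_test],
            "train": files[n_valid + n_test:],
        }
        for split, members in parts.items():
            # test images share one flat folder, as in the tree above
            target = dst_root / "test" if split == "test" else dst_root / split / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for f in members:
                # prefix test files with the class name only to avoid name clashes
                name = f"{class_dir.name}_{f.name}" if split == "test" else f.name
                shutil.copy2(f, target / name)
            counts[split] += len(members)
    return counts
```

For example, `split_downloads("downloads", "data/mydata", valid_pct=0.2, test_pct=0.1)` would produce the train/valid/test layout sketched in the tree above.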

@benlove Tip: run this query and you will be amazed :slight_smile:

$ googleimagesdownload --keywords "cats,dogs" -l 1000 -ri -cd <path/to/chromedriver>

Be careful with the limit you set here: the above query can return 140k+ images (more than 70k for each keyword) if you want to build a humongous dataset. You will still have to put the images into the correct directory structure, though.

You can also use the -o argument to specify the name of the main directory, so it does not always have to be ‘downloads/’.
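
If you want to drive the command from a script, the flags mentioned in this thread can be assembled programmatically. The sketch below only builds the argument list; `build_download_command` is a hypothetical helper, and it uses only the flags that appear in the posts above (`--keywords`, `-l`, `-f`, `-o`, `-ri`, `-cd`). Check the tool's own help output for the current meaning of each flag.

```python
def build_download_command(keywords, limit=None, fmt=None, output_dir=None,
                           related_images=False, chromedriver=None):
    """Build an argument list for the googleimagesdownload command-line tool."""
    cmd = ["googleimagesdownload", "--keywords", ",".join(keywords)]
    if limit is not None:
        cmd += ["-l", str(limit)]      # maximum number of images, as in the post above
    if fmt:
        cmd += ["-f", fmt]             # image format, e.g. "jpg"
    if output_dir:
        cmd += ["-o", output_dir]      # main directory instead of 'downloads/'
    if related_images:
        cmd.append("-ri")
    if chromedriver:
        cmd += ["-cd", chromedriver]   # path to the chromedriver binary
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the tool is installed.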

Oh, @hnvasa, that’s cool.
I didn’t consider just making the downloads directory the name I wanted. Much simpler!

Hi @benlove, I have a question about the directory structure.
Does your directory structure work when running the model, or should I use a structure similar to dogscats, as shown below?

/home/ubuntu/data/dogscats/
├── models
├── sample
│   ├── models
│   ├── tmp
│   ├── train
│   │   ├── cats
│   │   └── dogs
│   └── valid
│       ├── cats
│       └── dogs
├── test
├── train
│   ├── cats
│   └── dogs
└── valid
    ├── cats
    └── dogs

Big Thanks for the answer!

Chadst
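
Before running the notebook on a new dataset, a quick sanity check of the folder layout can save some debugging. The sketch below is a hypothetical helper (`check_dataset_layout` is not part of the course library) and it assumes that train and valid should each contain the same non-empty class folders, which matches the dogscats layout shown above.

```python
from pathlib import Path


def check_dataset_layout(root):
    """Return a list of problems found in a dogscats-style folder layout."""
    root = Path(root)
    problems = []
    classes = {}

    for split in ("train", "valid"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing folder: {split}")
            continue
        classes[split] = {p.name for p in split_dir.iterdir() if p.is_dir()}
        if not classes[split]:
            problems.append(f"no class folders in {split}")
        for name in sorted(classes[split]):
            if not any(f.is_file() for f in (split_dir / name).iterdir()):
                problems.append(f"empty class folder: {split}/{name}")

    # only compare class names when both splits were found
    if len(classes) == 2 and classes["train"] != classes["valid"]:
        problems.append("train and valid have different class folders")
    return problems
```

An empty list means the layout passed these particular checks; it does not guarantee the notebook will run.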

I ran it with my structure above and it worked on most cells but not all.
When I ran it, it did create the models and tmp directories on the same level as train/valid/test though.
I haven’t investigated the errors, but I got a ValueError on
# 2. A few incorrect labels at random
plot_val_with_title(rand_by_correct(False), "Incorrectly classified")

and cells like

plot_val_with_title(most_by_correct(0, False), "Most incorrect cats")
and
plot_val_with_title(most_by_correct(1, False), "Most incorrect dogs")

resulted in <Figure size 1152x576 with 0 Axes> . So I’ll have to look into what’s up there. But I got some pretty cool results otherwise.
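
One possible explanation, offered as an assumption rather than a diagnosis: if the model makes few or no mistakes on the validation set, there may be fewer incorrect images than the plotting helper tries to sample. Sampling a fixed number of items without replacement from a smaller pool raises a ValueError, and plotting zero images would give a figure with 0 axes. The sketch below shows the general pattern with Python's standard `random` module; `sample_indices` is a hypothetical stand-in, not the notebook's actual helper.

```python
import random


def sample_indices(candidates, n=4, seed=None):
    """Pick up to n items without replacement, never more than are available."""
    candidates = list(candidates)
    k = min(n, len(candidates))  # cap the sample size at the pool size
    return random.Random(seed).sample(candidates, k)
```

With this pattern, an empty pool of incorrect predictions returns an empty list instead of raising, and the caller can skip the plot.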

Looking at that example from @reshama it looks like I need a sample folder too. Thanks!

@benlove do you know what data should be inside sample directory?

@chadst88, I can’t answer that. I don’t know if we actually need anything there for our own images. The sample directory in dogscats has train and valid directories, each with a cats and a dogs directory. There are also a couple of np array files (cached or compiled? idk). Maybe someone else can shed some light on what the sample directory is for. It looked like the video from lesson 1 was about to discuss each directory but then moved on.

The samples dir is just if you want to work on a subset of the data for some reason.

However, it’s more flexible to just use a CSV, as we do for the Planet dataset.
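
To move from a folder-per-class layout to a CSV of labels, something like the sketch below could generate the file. The function name `write_labels_csv` and the column names `file` and `label` are hypothetical; the exact CSV layout the course library expects is not stated in this thread, so adjust the columns to match the Planet example.

```python
import csv
from pathlib import Path


def write_labels_csv(image_root, csv_path):
    """Write one row per image: path relative to image_root, and its class folder name."""
    image_root = Path(image_root)
    rows = []
    for class_dir in sorted(p for p in image_root.iterdir() if p.is_dir()):
        for f in sorted(p for p in class_dir.iterdir() if p.is_file()):
            rows.append((f.relative_to(image_root).as_posix(), class_dir.name))

    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "label"])  # hypothetical header names
        writer.writerows(rows)
    return len(rows)
```

Working on a subset then becomes a matter of keeping fewer rows in the CSV rather than copying files into a separate sample directory.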

Hi @jeremy, if I want to replicate the model with different datasets, should I fill in the samples dir or just leave it blank? Thanks

Hi Ben,

Were you able to figure out why you got the errors listed above? I got similar errors too.

Thanks!