Create an image dataset from scratch


#1

Hello everyone,

In the first lesson of Part 1 v2, Jeremy encourages us to test the notebook on our own dataset. I know that there are some datasets already available on Kaggle, but it would certainly be nice to build our own to test our ideas and find the limits of what neural networks can and cannot achieve.

Several people have already indicated ways to do this (at least partially), and I thought it might be nice to make a dedicated thread for it, where we collect these ideas.

I’m a real beginner with very little experience, so I will try to list in detail the steps required to build an image dataset, and then reference what people on this forum have suggested for each step.

  1. Download a set of images from somewhere.
  2. Make sure they have the same extension (.jpg or .png, for instance).
  3. Make sure that they are named according to the convention of the first notebook, i.e. class.number.extension (for instance cat.14.jpg).
  4. Split them into different subsets like train, valid, and test.

Here is what I found and was able to use

  1. https://github.com/hardikvasa/google-images-download
    This Python script creates a separate folder for each keyword. I tested it and it works fine.
  2. If we use https://github.com/hardikvasa/google-images-download to download images, we get both .png and .jpg files. So far all I can do is remove the .png files, but maybe it is possible to convert .png into .jpg or the other way around.
  3. I guess it shouldn’t be that hard with some bash scripting or the right Python libraries, but I don’t know anything about it. If someone knows a tutorial on how to manipulate files and directories with Python, I would be glad to have a reference.
  4. I think that create_sample_folder presented here:
    http://forums.fast.ai/t/dogs-vs-cats-lessons-learned-share-your-experiences/1656/37
    would do the job.

If someone has a script for points 2) and 3) it would be nice to share it. And if some of you have recommendations/experience concerning the creation of an image dataset, it would of course be cool to share it too.
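For point 3), here is a minimal Python sketch using only the standard library. It assumes each class already sits in its own folder (as the download script above produces) and renames the images to class.number.extension. The function name `rename_to_convention` and the extension list are made up for this illustration; try it on a copy of your data first.

```python
from pathlib import Path

# Extensions treated as images in this sketch (an assumption; adjust as needed).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def rename_to_convention(class_dir, class_name=None):
    """Rename images in class_dir to <class_name>.<n><ext> and return the new paths.

    class_name defaults to the folder name, so a folder called "cat"
    yields cat.0.jpg, cat.1.png, and so on. File contents are untouched.
    """
    class_dir = Path(class_dir)
    class_name = class_name or class_dir.name
    images = sorted(
        p for p in class_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )

    # Pass 1: move everything to temporary names so that a final name
    # can never overwrite a file that has not been renamed yet.
    staged = []
    for n, path in enumerate(images):
        tmp = path.with_name(f"__tmp_{n}{path.suffix.lower()}")
        path.rename(tmp)
        staged.append(tmp)

    # Pass 2: assign the final class.number.extension names.
    renamed = []
    for n, tmp in enumerate(staged):
        target = tmp.with_name(f"{class_name}.{n}{tmp.suffix}")
        tmp.rename(target)
        renamed.append(target)
    return renamed
```

Calling `rename_to_convention("downloads/cat")` would rename every image in that folder in place. Non-image files are left alone.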

Best


(Aditya) #2
  • If you are on Windows, navigate to the directory that holds your .png files and run the following command in cmd: ren *.* *.jpg (warning: it will change the extension of every file to .jpg, so make sure you are in the correct place or have a copy of all the files), or the safer version: ren *.png *.jpg

  • If you are on Ubuntu, then type something like rename 's/\.png$/.jpg/' *.png (not quite sure of the exact syntax, since it depends on which rename is installed), but you can surely do man rename

We can interchange *.png and *.jpg; it will not cause any problems…

  • this repository does the re-naming into 1.jpg, 2.jpg … etc

(Aditya) #3

In Bash (something equivalent to this)

for f in *.png; do
    # strip the .png suffix from "$f" and append .jpg (this only renames; the image data is unchanged)
    mv "$f" "${f%.png}.jpg"
done

@abercher


(Asif Imran) #4

I doubt renaming files from *.png to *.jpg actually does any conversion (at least via mv) — png and jpg are two very different image formats. A handy-dandy command-line utility for manipulating images is imagemagick. You can use apt-get on linux or brew install on osx to install it on your system. Afterwards, you can batch convert like so:

for i in *.png ; do convert "$i" "${i%.*}.jpg" ; done

where convert is part of the imagemagick toolbox. You will still want to check a couple of images by hand to verify that the conversion went through as expected (sometimes PNGs with a transparent background can confuse imagemagick; google it if you are stuck).
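To see whether a rename actually changed anything, one option is to look at the first bytes of each file, since PNG, JPEG, and BMP files each start with a fixed signature. Here is a small standard-library sketch. The helper names `sniff_format` and `find_mislabeled` are hypothetical, and it only knows these three formats.

```python
from pathlib import Path

# Leading "magic" bytes that identify a format regardless of the file name.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"BM": "bmp",
}


def sniff_format(path):
    """Return "png", "jpg", or "bmp" based on file content, or None if unknown."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def find_mislabeled(folder):
    """List (file name, actual format) for files whose extension disagrees with their content."""
    mismatches = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        actual = sniff_format(path)
        claimed = path.suffix.lower().lstrip(".")
        if claimed == "jpeg":
            claimed = "jpg"  # treat .jpeg and .jpg as the same format
        if actual is not None and actual != claimed:
            mismatches.append((path.name, actual))
    return mismatches
```

A PNG that was only renamed to .jpg with mv or ren will show up in the list returned by `find_mislabeled`, while a file converted with imagemagick should not.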


(Jeremy Howard) #5

Thanks for creating this thread! Just to clarify - the names aren’t important really. What matters is the name of the directory that they’re in.
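To illustrate the idea (this is only a sketch of the principle, not the library's actual loading code), the label of each image can be read from the name of the folder that contains it, so the file names themselves can be anything. The function name `labeled_images` and the extension list are assumptions for this example.

```python
from pathlib import Path


def labeled_images(split_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Return (path, label) pairs where the label is the parent folder's name.

    split_dir is a folder such as train/ or valid/ holding one subfolder per class.
    """
    pairs = []
    class_dirs = sorted(p for p in Path(split_dir).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        for img in sorted(class_dir.iterdir()):
            if img.is_file() and img.suffix.lower() in extensions:
                # The file name plays no role; only the folder name does.
                pairs.append((img, class_dir.name))
    return pairs
```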


#6

Thanks a lot for your answer Jeremy!

I didn’t realize this part. It makes life simpler!

And thank you for all this amazing material and support!


(Benjamin DeKoven) #7

@jeremy
Will BMP formats for the images be OK? Thank you for the feedback.


(Shivam Goel) #8

I created my own cats and dogs validation dataset by scraping some dog and cat photos from http://www.catbreedslist.com. It has high-definition photos of 65 breeds of cats and 369 breeds of dogs.

Though the file names were different from the standard, it worked just fine, as Jeremy mentioned above. You do need to maintain the folder structure, though.

It gave me 100% accuracy on the already trained model. One difficulty I faced was that I couldn’t find where to specify the location of the new validation dataset. I had to rename it “valid” and rename the old “valid” to something else.


#9

I created a Pinterest scraper a while ago which will download all the images from a Pinterest board or a list of boards. It hasn’t been maintained in over a year so use at your own risk (and as of this writing, only supports Python 2.7 but I plan to update it once I get to that part in this lesson.) You’ll also need to install selenium for web scraping and a webdriver for Chrome.

Or you can create your own scrapers: http://automatetheboringstuff.com/chapter11/


(Jeremy Howard) #10

That looks like a rather cool book!


#11

Yep, that was the book I used to teach myself Python… and now I’m ready to learn how to use Deep Learning to further automate the boring stuff.


(Hardik Vasa) #12

You can now download images in a specific format using the above GitHub repository:

hardikvasa/google-images-download

$ googleimagesdownload -k <keyword> -f jpg


(Jeremy Howard) #14

Terrific! Do you have a twitter handle? Would love to share this project.


(Hardik Vasa) #15

Thanks Jeremy :slight_smile:

I do not have an active Twitter handle but it would be great if you could share this project.

I am adding new features to this repo every week and would love to hear what common features folks on this forum need. That way I can plan and integrate those features into the repo.


(Jeremy Howard) #16

Are you open to creating one? It’s the best way I have to credit people’s work. It’s also where nearly all my favorite deep learning practitioners and researchers discuss their work.

(Obviously it’s entirely up to you - just wanted to let you know my thinking.)


(Hardik Vasa) #17

Hi @jeremy , sure thing!

re-activated my handle from last year… @hnvasa15 it is :slight_smile:

Thanks!


(Ben Love) #18

I’m halfway through creating a python script to take your downloads from google_images_download and split them by whatever percentages you want.

Before I finish, I just realized :worried: I should make sure what we want is a directory structure like in dogscats/.

dogscats
    |-- train
          |-- cats
                |-- catpic0, catpic1, …
          |-- dogs
                |-- dogpic0, dogpic1, …
    |-- valid
          |-- cats
                |-- catpic0+x, catpic1+x, …
          |-- dogs
                |-- dogpic0+x, dogpic1+x, …
    |-- test
          |-- catpic0+x+y, catpic1+x+y, dogpic0+x+y, dogpic1+x+y, …
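A rough sketch of such a split script, assuming the source folder holds one subfolder per keyword (for example downloads/cats and downloads/dogs) and that the target is the layout in the tree above. The function name `split_dataset`, the default fractions, and the class-name prefix on copied files are choices made for this illustration, not part of any existing tool.

```python
import random
import shutil
from pathlib import Path


def split_dataset(src_root, dst_root, valid_frac=0.2, test_frac=0.1, seed=42):
    """Copy images from src_root/<class>/ into dst_root/{train,valid,test}.

    train/ and valid/ get one subfolder per class; test/ is flat.
    Returns the number of files per class and split.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    counts = {}
    for class_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(files)
        n_valid = int(len(files) * valid_frac)
        n_test = int(len(files) * test_frac)
        groups = {
            "valid": files[:n_valid],
            "test": files[n_valid:n_valid + n_test],
            "train": files[n_valid + n_test:],
        }
        for split, members in groups.items():
            if split == "test":
                out_dir = dst_root / split  # flat, no class subfolders
            else:
                out_dir = dst_root / split / class_dir.name
            out_dir.mkdir(parents=True, exist_ok=True)
            for f in members:
                # Prefix with the class name so files from different
                # classes cannot collide in the flat test/ folder.
                shutil.copy2(f, out_dir / f"{class_dir.name}_{f.name}")
        counts[class_dir.name] = {k: len(v) for k, v in groups.items()}
    return counts
```

For example, `split_dataset("downloads", "dogscats", valid_frac=0.2, test_frac=0.1)` copies rather than moves the files, so the original downloads stay intact.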

Thanks,
Ben


(Hardik Vasa) #19

@benlove Tip: run this query and you will be amazed :slight_smile:

$ googleimagesdownload --keywords "cats,dogs" -l 1000 -ri -cd <path/to/chromedriver>

Be careful with the limit you set here: the above query can go up to 140k+ images (more than 70k each) if you want to build a humongous dataset. You will still have to put the images into the correct directory structure, though.

You can also use the -o argument to specify the name of the main directory. So it does not always have to be ‘downloads/’


(Ben Love) #20

Oh, @hnvasa, that’s cool.
I didn’t consider just making the downloads directory the name I wanted. Much simpler!


(Ronnyronay) #21

Hi @benlove, I have a question regarding the directory structure.
Does your directory structure work when running the model, or should I use a structure similar to dogscats, as shown below?

/home/ubuntu/data/dogscats/
├── models
├── sample
│   ├── models
│   ├── tmp
│   ├── train
│   │   ├── cats
│   │   └── dogs
│   └── valid
│       ├── cats
│       └── dogs
├── test
├── train
│   ├── cats
│   └── dogs
└── valid
    ├── cats
    └── dogs
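One way to sanity-check a layout like the one above before training is a small script that confirms train/ and valid/ both exist and contain the same class folders. This is only a sketch. The function name `check_layout` is made up, and it makes no claim about which folders the course library itself needs beyond what was said earlier in this thread.

```python
from pathlib import Path


def check_layout(root, splits=("train", "valid")):
    """Return a list of problems found in a dogscats-style dataset folder.

    An empty list means every split exists, has at least one class folder,
    and all splits contain the same set of class folders.
    """
    root = Path(root)
    problems = []
    classes_by_split = {}
    for split in splits:
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split}")
            continue
        classes = {p.name for p in split_dir.iterdir() if p.is_dir()}
        if not classes:
            problems.append(f"no class folders in: {split}")
        classes_by_split[split] = classes

    # Compare class folders only when every split was found.
    if len(classes_by_split) == len(splits):
        reference = classes_by_split[splits[0]]
        for split in splits[1:]:
            if classes_by_split[split] != reference:
                problems.append(
                    f"class folders differ between {splits[0]} and {split}"
                )
    return problems
```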

Big Thanks for the answer!

Chadst