In the first lesson of Part 1 v2, Jeremy encourages us to test the notebook on our own dataset. I know that some datasets already exist on Kaggle, but it would certainly be nice to construct our personal ones to test our own ideas and find the limits of what neural networks can and cannot achieve.
Several people have already indicated ways to do this (at least partially), and I thought it might be nice to make a special thread for it, where we gather these ideas.
I'm a real beginner with very little experience, so I will try to make a detailed list of the steps required to get an image dataset, and then reference what people mentioned on this forum to do it.
1) Download a set of images from somewhere.
2) Make sure they have the same extension (.jpg or .png, for instance).
3) Make sure they are named according to the convention of the first notebook, i.e. class.number.extension (for instance cat.14.jpg).
4) Split them into different subsets like train, valid, and test.
If we use https://github.com/hardikvasa/google-images-download to download images, we get both .png and .jpg files. So far all I can do is remove the .png files, but maybe it is possible to convert .png into .jpg or the other way around.
I guess it shouldn't be that hard with some bash scripting or the right Python libraries, but I don't know anything about it. If someone knows a tutorial on how to manipulate files and directories with Python, I would be glad to have a reference.
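For what it's worth, Python's standard library (pathlib and shutil) covers most basic file and directory manipulation without any bash. A minimal sketch, with made-up file and folder names:

```python
from pathlib import Path
import shutil

data = Path("data")  # hypothetical working directory
data.mkdir(exist_ok=True)

# create a dummy file, then list, rename, and move it
(data / "cat_photo.png").touch()
pngs = sorted(data.glob("*.png"))       # find files by extension
renamed = pngs[0].with_suffix(".bak")   # build a new name from the old one
pngs[0].rename(renamed)                 # rename in place
(data / "train").mkdir(exist_ok=True)
shutil.move(str(renamed), str(data / "train"))  # move into a subdirectory
```

The official pathlib tutorial in the Python docs is a good starting reference for this kind of thing.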
If someone has a script for points 2) and 3), it would be nice to share it. And if some of you have recommendations or experience concerning the creation of an image dataset, it would of course be cool to share that too.
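Not a polished script, but here is a rough sketch of points 2) and 3) in Python. It assumes all the images for one class sit in a single folder and are already .jpg; the paths, class name, and split ratio below are made up:

```python
import random
import shutil
from pathlib import Path

def rename_and_split(src, cls, out, valid_frac=0.2, seed=42):
    """Copy images renamed to cls.N.jpg, split into train/ and valid/ subfolders."""
    files = sorted(Path(src).glob("*.jpg"))
    random.Random(seed).shuffle(files)       # fixed seed for a reproducible split
    n_valid = int(len(files) * valid_frac)
    for i, f in enumerate(files):
        subset = "valid" if i < n_valid else "train"
        dest = Path(out) / subset / cls
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy(f, dest / f"{cls}.{i}.jpg")

# usage (hypothetical paths):
# rename_and_split("downloads/cats", "cat", "data")
```

Copying rather than moving keeps the originals intact in case the split goes wrong.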
If you are on Windows, navigate to the directory containing your .png files and run the following command in cmd: ren *.* *.jpg (warning: it will rename ALL files in the directory to .jpg, so make sure you are in the correct place or have a copy of all the files), or the safer version ren *.png *.jpg.
If you are on Ubuntu, try rename .png .jpg *.png (not quite sure of the exact syntax, it differs between the util-linux and Perl versions of the tool), but you can surely do man rename.
We can interchange *.png to *.jpg, it will not cause any problems…
This repository does the renaming into 1.jpg, 2.jpg … etc.
I doubt renaming files from *.png to *.jpg actually does any conversion (at least via mv) — png and jpg are two very different image formats. A handy-dandy command-line utility for manipulating images is ImageMagick. You can use apt-get on Linux or brew install on macOS to install it on your system. Afterwards, you can batch convert like so:
for i in *.png ; do convert "$i" "${i%.*}.jpg" ; done
where convert is part of the ImageMagick toolbox. You will still want to verify a couple of images by hand to check that the conversion went through as expected (sometimes, pngs with transparent backgrounds can confuse ImageMagick — google if you are stuck).
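If you'd rather stay in Python, Pillow can do the same conversion and lets you handle transparent backgrounds explicitly. A sketch, assuming you want any alpha channel flattened onto white (pick whatever background suits your data):

```python
from pathlib import Path
from PIL import Image

def png_to_jpg(path):
    """Convert one .png to .jpg, compositing any alpha channel onto white."""
    img = Image.open(path)
    if img.mode in ("RGBA", "LA", "P"):
        img = img.convert("RGBA")
        background = Image.new("RGB", img.size, (255, 255, 255))
        background.paste(img, mask=img.split()[-1])  # use alpha channel as mask
        img = background
    else:
        img = img.convert("RGB")  # JPEG cannot store alpha
    img.save(Path(path).with_suffix(".jpg"), "JPEG")

# convert every .png in the current directory
for p in Path(".").glob("*.png"):
    png_to_jpg(p)
```

As with ImageMagick, spot-check a few outputs by eye, especially images that had transparency.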
I created my own cats and dogs validation dataset by scraping some dog and cat photos from http://www.catbreedslist.com. It has high-definition photos of 65 breeds of cats and 369 breeds of dogs.
Though the file names were different from the standard, it worked just fine, as Jeremy has mentioned above. You do need to maintain the folder structure, though.
It gave me 100% accuracy on the already trained model. One difficulty I faced was that I couldn't find where to specify the location of the new validation dataset, so I had to rename it "valid" and change the old "valid" to something else.
I created a Pinterest scraper a while ago which will download all the images from a Pinterest board or a list of boards. It hasn’t been maintained in over a year so use at your own risk (and as of this writing, only supports Python 2.7 but I plan to update it once I get to that part in this lesson.) You’ll also need to install selenium for web scraping and a webdriver for Chrome.
I do not have an active Twitter handle but it would be great if you could share this project.
I am adding new features to this repo every week and would love to hear what common features folks on this forum need. That way I can plan and integrate those features into the repo.
Are you open to creating one? It’s the best way I have to credit people’s work. It’s also where nearly all my favorite deep learning practitioners and researchers discuss their work.
(Obviously it’s entirely up to you - just wanted to let you know my thinking.)
Beware of the limit you set here, because the above query can return 140k+ images (more than 70k each) if you want to build a humongous dataset. You will still have to put them in the correct directory structure, though.
You can also use the -o argument to specify the name of the main directory. So it does not always have to be ‘downloads/’
Hi @benlove , I have questions regarding directory structure.
Does your directory structure work when running the model, or should I use a similar structure as in dogscats, as shown below:
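For reference, if I remember correctly, the dogscats folder from lesson 1 is laid out roughly like this (the test1/ and sample/ folders are optional, and models/ is created during training):

```
dogscats/
├── train/
│   ├── cats/
│   └── dogs/
├── valid/
│   ├── cats/
│   └── dogs/
├── test1/
├── sample/
└── models/
```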