In the first lesson of Part 1 v2, Jeremy encourages us to test the notebook on our own dataset. I know that some datasets already exist on Kaggle, but it would certainly be nice to construct our own to test our ideas and find the limits of what neural networks can and cannot achieve.
Several people have already indicated ways to do this (at least partially), and I thought it might be nice to make a dedicated thread for it, where we gather these ideas.
I’m a real beginner with very little experience, so I will try to make a detailed list of the steps required to get an image dataset, and then reference what people on this forum have mentioned for each step.
1) Download a set of images from somewhere.
2) Make sure they all have the same extension (.jpg or .png, for instance).
3) Make sure that they are named according to the convention of the first notebook, i.e. class.number.extension (for instance cat.14.jpg).
4) Split them into different subsets such as train, valid, and test.
Here is what I found and was able to use:
This Python script creates a different folder for each keyword. I tested it and it works fine.
- If we use https://github.com/hardikvasa/google-images-download to download images, we get both .png and .jpg files. So far all I can do is remove all the .png files, but maybe it is possible to convert .png into .jpg or the other way around.
- I guess it shouldn’t be that hard with some bash scripting or the right Python libraries, but I don’t know anything about it. If someone knows a tutorial for learning how to manipulate files and directories with Python, I would be glad to have a reference.
- I think that create_sample_folder presented here:
would do the job.
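For the splitting step, something along these lines might also work. This is only a minimal sketch using the standard library, not the create_sample_folder function mentioned above. The name split_dataset, the default fractions, and the train/valid/test layout with one sub-folder per class are all my own assumptions; it expects a flat folder of files named class.number.extension.

```python
import random
import shutil
from pathlib import Path


def split_dataset(src_dir, dest_dir, valid_frac=0.2, test_frac=0.1, seed=42):
    """Copy files named class.number.ext into train/valid/test sub-folders.

    Hypothetical helper: the folder layout (dest/subset/class/) is an assumption.
    """
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)

    # Group the files by class label, i.e. the part before the first dot.
    by_class = {}
    for path in sorted(src_dir.iterdir()):
        if path.is_file():
            label = path.name.split(".")[0]
            by_class.setdefault(label, []).append(path)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    counts = {"train": 0, "valid": 0, "test": 0}
    for label, files in by_class.items():
        rng.shuffle(files)
        n_valid = int(len(files) * valid_frac)
        n_test = int(len(files) * test_frac)
        subsets = {
            "valid": files[:n_valid],
            "test": files[n_valid:n_valid + n_test],
            "train": files[n_valid + n_test:],
        }
        for subset, members in subsets.items():
            target_dir = dest_dir / subset / label
            target_dir.mkdir(parents=True, exist_ok=True)
            for path in members:
                # Copy instead of move, so the original folder stays intact.
                shutil.copy2(path, target_dir / path.name)
            counts[subset] += len(members)
    return counts
```

Calling split_dataset("all_images", "data") would then fill data/train, data/valid and data/test. Splitting each class separately keeps the class proportions roughly the same in every subset.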
If someone has a script for points 2) and 3) it would be nice to share it. And if some of you have recommendations/experience concerning the creation of an image dataset, it would of course be cool to share it too.
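As a starting point for points 2) and 3), here is a rough sketch of how it could be done in Python. It assumes one folder per class (as produced by a per-keyword download), converts .png files to .jpg with the Pillow library, and writes everything into a single folder as class.number.jpg. The function name normalize_images and the choice to replace spaces in folder names with underscores are my own assumptions.

```python
import shutil
from pathlib import Path

JPEG_SUFFIXES = {".jpg", ".jpeg"}
CONVERTIBLE_SUFFIXES = {".png"}


def normalize_images(src_dir, dest_dir):
    """Copy src_dir/<class>/* to dest_dir/<class>.<number>.jpg (hypothetical helper)."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for class_dir in sorted(p for p in src_dir.iterdir() if p.is_dir()):
        # Folder names may contain spaces (e.g. a search keyword),
        # so make the label safe to use inside a file name.
        label = class_dir.name.replace(" ", "_")
        number = 0
        for path in sorted(class_dir.iterdir()):
            suffix = path.suffix.lower()
            if not path.is_file() or suffix not in JPEG_SUFFIXES | CONVERTIBLE_SUFFIXES:
                continue  # skip anything that is not a known image type
            target = dest_dir / f"{label}.{number}.jpg"
            if suffix in JPEG_SUFFIXES:
                shutil.copy2(path, target)
            else:
                # Pillow (pip install Pillow) is only needed for the conversion,
                # so it is imported here rather than at the top of the file.
                from PIL import Image
                with Image.open(path) as img:
                    # JPEG has no alpha channel, so convert to RGB first.
                    img.convert("RGB").save(target, "JPEG")
            written.append(target)
            number += 1
    return written
```

For example, normalize_images("downloads", "all_images") would turn downloads/cat/whatever.png into all_images/cat.0.jpg. The originals are left untouched, so it is easy to rerun if something goes wrong.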