How to handle massive (5 million) image libraries

Hi folks, I hope week two is treating you well.

I’m one of the many people who are playing with the Quickdraw Doodle Recognition Kaggle competition and I’ve hit a bit of a snag.

So here I am with a folder of 5 million image files (10% of the available dataset; 12 hours to generate at 128×128). Step one of the fastai code is `fnames = get_image_files(path_img)`, and this is where I fall down. After it had run for six hours I gave up; it's pretty obvious that the problem is directly related to the sheer volume of images.

Is there any way to handle such a massive image library?
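One workaround (not fastai's own API, just a sketch) is to walk the directory lazily with the standard library's `os.scandir`, which avoids stat-ing every file the way heavier listing helpers can. The extension set and directory layout here are assumptions:

```python
import os

def fast_image_files(root, exts={'.png', '.jpg', '.jpeg'}):
    """Yield image paths lazily via os.scandir instead of
    building one giant list up front."""
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    # descend into class subfolders
                    stack.append(entry.path)
                elif os.path.splitext(entry.name)[1].lower() in exts:
                    yield entry.path
```

Because it's a generator, you can start consuming filenames (or write them to a CSV once) without waiting for the whole 5-million-file scan to finish.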


OK, a bit of an update to my own question.

The point of `get_image_files` is to pass a list of filenames to `ImageDataBunch`, which then uses the selected function (in this case `from_name_re`, which uses a regex) to get a list of classes.
In the documentation for `ImageDataBunch` there are a bunch of different ways to provide this list including, dun du du dah du dah (those are trumpets), manually! I can just create a list of my 340 classes and pass it directly to `ImageDataBunch`. The theory is that I don't need to run a function over 5 million files before we can start batching.

So, I'll knock up a list of classes, pass it to `ImageDataBunch.from_lists(blah, blah, blah)` and see how we do.
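For what it's worth, `from_lists` (fastai v1) wants two parallel lists: filenames and their labels. A minimal sketch of building those lists, assuming a hypothetical filename pattern like `cat_00001.png` (swap in whatever your generator actually produced):

```python
import re

# Hypothetical filename pattern: '<class>_<id>.png', e.g. 'cat_00001.png'
pat = re.compile(r'^(.+)_\d+\.png$')

def label_from_name(fname):
    """Pull the class label out of a filename matching the pattern above."""
    return pat.match(fname).group(1)

fnames = ['cat_00001.png', 'dog_00002.png', 'cat_00003.png']
labels = [label_from_name(f) for f in fnames]

# These parallel lists are what from_lists expects, alongside the root path:
# data = ImageDataBunch.from_lists(path_img, fnames, labels=labels, size=128)
```

The regex work happens once, in plain Python, on strings you already have, rather than being re-run against the filesystem.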


You can format your post by using backticks to prettify the lines of code.

OK then, so lists didn't work. I won't go into it, but basically I misunderstood what `ImageDataBunch` is trying to do. This has turned into a bit of a blog post now, but I'm still living in hope that my knight in shining armour will ride in and fix my problem.
My next step will be to try to create a list of filenames and make a dataframe…
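If the dataframe route works out, the shape is simple: one column of filenames, one of labels. A minimal sketch with stand-in data (column names are my own choice, not prescribed):

```python
import pandas as pd

# Stand-in lists; in practice these come from the one-off directory scan
fnames = ['cat_00001.png', 'dog_00002.png']
labels = ['cat', 'dog']

df = pd.DataFrame({'name': fnames, 'label': labels})

# fastai v1 can consume such a frame directly:
# data = ImageDataBunch.from_df(path_img, df, size=128)
```

A nice side effect is that the frame can be saved with `df.to_csv(...)` once, so the expensive 5-million-file scan never has to be repeated.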