The current data loading mechanism involves having the images for each class saved separately in their own folders. But for competitions like the Kaggle Cdiscount contest which has around 5000 classes and 12 million images this is not a practical option. Would it be easy to customize the dataloader so it accepts a batch generator?
You can actually use ImageClassifierData.from_csv instead of from_paths to parse a CSV file, so you don't need to create subdirectories for each class.
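For anyone unsure what from_csv wants as input: it reads a single label file mapping each image name to its class, instead of inferring classes from folder names. A minimal sketch of that layout (the column names, file names, and category ids below are invented for illustration, and the commented fastai call is only an approximation, not the exact signature):

```python
import csv
import io

# Hypothetical labels file: one row per image, pairing the file name
# with its class label. Names and ids here are made up.
labels_csv = io.StringIO(
    "id,category_id\n"
    "product_0.jpg,1000010653\n"
    "product_1.jpg,1000004079\n"
)
rows = list(csv.DictReader(labels_csv))
print(rows[0]["category_id"])  # 1000010653

# With a real labels.csv on disk, the fastai call would look roughly like:
# data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv',
#                                     tfms=tfms_from_model(arch, sz))
```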
In this specific competition, the images are stored in a binary format inside a huge BSON dump file. You can either extract the images and save them as separate files using the subfolder-per-class strategy, or use a BSON loader directly and read the image data in batches via the mongo driver. I have used the file-based approach and so far have met no problems. If you have a lot of disk space, this option is good enough.
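If you want to stream documents out of the dump without loading it all at once, you can exploit the fact that every BSON document begins with a little-endian int32 giving its total length. A minimal sketch of that framing trick (decoding the payload into fields would still need a real BSON library such as pymongo's bson module; the bytes below are a synthetic stand-in for the dump):

```python
import io
import struct

def iter_bson_docs(stream):
    """Yield raw BSON documents from a dump, one at a time.

    Each BSON document starts with a little-endian int32 length that
    includes the prefix itself, so we can split the file into documents
    without decoding them.
    """
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            break  # end of file
        (length,) = struct.unpack("<i", prefix)
        yield prefix + stream.read(length - 4)

# Synthetic "dump": an empty BSON document is 5 bytes, the int32 length
# followed by a single 0x00 terminator. Here we concatenate two of them.
dump = io.BytesIO(b"\x05\x00\x00\x00\x00" * 2)
docs = list(iter_bson_docs(dump))
print(len(docs))  # 2
```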
I tried that, but I've been having inode issues even with more than sufficient free space. I saw a kernel on the competition using Keras' ImageDataGenerator and wondered if we could do something like that with fastai.
I wonder if a generator using bcolz arrays would help in this situation: for the training data, shuffling would probably be required, and I remember from the documentation that bcolz is excellent at accessing sequential data but not so good at random access. Has anybody tried this, or at least considered it?
Yup, you can use from_arrays() to create a data object from bcolz arrays. Take a look at the various from_* class methods in dataset.py to see how they all work - very easy to add your own (and feel free to send a PR if you think it would be useful to others).
So shuffling bcolz data arrays should bring no performance concerns? Will give it a try. Thanks!
Be sure to set chunklen=1 - then it'll work fine.
Perfect!
I think I'm doing something very wrong with that Cdiscount challenge.
After figuring out how to extract the train images from the BSON file into a dedicated 'train' directory with 5270 sub-directories (aka product categories), I used 'fish.ipynb' as a starting point, since it seemed most similar in terms of structure.
When I tried to move all images from the sub-directories into one common images directory, the command

!cp {PATH}train/*/*.jpg {PATH}images/

generated an error:

/bin/sh: 1: cp: Argument list too long
After some research (I had never dealt with such a massive number of files before), I found the os.walk() function and created a Jupyter cell:
import os
import shutil

# Walk every sub-directory and copy each image into one flat folder.
for path, subdirs, files in os.walk(r'data/cdiscount/train/'):
    for filename in files:
        full_file_name = os.path.join(path, filename)
        shutil.copy(full_file_name, 'data/cdiscount/images/')
That command has been running for 16 hours now (since 3:30 AM this morning, when the Lesson 6 broadcast started) and it seems to be only 35% done (4M files out of 12M).
I suspect os.walk() is spending more CPU resources mapping the directory structure back and forth than actually copying the files.
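For what it's worth, if the copies themselves (rather than os.walk) turn out to be the bottleneck, hard links are a much cheaper way to build a flat images/ directory, assuming source and destination are on the same filesystem: os.link only creates a new directory entry instead of duplicating the file contents. A sketch against throwaway temp directories standing in for the real paths:

```python
import os
import tempfile

# Throwaway stand-ins for data/cdiscount/train/ and data/cdiscount/images/
src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()
with open(os.path.join(src_dir, "img_0.jpg"), "wb") as f:
    f.write(b"fake jpeg bytes")

# Same walk as before, but link instead of copy - near-instant per file.
for path, subdirs, files in os.walk(src_dir):
    for filename in files:
        os.link(os.path.join(path, filename),
                os.path.join(dst_dir, filename))

print(os.listdir(dst_dir))  # ['img_0.jpg']
```

Note this fails with a cross-device error if the two directories live on different filesystems, in which case a plain copy (or rsync) is still needed.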
It's a pity not to master the basic data-wrangling skills needed to move past that technical threshold and focus on DL.
(And then that Passenger Screening Algorithm Challenge: WTF ?!?!)
E.
I googled for that error since I know I've seen discussion of this before: https://www.google.com/search?q=%22cp%3A+Argument+list+too+long%22&ie=utf-8&oe=utf-8&client=firefox-b-1-ab
For me at least, this is the first result: https://askubuntu.com/questions/217764/argument-list-too-long-when-copying-files
It suggests using find or rsync, which are both great suggestions. I'd actually suggest rsync, since it's awesome and worth learning.
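For concreteness, here is why the find variant sidesteps the error: the "Argument list too long" failure comes from the shell expanding the glob into millions of arguments and blowing past ARG_MAX, whereas find -exec ... + batches the file names itself and invokes cp as many times as needed. A sketch against throwaway directories (the real paths would be the Cdiscount train/ and images/ folders; cp -t is the GNU coreutils flag naming the target directory):

```shell
#!/bin/sh
set -e
# Throwaway stand-ins for data/cdiscount/train/ and data/cdiscount/images/
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/cat_a" "$src/cat_b"
touch "$src/cat_a/1.jpg" "$src/cat_b/2.jpg"

# find batches the matched names into as many cp invocations as needed,
# so each argument list stays under the kernel's ARG_MAX limit.
find "$src" -name '*.jpg' -exec cp -t "$dst" {} +

ls "$dst"
```

The rsync equivalent would be along the lines of rsync -a with the source sub-directories listed via find as well, since a bare glob would hit the same limit.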
HTH!