Recent ImageDataBunch Changes Slow Down from_csv

Tchotchke · November 12, 2018, 2:45am

I think the changes in this commit, “Refactor”, really slow down the creation of an ImageDataBunch when using from_csv.

I think the issue stems from the addition of ImageFileList in the definition of from_df, which is used by from_csv. I am working with a directory that contains millions of images. When I started working with v1 for this task, I used from_folder to create my ImageDataBunch on a small set, but when I moved to the directory with millions of images it wouldn’t even create the ImageDataBunch (or at least I ran out of patience after an hour).

I switched to from_csv, which worked much better, creating my ImageDataBunch in tens of seconds. However, when I ran the same script tonight, my smaller test set took 10 minutes to load and my larger set hadn’t loaded after half an hour.

I imagine this was an unintentional side effect of the change. Two questions then:

Is there a new recommended method for working with large sets of images?
If not, and this was an unintended change, is there a way to modify from_df so that it does not use from_folder, since the latter is what slows things down?
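
To illustrate the difference between the two strategies, here is a minimal sketch that does not use fastai at all. It assumes a CSV with hypothetical columns `name` and `label`. The function names are made up for this example and are not fastai's internals. The CSV route only joins paths and never touches the disk, while the scan route has to visit every entry under the root.

```python
import csv
import io
from pathlib import Path

def items_from_csv(root, csv_text, fn_col="name", label_col="label"):
    """Build (path, label) pairs from CSV rows; no directory listing needed.
    Column names are assumptions for this sketch."""
    root = Path(root)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(root / row[fn_col], row[label_col]) for row in reader]

def items_from_scan(root, extensions=(".jpg", ".png")):
    """Recursively list image files; cost grows with every entry on disk."""
    root = Path(root)
    return [p for p in root.rglob("*") if p.suffix.lower() in extensions]
```

With millions of files, any code path that calls something like `items_from_scan` before consulting the CSV pays the full scan cost even though the CSV already names every file.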

sgugger · November 12, 2018, 3:52am

This will be changed in our latest refactor that will come very soon (tomorrow I hope). I’ll also investigate why from_folder is so slow.

sariabod · November 13, 2018, 8:32pm

Just wanted to confirm the issue. I am also working on a directory with millions of images. I had it running on a very large instance for 48 hours and it never finished. I downgraded to the 1.0.21 release and it was able to go through the images in less than 5 minutes.

nextM · November 14, 2018, 10:10am

Same issue here. It still occurs in version 1.0.24; going back to 1.0.21 solved it.

sgugger · November 14, 2018, 2:23pm

The from_folder hasn’t been fixed, but from_csv or from_df should be as fast as in 1.0.21.

nextM · November 16, 2018, 1:43pm

Hi,
The new API is so awesome!

It is still slow, though, in the function split_by_idx.
I profiled it, and the bottleneck is the list comprehension.

I rewrote it using set logic, which is almost 2 orders of magnitude faster.

Please check my code; if you approve, I will open a PR:

file: data_blocks.py, line 110

https://gist.github.com/dragonflowerai/f2adef1b40a789ff83cbf378ba9020b3

def split_by_idx(self, valid_idx:Collection[int])->'ItemLists':
    "Split the data according to the indexes in `valid_idx`."
    train_idx = list(set(np.arange(len(self.items)))-set(valid_idx))
    return self.split_by_idxs(train_idx, valid_idx)
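
For comparison, here is a standalone sketch of the three approaches discussed here. The original list comprehension is not shown in the thread, so `train_idx_listcomp` is an assumed reconstruction. `train_idx_numpy` is one possible numpy approach using `np.setdiff1d`, not necessarily what was committed to master. All function names are hypothetical.

```python
import numpy as np

def train_idx_listcomp(n_items, valid_idx):
    # Assumed shape of the slow version: when valid_idx is a list,
    # each `not in` test scans it, so the cost is O(n_items * len(valid_idx)).
    return [i for i in range(n_items) if i not in valid_idx]

def train_idx_set(n_items, valid_idx):
    # Set difference is roughly linear; sorted() restores ascending order,
    # which a plain list(set(...)) does not guarantee.
    return sorted(set(range(n_items)) - set(valid_idx))

def train_idx_numpy(n_items, valid_idx):
    # np.setdiff1d returns the sorted unique values in the first array
    # that are not in the second.
    return np.setdiff1d(np.arange(n_items), valid_idx)
```

All three return the same indices for the same input. The difference is in how the cost scales once `valid_idx` holds hundreds of thousands of entries.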

sgugger · November 16, 2018, 2:03pm

You are the second person to spot this; I changed it yesterday on master.
I used numpy and will compare it with your proposal to see which one is faster. I still have to dig more into why get_files (for the from_folder method) is so slow.

tombishop · May 8, 2019, 3:49pm

I am using release 1.0.52 and I am finding it very slow to use ObjectItemList with from_df or from_csv. It can take a couple of hours to run.

This is for 110,000 images. When doing the equivalent in v0.7, it took a few seconds. I will dig deeper but wanted to raise it in case someone has already solved this.

tombishop · May 9, 2019, 2:21pm

I’m sorry, but it looks like I wasn’t using 1.0.52 but an older version. It seems to work now.

yashbhalgat · July 3, 2019, 4:58pm

Hi,

I am trying to use ImageDataBunch.from_folder with fastai 1.0.54 and the ImageNet database. It takes about 2 hours to complete.

I am using this simple script to reproduce the time required:

from fastai import *
from fastai.vision import *

path = Path("../ImageNet/")

data = ImageDataBunch.from_folder(path, valid="val", test="test", size=224, num_workers=32)

But if I use ImageClassifierData.from_paths from fastai 0.7, it takes 2 minutes.

@jeremy Any clue on why this might be happening? Thanks!
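
One way to narrow this down is to time a bare directory walk over the same path. This is only a diagnostic sketch: it shows how long listing the files takes on its own, independent of fastai. The helper names are hypothetical.

```python
import os
import time

def count_files(root):
    """Count files under root with os.walk, without opening any of them."""
    n = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        n += len(filenames)
    return n

def time_scan(root):
    """Return (file_count, seconds) for a plain recursive listing of root."""
    start = time.perf_counter()
    n = count_files(root)
    return n, time.perf_counter() - start
```

If `time_scan("../ImageNet/")` finishes in a minute or two while from_folder takes hours, the extra time is being spent in the library's own processing rather than in the filesystem.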