Recent ImageDataBunch Changes Slow Down from_csv

I think the changes in this commit, “Refactor”, really slow down the creation of an ImageDataBunch when using from_csv.

I think the issue stems from the addition of ImageFileList in the definition of from_df, which is used by from_csv. I am working with a directory that contains millions of images. When I started working with v1 for this task, I used from_folder to create my ImageDataBunch on a small set, but when I moved to the directory with millions of images it wouldn’t even create the ImageDataBunch (or at least I was too impatient after an hour).

I switched to from_csv, which worked much better, creating my ImageDataBunch in tens of seconds. However, when I ran the same script tonight, my smaller test set took 10 minutes to load and my larger set hadn’t loaded after half an hour.

I imagine this was an unintentional side effect of the change. Two questions then:

  1. Is there a new recommended method for working with large sets of images?
  2. If not, and this was an unintended change, is there a way to modify from_df so that it does not use from_folder, since the latter is what slows things down?
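
For readers who want to try the from_csv route described above, here is a minimal sketch of building the CSV once with the standard library, so the directory tree is only walked a single time. The function name `write_manifest`, the column names `name` and `label`, the extension set, and the "parent folder is the label" convention are all assumptions for illustration, not something taken from fastai.

```python
import csv
import os
from pathlib import Path

# Assumed set of image extensions; adjust to the dataset at hand.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def write_manifest(root, csv_path):
    """Walk `root` once and write one (relative path, label) row per image.

    The label is taken to be the name of the folder containing the file,
    which is an assumption about the dataset layout.
    Returns the number of image rows written.
    """
    root = Path(root)
    n_rows = 0
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "label"])  # hypothetical column names
        for dirpath, _dirnames, filenames in os.walk(root):
            label = Path(dirpath).name
            for filename in sorted(filenames):
                if Path(filename).suffix.lower() in IMAGE_EXTENSIONS:
                    rel = (Path(dirpath) / filename).relative_to(root)
                    writer.writerow([rel.as_posix(), label])
                    n_rows += 1
    return n_rows
```

The resulting file can then be reused across runs instead of rescanning the folder each time.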

This will be changed in our latest refactor that will come very soon (tomorrow I hope). I’ll also investigate why from_folder is so slow.

Just wanted to confirm the issue. I am also working on a directory with millions of images. I had it running on a very large instance for 48 hours and it never finished. I downgraded back to the 1.0.21 release and it was able to go through the images in less than 5 minutes.

Same issue here. It still occurs in version 1.0.24; going back to 1.0.21 solved it.

from_folder hasn’t been fixed yet, but from_csv and from_df should be as fast as in 1.0.21.

Hi,
The new API is so awesome!

It is still slow, though, in the function split_by_idx.
I profiled it, and the bottleneck is the list comprehension.

I wrote a version using set logic that is almost two orders of magnitude faster.

Please check my code; if you approve, I will raise a PR:

file: data_blocks.py, line 110
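
The code from this post is not preserved here, so the following is only an illustrative sketch of the general idea, not the actual fastai code or the proposed patch. It assumes the slow version tests membership against a plain list inside a list comprehension; the function names `split_by_idx_slow` and `split_by_idx_fast` are hypothetical.

```python
def split_by_idx_slow(items, valid_idx):
    """Split items into (train, valid) using list membership tests.

    `i in valid_idx` scans the list each time, so the cost grows roughly
    with len(items) * len(valid_idx).
    """
    train = [o for i, o in enumerate(items) if i not in valid_idx]
    valid = [o for i, o in enumerate(items) if i in valid_idx]
    return train, valid


def split_by_idx_fast(items, valid_idx):
    """Same split, but with the indices converted to a set first.

    Set membership is a hash lookup, so the whole split is roughly linear
    in len(items).
    """
    valid_set = set(valid_idx)
    train, valid = [], []
    for i, o in enumerate(items):
        (valid if i in valid_set else train).append(o)
    return train, valid
```

Both functions return the same result and keep the original item order; only the cost of each membership test differs.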

You are the second person to see this; I changed it yesterday on master :wink:
I used numpy, and I will compare it with your proposal to see which one is faster. I still have to dig more into why get_files (for the from_folder method) is so slow.
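
The change on master is not shown in this thread, so as a rough illustration only, a numpy-based split could look like the sketch below. It assumes the items can be held in a numpy array and that a boolean mask is used; the function name `split_by_idx_numpy` is hypothetical and this is not claimed to be what fastai does.

```python
import numpy as np


def split_by_idx_numpy(items, valid_idx):
    """Split items into (train, valid) arrays with a boolean mask.

    A mask of length len(items) is set to True at the validation indices,
    then used (and negated) to select the two subsets in one pass each.
    """
    items = np.asarray(items)
    mask = np.zeros(len(items), dtype=bool)
    mask[np.asarray(valid_idx, dtype=int)] = True
    return items[~mask], items[mask]
```

Unlike a set-based version, this returns numpy arrays rather than lists, which may or may not matter to the calling code.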

I am using release 1.0.52 and I am finding it very slow to use ObjectItemList with from_df or from_csv. It can take a couple of hours to run.

This is for 110,000 images. When doing the equivalent in v0.7, it took a few seconds. I will dig deeper but wanted to raise in case someone had already solved this.

I’m sorry, but it looks like I wasn’t using 1.0.52 but an older version. It seems to work now.

Hi,

I am trying to use ImageDataBunch.from_folder with fastai 1.0.54 and the ImageNet database. It takes about 2 hours to complete. :confused:

I am using this simple script to reproduce the time required:

from fastai import *
from fastai.vision import *

path = Path("../ImageNet/")

data = ImageDataBunch.from_folder(path, valid="val", test="test", size=224, num_workers=32)

But if I use ImageClassifierData.from_paths from fastai 0.7, it takes 2 minutes.

@jeremy Any clue on why this might be happening? Thanks! :slight_smile:
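
One way to narrow down a slowdown like this is to time the raw directory enumeration on its own, separately from any library call. The sketch below uses only the standard library; `count_files` and `timed` are hypothetical helper names, and it makes no claim about how fastai lists files internally.

```python
import os
import time


def count_files(root):
    """Count regular files under `root` with os.scandir, without recursion."""
    stack, total = [os.fspath(root)], 0
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)  # visit this subfolder later
                elif entry.is_file(follow_symlinks=False):
                    total += 1
    return total


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Calling `timed(count_files, path)` on the dataset root gives a baseline for how long a plain scan takes; if that is fast while from_folder is slow, the time is presumably being spent somewhere other than walking the directory tree.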