I think the issue stems from the addition of ImageFileList in the definition of from_df, which is used by from_csv. I am working with a directory that contains millions of images. When I started working with v1 for this task, I was using from_folder to create my ImageDataBunch on a small set, but when I moved to the directory with millions of images it wouldn’t even finish creating the ImageDataBunch (or at least I gave up after an hour).
I switched to from_csv, which worked much better, creating my ImageDataBunch in tens of seconds. However, when I went to run the same script tonight, my smaller test set took 10 minutes to load and my larger set didn’t finish loading within half an hour.
I imagine this was an unintentional side effect of the change. Two questions, then:

1. Is there a new recommended method for working with large sets of images?
2. If not, and this was indeed an unintended change: is there a way to modify from_df so that it does not use from_folder, since the latter is what slows things down?
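As a workaround, a sketch of the idea in question 2, using only the standard library (this is not fastai's actual API, and the column name `fn_col` is a placeholder for whatever your CSV uses): build the full image paths directly from the CSV column instead of letting anything scan the image directory. Reading a CSV with millions of rows is cheap; it is the directory walk that dominates.

```python
import csv
import os

def paths_from_csv(csv_path, folder, fn_col="name"):
    """Build full image paths from a CSV column without touching the filesystem.

    Avoids any directory scan: with millions of files, listing the image
    folder is the expensive step, while streaming the CSV is fast.
    `fn_col` is a hypothetical column name; adjust it to your CSV layout.
    """
    with open(csv_path, newline="") as f:
        return [os.path.join(folder, row[fn_col]) for row in csv.DictReader(f)]
```

The point of the sketch is simply that nothing here ever calls `os.listdir` or `glob` on the image folder, so the cost is proportional to the CSV size, not the directory size.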
Just wanted to confirm the issue. I am also working on a directory with millions of images. I had it running on a very large instance for 48 hours and it never finished. I downgraded back to the 1.0.21 release and it was able to go through the images in less than 5 minutes.
You are the second person to report this; I changed it yesterday on master.
I used numpy; I’ll compare it with your proposal to see which one is faster. I still have to dig more into why get_files (used by the from_folder method) is so slow.
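For reference, one common reason file listing degrades on huge directories is a `stat()` call per entry (e.g. `os.listdir` plus `os.path.isfile`). A minimal sketch of the alternative, using `os.scandir`, whose entries usually carry the file-type information back from the OS readdir call so no extra `stat()` is needed (this is just an illustration of the technique, not fastai's actual `get_files`):

```python
import os

def list_files_scandir(path, extensions=None):
    """List files in `path`, optionally filtered by extension.

    os.scandir yields DirEntry objects whose is_file() result is usually
    cached from the underlying readdir call, so most entries need no extra
    stat(). A listdir + os.path.isfile loop issues one stat per entry,
    which is exactly the kind of per-file overhead that dominates on
    directories with millions of files.
    """
    with os.scandir(path) as it:
        return [e.path for e in it
                if e.is_file()
                and (extensions is None
                     or os.path.splitext(e.name)[1].lower() in extensions)]
```

Timing this against the existing implementation on a directory with a few hundred thousand files should make it obvious whether per-file stats are the bottleneck.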