Reading in lots of files on Linux - performance considerations

Situation: several million directories (representing class labels), each containing small images

On AWS:
Because of the per-second IOPS limit, iterating over the dataset takes forever and IO performance is the bottleneck. I found that organizing the data with PyTables really helped, and I experimented with chunk sizes, etc.
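For illustration, a minimal sketch of what "organizing data with PyTables" could look like, assuming images are already decoded to a fixed shape; the file name, shape, chunk size, and the `iter_image_batches` loader are all hypothetical placeholders:

```python
# Hypothetical sketch: packing small images into one PyTables HDF5 file,
# assuming images are already decoded/resized to a fixed (64, 64, 3) shape.
import numpy as np
import tables

img_shape = (64, 64, 3)  # assumed fixed image shape

with tables.open_file("images.h5", mode="w") as h5:
    # EArray grows along the first axis; chunkshape controls how many
    # images are stored (and read back) per on-disk chunk.
    images = h5.create_earray(
        h5.root, "images",
        atom=tables.UInt8Atom(),
        shape=(0,) + img_shape,
        chunkshape=(256,) + img_shape,  # tune this: larger chunks = fewer IO ops
        filters=tables.Filters(complevel=1, complib="blosc"),
    )
    labels = h5.create_earray(
        h5.root, "labels", atom=tables.Int32Atom(), shape=(0,)
    )

    for batch in iter_image_batches(batch_size=256):  # hypothetical loader
        images.append(batch["pixels"])   # (256, 64, 64, 3) uint8 array
        labels.append(batch["labels"])   # (256,) int32 array
```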

Question:
What does performance look like on a regular box with a local SSD / HDD? That is, how big is the penalty for reading small files scattered across many directories vs reading small files from a single directory? Can I expect a significant improvement if I read the data from HDF5 files instead?

The answer here is probably to start with the simplest solution and see whether performance is adequate, but I was curious whether anyone has experience with this, or any theoretical knowledge they would be willing to share :slight_smile:

In general - does the number of files, or reading from many different directories, visibly impact read speed on an ext4 filesystem?
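If it helps, "go and see" could be as simple as timing random small-file reads over the directory tree and comparing against a bulk HDF5 read of the same data. A rough sketch (the root path is a placeholder; for cold-cache numbers you would also want to drop the page cache between runs):

```python
# Hypothetical sketch: time reading n_samples random small files scattered
# across many directories, to compare against a single bulk HDF5 read.
import os
import random
import time

def time_small_file_reads(root_dir, n_samples=10_000):
    """Read n_samples random files under root_dir and report throughput."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root_dir)
        for name in names
    ]
    sample = random.sample(paths, min(n_samples, len(paths)))

    start = time.perf_counter()
    total_bytes = 0
    for path in sample:
        with open(path, "rb") as f:
            total_bytes += len(f.read())
    elapsed = time.perf_counter() - start
    print(f"{len(sample)} files, {total_bytes / 1e6:.1f} MB "
          f"in {elapsed:.2f}s ({len(sample) / elapsed:.0f} files/s)")

# time_small_file_reads("/data/images")  # hypothetical path
```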

Sequential reading is significantly faster. I don’t have numbers at hand, but it is on the order of a magnitude faster than random seeks. You can expect a performance gain if you read batches of images from HDF5.
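A minimal sketch of reading batches that way, assuming a file laid out like the earlier PyTables example (file name, array names, and batch size are all assumptions):

```python
# Hypothetical sketch: reading contiguous batches from an HDF5 file, which
# turns many small random reads into a few larger, mostly-sequential ones.
import tables

with tables.open_file("images.h5", mode="r") as h5:
    images = h5.root.images
    labels = h5.root.labels
    batch_size = 256
    for start in range(0, images.nrows, batch_size):
        x = images[start:start + batch_size]  # one chunk-aligned bulk read
        y = labels[start:start + batch_size]
        # ... feed (x, y) to the training loop ...
```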
