Loading ~ 0.8 million text files from 14 folders into pandas in 45s on a mechanical hard drive

I know this may not be an issue for a lot of you with SSDs, but I don’t have much experience dealing with many files on a mechanical hard drive, and after hours of frustration at not being able to get the files loaded quickly, I found this SO answer helpful.

I wanted to prepare 836k files from 14 directories to build a language model (following lesson 10). I played around with a few methods, but after Jeremy’s advice I realized fastai.text doesn’t require your data to sit in a rigid directory structure (unlike torchtext), so I gave up on moving files around (which took forever on a mechanical drive) and decided to load everything into memory (a pretty obvious option up front, although speeds still vary quite a bit depending on how you do it).

This method, building a list of dicts and constructing a pandas DataFrame from it, gave me roughly a 1000x performance boost (44.5 s ± 155 ms for all the files) compared with the get_texts() method used for the imdb dataset in lesson 10 (which took 20+ minutes for 10k files in 2 directories; I’m not sure why it was so slow). Also, calling np.array() on the big list threw an error due to size restrictions. It may not be an entirely fair comparison, but the speed gain is significant, consistent with various comments in the SO link above. All my files sit in directories named after each category under the parent directory “data_raw”, and they add up to ~2 GB on disk. I estimate they take a few hundred MB of memory once loaded into pandas.
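For reference, here is a minimal sketch of what I mean by “pandas on a list of dicts”. The directory layout, file extension, and column names below are my own assumptions, so adapt them to your data:

```python
from pathlib import Path
import pandas as pd

def load_texts(parent="data_raw"):
    """Read every text file under parent/<category>/ into a DataFrame."""
    rows = []
    for category_dir in Path(parent).iterdir():
        if not category_dir.is_dir():
            continue
        for f in category_dir.glob("*.txt"):
            rows.append({
                "label": category_dir.name,  # folder name = category
                "text": f.read_text(encoding="utf-8", errors="ignore"),
            })
    # Build the DataFrame once from the full list; this avoids the cost of
    # repeated appends and of shuffling files around on disk.
    return pd.DataFrame(rows)

df = load_texts()
```

Constructing the DataFrame in one shot from a list of dicts is what makes this fast; appending to a DataFrame row by row would be far slower.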

I’m sure this is not the most elegant method, but I’m quite happy with it for now, and I welcome any comments on how to improve it (I know multithreading will likely make it faster; see the sketch below).
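If you want to try the multithreading idea, something along these lines (using concurrent.futures from the standard library; the helper names and defaults are mine, not tested on this dataset) should work, since the workload is I/O-bound:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import pandas as pd

def read_one(path: Path) -> dict:
    """Read a single file; the parent folder name is used as the label."""
    return {"label": path.parent.name,
            "text": path.read_text(encoding="utf-8", errors="ignore")}

def load_texts_parallel(parent="data_raw", workers=8):
    # Threads help here because file reads release the GIL while waiting on I/O.
    files = list(Path(parent).glob("*/*.txt"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(read_one, files))
    return pd.DataFrame(rows)
```

On a mechanical drive the gain may be limited (or even negative) because parallel reads cause extra seeking, so it’s worth benchmarking before committing to it.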

The moral of the story: whenever you have the chance, upgrade to an SSD. However, it’s not the end of the world if you’re stuck with an old mechanical drive :smiley:


Yeah, I was using files on my big drive (4TB HDD) and wow, that was slow.
Moving them to my mSATA gave a huge performance increase as well.
