For larger datasets I was experimenting with PyTables; the I/O on AWS with reasonable data volumes seems very limiting, especially when you have lots of smaller files. I also experimented with other types of drives, etc.
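To make the PyTables part concrete, this is roughly the kind of thing I mean, packing decoded images into one large HDF5 file so storage sees a few big sequential reads instead of millions of tiny ones (a minimal sketch; the output path, image shape, and `image_iter` generator are placeholders, not an actual pipeline):

```python
# Rough sketch: append decoded uint8 images into a single compressed
# HDF5 extendable array instead of keeping millions of small files.
import numpy as np
import tables

def pack_images(out_path, image_iter, img_shape=(180, 180, 3)):
    filters = tables.Filters(complib="blosc", complevel=5)  # generic compression
    with tables.open_file(out_path, mode="w") as h5:
        arr = h5.create_earray(
            h5.root, "images",
            atom=tables.UInt8Atom(),
            shape=(0, *img_shape),           # extendable along the first axis
            filters=filters,
            expectedrows=1_000_000,
        )
        for img in image_iter:               # each img: uint8 array of img_shape
            arr.append(img[np.newaxis, ...])
```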
Once you go from an encoded image to a raw pixel matrix, the size explodes. The problem is that I never found a good way of compressing those matrices to conserve disk space. With the ~70GB of data for the Cdiscount competition, once read in as matrices, even with compression the data grows to over 1TB, maybe even closer to 2TB IIRC.
I guess this is just how it is, but I never found a good way of compressing image data in matrix form for storage. Chances are image compression algorithms are so highly specialized that nothing generic comes even close... and the whole idea of decoding images and saving them as raw matrices is just silly.
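An easy way to see the gap is to decode a single JPEG and push the raw pixel matrix through a generic codec; the file name below is a placeholder and the exact numbers vary per image, but the ordering is always the same:

```python
# Compare: size of the JPEG on disk vs. the decoded pixel matrix,
# raw and after generic (zlib) compression.
import os
import zlib

import numpy as np
from PIL import Image

path = "some_image.jpg"                                 # placeholder input
pixels = np.asarray(Image.open(path).convert("RGB"))   # uint8, H x W x 3

jpeg_bytes = os.path.getsize(path)
raw_bytes = pixels.nbytes
zlib_bytes = len(zlib.compress(pixels.tobytes(), 9))

print(f"jpeg on disk     : {jpeg_bytes:,} bytes")
print(f"raw pixel matrix : {raw_bytes:,} bytes")
print(f"zlib on raw      : {zlib_bytes:,} bytes")
```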
BTW, if you have a dataset this size, does shuffling still even make sense?
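If a full shuffle is off the table, the closest approximation I can picture is shuffling the order of large chunks and then randomizing locally with a small in-memory shuffle buffer, roughly like this (the buffer size and the sample iterator are arbitrary placeholders):

```python
# Approximate shuffling: samples arrive in roughly chunk order, and a
# fixed-size buffer randomizes them locally before they are yielded.
import random

def buffered_shuffle(sample_iter, buffer_size=10_000, seed=0):
    rng = random.Random(seed)
    buf = []
    for sample in sample_iter:
        buf.append(sample)
        if len(buf) >= buffer_size:
            # yield a random element from the buffer, keep the rest
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)        # drain whatever is left at the end
    yield from buf
```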