Experiments on using Redis as DataSet for fast.ai for huge datasets

devforfu · December 6, 2018, 10:44am

Definitely sounds like an interesting idea.

Just a small remark: as far as I recall, I generated 50K images per class (~17M PNG images, 256x256, RGB) in a few hours on my machine. At least, that was much faster than 325/3 hours. Not sure why it takes so long on AWS. I believe the whole dataset conversion shouldn’t take too much time if you have enough cores and enough space on your SSD. Also, for that specific competition, I was generating images on the fly with a custom Dataset implementation.
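To illustrate what generating images on the fly can look like, here is a minimal sketch, not my exact code. It assumes each drawing is a list of strokes given as `(xs, ys)` lists of integer coordinates. The class name `StrokeDataset` and the `source_size` parameter are made up for the example, and it renders a single grayscale channel instead of RGB to keep things short.

```python
class StrokeDataset:
    """Map-style dataset that rasterizes a drawing only when it is requested."""

    def __init__(self, drawings, labels, size=256, source_size=256):
        # drawings: list of drawings; each drawing is a list of (xs, ys) strokes
        # (assumed format, coordinates in the range 0..source_size-1)
        self.drawings = drawings
        self.labels = labels
        self.size = size
        self.scale = (size - 1) / (source_size - 1)

    def __len__(self):
        return len(self.drawings)

    def _draw_line(self, img, x0, y0, x1, y1):
        # Naive line rasterization: sample points along the segment.
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):
            t = i / steps
            x = round((x0 + (x1 - x0) * t) * self.scale)
            y = round((y0 + (y1 - y0) * t) * self.scale)
            img[y * self.size + x] = 255

    def __getitem__(self, idx):
        # Row-major grayscale buffer, one byte per pixel, black background.
        img = bytearray(self.size * self.size)
        for xs, ys in self.drawings[idx]:
            for i in range(len(xs) - 1):
                self._draw_line(img, xs[i], ys[i], xs[i + 1], ys[i + 1])
        return bytes(img), self.labels[idx]


# Tiny usage example: one drawing with a single horizontal stroke on a 4x4 canvas.
demo = StrokeDataset([[([0, 3], [0, 0])]], labels=[7], size=4, source_size=4)
demo_image, demo_label = demo[0]
```

Since the object only needs `__len__` and `__getitem__`, it should plug into a PyTorch DataLoader once the bytes are converted to a tensor, and nothing has to be written to disk beforehand.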

Though perhaps you used the “full” dataset instead of the simplified one?

I am not sure about Alternative 2, though, because of this thing. I am doing exactly this with my custom dataset and getting a huge memory leak (though I am probably doing something wrong as well). So in my case, the worst problem is that old batches are kept in RAM during a training epoch.

@marcmuc What do you think, can we get the same leak even with Redis? At the end of the day, we still need to put the data into RAM, and if DataLoader is the one leaking memory because of Python’s multiprocessing implementation, then Redis will not help much.
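To make the question concrete, here is a rough sketch of the kind of dataset I have in mind. It is not a tested fix. The dataset object keeps no per-sample Python objects, only a count and a key pattern, and opens its store connection lazily in each worker process. One commonly reported cause of memory growth in forked workers is large Python lists held by the dataset, so this at least rules that out. Whether batches still pile up in RAM is exactly what I am asking. `KeyValueDataset`, `FakeStore`, the `img:{}` key pattern, and the 2-byte label prefix are all assumptions for the example. A real setup would pass something like `lambda: redis.Redis(host=...)` as `connect`, since the redis-py client also exposes `get`.

```python
import os
import struct


class KeyValueDataset:
    """Map-style dataset that fetches encoded samples from a key-value store."""

    def __init__(self, num_items, connect, key_format="img:{}"):
        self.num_items = num_items
        self.connect = connect        # zero-argument factory returning a client
        self.key_format = key_format  # hypothetical key naming scheme
        self._client = None
        self._pid = None

    def _store(self):
        # Open (or reopen) the connection in the current process, so that
        # forked workers do not share the parent's connection.
        pid = os.getpid()
        if self._client is None or self._pid != pid:
            self._client = self.connect()
            self._pid = pid
        return self._client

    def __len__(self):
        return self.num_items

    def __getitem__(self, idx):
        if not 0 <= idx < self.num_items:
            raise IndexError(idx)
        payload = self._store().get(self.key_format.format(idx))
        if payload is None:
            raise KeyError(self.key_format.format(idx))
        # Assumed layout: little-endian uint16 label followed by image bytes.
        (label,) = struct.unpack_from("<H", payload)
        return payload[2:], label


class FakeStore:
    """In-memory stand-in for a Redis client, used only for this example."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


store = FakeStore()
store.set("img:0", struct.pack("<H", 5) + b"abc")
dataset = KeyValueDataset(1, connect=lambda: store)
sample, label = dataset[0]
```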