Advice for fitting an extremely large dataset (0.5 TB)

I am relatively new to deep learning and very new to PyTorch.

I am trying to practice what I learned in the first three lessons of DL Course 1 on an active Kaggle competition.

The current competition (https://www.kaggle.com/c/quickdraw-doodle-recognition) has an extremely large dataset of 50 million images.

Precomputing activations on just 10% of the data is estimated to take 8 hours.

I am wondering if there are better ways.

I need advice on the following aspects.

Efficient image storage:

I preprocessed the drawing coordinates from the CSV file into images using pyplot. All images are black and white.
The CSV file is 73 GB and the processed images are about 0.5 TB. I am wondering if there are improvements that could be made here.
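One option worth trying is to skip matplotlib entirely and rasterize the strokes straight onto a small single-channel canvas, then save compressed PNGs. Below is a minimal sketch, not the exact competition pipeline: the `drawing` column name, the 0-255 coordinate range of the simplified CSVs, and the 64x64 target size are assumptions.

```python
import ast
from PIL import Image, ImageDraw

def strokes_to_image(drawing_str, size=64, line_width=2):
    """Render one 'drawing' cell (a list of [xs, ys] strokes) onto a
    size x size grayscale canvas."""
    strokes = ast.literal_eval(drawing_str)      # "[[[x0,...],[y0,...]], ...]"
    img = Image.new("L", (size, size), color=0)  # single channel, black background
    draw = ImageDraw.Draw(img)
    scale = (size - 1) / 255.0                   # assumes simplified coords in 0-255
    for xs, ys in strokes:
        points = [(x * scale, y * scale) for x, y in zip(xs, ys)]
        if len(points) > 1:
            draw.line(points, fill=255, width=line_width)
        else:
            draw.point(points, fill=255)
    return img

# Usage: a 64x64 single-channel PNG is typically only a few hundred bytes,
# far smaller than a dumped matplotlib figure.
# strokes_to_image(row["drawing"]).save("doodle.png", optimize=True)
```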

Parallelize training:

The PyTorch docs indicate that training can be sped up by using multiple GPUs.
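In plain PyTorch (not the fast.ai wrappers specifically), the standard single-machine route is `torch.nn.DataParallel`, which splits each batch across the visible GPUs. A minimal sketch with a stand-in model:

```python
import torch
import torch.nn as nn

# Stand-in network; substitute your actual model.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
)

device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    # Each forward pass scatters the batch across the GPUs
    # and gathers the outputs back on the default device.
    model = nn.DataParallel(model)
model = model.to(device)

x = torch.randn(64, 1, 64, 64, device=device)  # dummy batch of grayscale doodles
out = model(x)                                 # shape: (64, 16, 1, 1)
```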

Questions:

  1. Where do I get machines with more than one GPU? All Paperspace instances seem to have only one GPU…
  2. What do I need to do in order to use the fast.ai library in a parallelized way?

I appreciate your suggestions and help. Thanks!

Hi

My advice is that at some point you will need to run your algorithm on a bigger dataset, and then you may need to build PyTorch with MAGMA support. It uses LAPACK routines to process data that exceeds the combined memory of your GPUs, but your GPUs will still need sufficient memory to hold the model.
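A quick way to see whether an existing install was built with MAGMA is to inspect the build flags. This is a hedged sketch: `torch.cuda.has_magma` may not be present in every PyTorch version, so it is read defensively.

```python
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
# True only when this build was compiled against MAGMA;
# falls back to "unknown" if the attribute is absent in your version.
print("MAGMA support:", getattr(torch.cuda, "has_magma", "unknown"))
```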