I am relatively new to deep learning and very new to PyTorch.
I am trying to practice what I learned from the first three lessons of DL Course 1 on an active Kaggle competition.
The current competition, https://www.kaggle.com/c/quickdraw-doodle-recognition, has an extremely large dataset of roughly 50 million images.
Precomputing activations on just 10% of the data is estimated to take about 8 hours.
I am wondering if there are better ways.
I need advice on the following aspects.
Efficient image storage:
I preprocessed the drawing coordinates from the CSV file into images using pyplot. All images are black and white.
The CSV file is 73 GB and the processed images take up about 0.5 TB. I am wondering if there are improvements that can be made here.
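One idea I'm considering, instead of writing half a terabyte of image files: rasterize the stroke coordinates on the fly inside a PyTorch `Dataset` with PIL, so only the 73 GB CSV needs to live on disk. Below is a rough, untested sketch; the `drawing` and `word` column names and the stroke layout (each stroke as `[xs, ys]`, coordinates roughly 0-255 in the simplified files) are my assumptions about the competition CSV, so corrections are welcome.

```python
import json
import numpy as np
from PIL import Image, ImageDraw
from torch.utils.data import Dataset

def strokes_to_image(drawing_json, size=64, line_width=2):
    """Rasterize one doodle (JSON-encoded strokes) to a grayscale PIL image."""
    strokes = json.loads(drawing_json)
    img = Image.new('L', (256, 256), color=0)   # simplified coords seem to be 0-255 (assumption)
    draw = ImageDraw.Draw(img)
    for stroke in strokes:
        xs, ys = stroke[0], stroke[1]           # ignore a possible third (time) array in raw files
        draw.line(list(zip(xs, ys)), fill=255, width=line_width)
    return img.resize((size, size))

class DoodleDataset(Dataset):
    """Wraps a DataFrame loaded from the competition CSV and renders images lazily."""
    def __init__(self, df, size=64):
        self.df, self.size = df, size

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        row = self.df.iloc[i]
        img = strokes_to_image(row['drawing'], self.size)          # 'drawing' column assumed
        x = np.asarray(img, dtype=np.float32)[None] / 255.0        # shape (1, H, W), scaled to [0, 1]
        return x, row['word']                                      # 'word' label column assumed
```

Would this kind of on-the-fly rendering be fast enough to feed the GPU, or is storing pre-rendered images still the better trade-off?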
Parallelized training:
The PyTorch docs indicate that training can be sped up by using multiple GPUs.
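From what I can tell, the docs suggest wrapping the model in `nn.DataParallel`. Here is a minimal sketch of what I think that looks like with a plain torchvision ResNet; I don't yet know how (or whether) this maps onto the fast.ai library, hence my questions below.

```python
import torch
import torch.nn as nn
from torchvision import models

# Plain torchvision model; the 340-class head matches my understanding of this competition.
model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 340)

# If more than one GPU is visible, DataParallel splits each batch across them.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()
```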
Questions:
- Where do I get machines with more than one GPU? All Paperspace instances seem to have only one GPU…
- What do I need to do to use the fast.ai library in a parallelized way?
I appreciate your suggestions and help. Thanks!