Training a model when the dataset is too big to fit on the hard drive

I have faced this situation: I have a large dataset of 1 TB, but my local drive has a capacity of only 500 GB. I can train a model on a smaller subset of the original dataset, but I would like to make use of the whole dataset. So I wonder if there is a method or paper about training a model in this kind of situation.

You could use semi-supervised learning here, or transfer-learn your model to new datasets. Look into them, and if you have questions let me know.

Thanks for the reply. I am training an autoencoder to learn features of the data, so it is already unsupervised learning.

What I mean is that you can also train the model incrementally: feed it one chunk of the data, save the weights, do a learn.load, and train again on the next chunk. A rough sketch of that loop is below.
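
Here is a minimal sketch of that chunk-at-a-time pattern in plain PyTorch, just to make the idea concrete. The `AutoEncoder` model, the `fetch_chunk()` helper, the number of chunks, and the checkpoint path are all hypothetical placeholders; only the save-weights-and-continue loop is the point, and it mirrors what `learn.save()` / `learn.load()` do in fastai.

```python
# Sketch: train on one chunk of a too-large dataset at a time,
# checkpointing the weights between chunks so training resumes where it left off.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

class AutoEncoder(nn.Module):          # stand-in for your actual model
    def __init__(self, n_features=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.decoder = nn.Linear(32, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def fetch_chunk(idx):
    """Hypothetical: copy chunk `idx` of the remote 1 TB dataset onto the
    local 500 GB drive and return it as a dataset."""
    data = torch.randn(10_000, 128)    # placeholder for the real loading code
    return TensorDataset(data)

model = AutoEncoder()
opt = optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
ckpt = "autoencoder_ckpt.pt"           # hypothetical checkpoint path

n_chunks = 4                           # e.g. four ~250 GB pieces
for chunk_idx in range(n_chunks):
    dl = DataLoader(fetch_chunk(chunk_idx), batch_size=256, shuffle=True)
    for (x,) in dl:                    # train on the current chunk only
        opt.zero_grad()
        loss = loss_fn(model(x), x)    # reconstruction loss for the autoencoder
        loss.backward()
        opt.step()
    # Persist the weights, then free the local copy before fetching the next
    # chunk (the equivalent of learn.save() / learn.load() in fastai).
    torch.save({"model": model.state_dict(), "opt": opt.state_dict()}, ckpt)
    # ...delete the local chunk files here before fetching the next one
```

The key design choice is that only one chunk ever lives on the 500 GB drive at a time, while the model and optimizer state carry over across chunks, so the full 1 TB dataset is eventually seen even though it never fits locally.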