Hi Everyone - I’m fortunate enough to have the problem of a ton of training data, and I’ve started to wonder about the following question:
Is it better to train on all the data, where, because each epoch is so large, learning usually stops after the first or second epoch, or is it better to train on smaller subsets of the data, where learning happens over many epochs?
Put another way: is it better for your model to learn from an extremely large set of samples where it sees each sample only once, or to work with a subset of the total data available but see each sample multiple times?
I am imagining the following (a rough code sketch follows the list):
- Break the full training set into N subsets
- Train on the first subset until learning stops
- Then train on the next subset, and the next …
- When you no longer see improvement from the current subset, stop training (ignoring the remaining subsets)
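To make the scheme concrete, here’s a minimal sketch in Python. It assumes a model with an incremental-fit API (I’m using sklearn’s `SGDClassifier` and `partial_fit` as a stand-in) and a held-out validation set to decide when learning has stalled; the patience value and improvement threshold are my own illustrative choices, not something from the question:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Toy stand-in for "a ton of training data": 100k samples, 20 features.
X = rng.normal(size=(100_000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_val, y_val = X[:5_000], y[:5_000]    # held-out validation split
X_tr, y_tr = X[5_000:], y[5_000:]

N = 10                                 # number of subsets
subsets = np.array_split(np.arange(len(X_tr)), N)

model = SGDClassifier(loss="log_loss", random_state=0)
classes = np.unique(y_tr)

best_val = -np.inf
for i, idx in enumerate(subsets):
    val_before_subset = best_val
    stalled = 0
    # Train on the current subset for many epochs, until validation stalls.
    while stalled < 2:                 # patience of 2 epochs (my choice)
        model.partial_fit(X_tr[idx], y_tr[idx], classes=classes)
        val = accuracy_score(y_val, model.predict(X_val))
        if val > best_val + 1e-4:      # improvement threshold (my choice)
            best_val, stalled = val, 0
        else:
            stalled += 1
    print(f"subset {i}: best validation accuracy so far {best_val:.4f}")
    # Last step of the list: if this whole subset added nothing, stop early.
    if best_val <= val_before_subset + 1e-4:
        print(f"no improvement from subset {i}; stopping")
        break
```

The held-out validation set is what makes “learning stops” measurable here; judging “no improvement” on training loss alone would keep you training even while the model is just overfitting the current subset.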
Is this a good idea?
Note: about a year ago I asked this question. It was for a different project but is obviously closely related. Here, however, I’m not interested in OOM errors, but rather in the best way to train with a large amount of training data.