Working with extremely large amounts of training data

Hi Everyone - I’m fortunate enough to have the problem of a ton of training data, and I’ve started to wonder about the following question:

Is it better to train on all the data where, because the epochs are so large, learning usually stops after the first or second epoch, or is it better to train on smaller subsets of data where learning happens over many epochs?

Put another way: is it better for your model to learn from an extremely large set of samples where it sees each sample once, or to work with a subset of the total data available but see each sample a number of times?

I am imagining the following:

  • break the full training set into N subsets
  • Train on the first subset until learning stops
  • Then train on the next subset and the next …
  • When you no longer see improvement from the current subset, stop training (ignoring the remaining subsets)

Is this a good idea?
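To make that concrete, here’s a rough sketch of the loop I’m imagining. `train_one_epoch` and `evaluate` are placeholders for whatever training/validation code you already have (not a real API), and the plateau check is deliberately simple:

```python
import random

def train_in_subsets(model, train_items, val_items, n_subsets,
                     train_one_epoch, evaluate, patience=2):
    # Break the full training set into N subsets.
    random.shuffle(train_items)
    subset_size = len(train_items) // n_subsets
    subsets = [train_items[i * subset_size:(i + 1) * subset_size]
               for i in range(n_subsets)]

    best_overall = float("inf")
    for k, subset in enumerate(subsets):
        best_on_subset = float("inf")
        epochs_without_improvement = 0

        # Train on this subset only, until validation loss stops improving.
        while epochs_without_improvement < patience:
            train_one_epoch(model, subset)
            val_loss = evaluate(model, val_items)  # same held-out set throughout
            if val_loss < best_on_subset:
                best_on_subset = val_loss
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1

        # If the fresh subset didn't beat what we already had,
        # stop and ignore the remaining subsets.
        if best_on_subset >= best_overall:
            break
        best_overall = best_on_subset

    return model
```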

Note: about a year ago I asked this question. It was for a different project, but it is obviously closely related to the question above. Here, however, I’m not interested in OOM errors, but rather in the best way to train with a large amount of training data.

45,000 images is good if you use a pretrained network. In general, training on a wide variety of examples generalizes better. You don’t describe the categories or the number of images per category?

Thanks Kasper. Yes, I’ve been purposely vague here.

I’m interested in general best practices/guiding principles rather than specifics for a particular problem. My gut tells me that as long as your training data has a “sufficient” amount and “sufficient” variability, adding new samples offers only marginal improvement, while returning to the same samples multiple times may be a benefit.

Sadly my gut isn’t always reliable, and I have failed to define “sufficient”. I can of course try both methods and just choose the one with better results (I also think there’s probably a smart way to attack this mathematically), but I was curious whether this was already well understood.
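For example, one way I could set up that comparison fairly would be to fix the total number of training steps and only vary how they’re spread across the data. Just a sketch; `build_model`, `train_steps`, and `evaluate` are placeholders for my own code, not a real API:

```python
def compare_regimes(all_items, val_items, build_model, train_steps, evaluate,
                    total_steps=100_000, subset_fraction=0.1):
    # Regime A: the whole dataset, so each sample is seen only once or twice.
    model_a = build_model()
    train_steps(model_a, all_items, n_steps=total_steps)

    # Regime B: the same step budget, concentrated on a small subset,
    # so each sample is seen many times.
    subset = all_items[: int(len(all_items) * subset_fraction)]
    model_b = build_model()
    train_steps(model_b, subset, n_steps=total_steps)

    return evaluate(model_a, val_items), evaluate(model_b, val_items)
```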

I’m currently doing this for a project. I split the data into clumps: with 3,000 images to work with, that’s roughly 4 clumps of ~700 images each. Train with 700 and validate with 200, then 1,200 and 400, and so on until the full dataset is in there. Three rounds in total, since one clump is permanently used for validation.
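Roughly, the idea looks like this. This is just a sketch of the general approach, not my exact code, and it assumes my reading of the scheme (one clump held out for validation the whole time, training pool growing by a clump each round); `train_until_plateau` and `evaluate` are placeholders:

```python
def train_with_growing_clumps(model, items, n_clumps,
                              train_until_plateau, evaluate):
    clump_size = len(items) // n_clumps
    clumps = [items[i * clump_size:(i + 1) * clump_size]
              for i in range(n_clumps)]

    val_clump = clumps[-1]         # permanently used for validation
    train_pool = []
    for clump in clumps[:-1]:      # with 4 clumps this gives 3 training rounds
        train_pool += clump        # grow the training data by one clump
        train_until_plateau(model, train_pool)
        print(f"{len(train_pool)} training images -> "
              f"val loss {evaluate(model, val_clump):.4f}")
    return model
```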

Interesting - so you’re increasing the clump size as you get further into training. Have you compared your results to just training on all 3000 images?