Baidu has answered this question empirically, but I don’t have a good background in math so I don’t understand the answer:
Many studies theoretically predict that generalization error “learning curves” take a power-law form, ε(m) ∝ αm^(β_g). Here, ε is generalization error, m is the number of samples in the training set, α is a constant property of the problem, and β_g = −0.5 or −1 is the scaling exponent that defines the steepness of the learning curve, i.e., how quickly a model family can learn from adding more training samples. Unfortunately, in real applications, we find empirically that β_g usually settles between −0.07 and −0.35, exponents that are unexplained by prior theoretical work.
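To see what the power law means in practice, here's a small sketch (my own illustration, not from the paper; the α value is arbitrary since it cancels out when comparing two dataset sizes):

```python
def generalization_error(m, alpha, beta_g):
    """Power-law learning curve: eps(m) = alpha * m ** beta_g.

    alpha and beta_g are problem-specific constants; the values used
    below are illustrative, not taken from the paper.
    """
    return alpha * m ** beta_g

# A steeper (more negative) exponent means error falls faster as data grows.
for beta_g in (-0.5, -0.35, -0.07):
    err_small = generalization_error(1_000, alpha=1.0, beta_g=beta_g)
    err_large = generalization_error(10_000, alpha=1.0, beta_g=beta_g)
    # The ratio eps(10m)/eps(m) equals 10 ** beta_g, independent of alpha.
    print(f"beta_g = {beta_g}: 10x data leaves {err_large / err_small:.2f} of the error")
```

Note that α drops out of the ratio entirely, so the exponent alone determines the relative improvement from more data.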
For example, for image classification on ImageNet:
The exponent for top-1 classification error is β_g = −0.309, while the exponent for top-5 classification error is β_g = −0.488.
How can this be expressed for a non-mathy layperson? For example, how much improvement in accuracy results from a 10x increase in training data?
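To make the question concrete, here's a sketch of the arithmetic (my own, not from the paper): under ε(m) = αm^(β_g), the constant α cancels when comparing ε(10m) to ε(m), and the ratio is simply 10^(β_g).

```python
# ImageNet exponents quoted above; the ratio eps(10m)/eps(m) = 10 ** beta_g.
exponents = {
    "top-1": -0.309,
    "top-5": -0.488,
}
for name, beta_g in exponents.items():
    remaining = 10 ** beta_g
    print(f"{name} (beta_g = {beta_g}): 10x data leaves {remaining:.0%} "
          f"of the error, i.e. a {1 - remaining:.0%} relative reduction")
```

So for top-1 error, a 10x increase in training data cuts the error roughly in half (10^(−0.309) ≈ 0.49), and for top-5 error it cuts it by roughly two thirds (10^(−0.488) ≈ 0.33), assuming the power law holds across that range.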