Baidu has answered this question empirically, but I don’t have a good background in math so I don’t understand the answer:
Many studies theoretically predict that generalization error “learning curves” take a power-law form, ε(m) ∝ αm^(β_g). Here, ε is generalization error, m is the number of samples in the training set, α is a constant property of the problem, and β_g = −0.5 or −1 is the scaling exponent that defines the steepness of the learning curve, i.e., how quickly a model family can learn from adding more training samples. Unfortunately, in real applications, we find empirically that β_g usually settles between −0.07 and −0.35, exponents that are unexplained by prior theoretical work.
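To see what the power law means in practice, here's a small sketch (my own illustration, not from the paper; the α value is arbitrary since it cancels out when comparing two dataset sizes):

```python
def generalization_error(m, alpha, beta_g):
    """Power-law learning curve: eps(m) = alpha * m ** beta_g.

    alpha and beta_g are problem-specific constants; the values used
    below are illustrative, not taken from the paper.
    """
    return alpha * m ** beta_g

# A steeper (more negative) exponent means error falls faster as data grows.
for beta_g in (-0.5, -0.35, -0.07):
    err_small = generalization_error(1_000, alpha=1.0, beta_g=beta_g)
    err_large = generalization_error(10_000, alpha=1.0, beta_g=beta_g)
    # The ratio eps(10m)/eps(m) equals 10 ** beta_g, independent of alpha.
    print(f"beta_g = {beta_g}: 10x data leaves {err_large / err_small:.2f} of the error")
```

Note that α drops out of the ratio entirely, so the exponent alone determines the relative improvement from more data.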
For example, for image classification on ImageNet:
The exponent for top-1 classification error is β_g = −0.309, while the exponent for top-5 classification error is β_g = −0.488.
How can this be expressed for a non-mathy layperson? For example, how much improvement in accuracy results from a 10x increase in training data?
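To make the question concrete, here's a sketch of the arithmetic (my own, not from the paper): under ε(m) = αm^(β_g), the constant α cancels when comparing ε(10m) to ε(m), and the ratio is simply 10^(β_g).

```python
# ImageNet exponents quoted above; the ratio eps(10m)/eps(m) = 10 ** beta_g.
exponents = {
    "top-1": -0.309,
    "top-5": -0.488,
}
for name, beta_g in exponents.items():
    remaining = 10 ** beta_g
    print(f"{name} (beta_g = {beta_g}): 10x data leaves {remaining:.0%} "
          f"of the error, i.e. a {1 - remaining:.0%} relative reduction")
```

So for top-1 error, a 10x increase in training data cuts the error roughly in half (10^(−0.309) ≈ 0.49), and for top-5 error it cuts it by roughly two thirds (10^(−0.488) ≈ 0.33), assuming the power law holds across that range.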