Ideally we would find the global minimum of our loss function, which measures "how far away" our predictions are from the desired values. In practice, though, chasing the global minimum can lead to overfitting.
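As a minimal sketch of what the loss measures, here is mean squared error minimized by gradient descent on a toy linear model (the model, data, and hyperparameters are illustrative assumptions, not from the paper):

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error: average squared distance between
    predictions and targets -- the 'how far away' measure."""
    return np.mean((y_pred - y_true) ** 2)

# Toy 1-D model y = w * x, fit by gradient descent.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y_true = 3.0 * x + rng.normal(scale=0.1, size=100)  # noisy targets

w = 0.0    # initial weight
lr = 0.1   # learning rate
for step in range(50):
    y_pred = w * x
    grad = np.mean(2 * (y_pred - y_true) * x)  # d(loss)/dw
    w -= lr * grad  # step downhill on the loss surface

print(f"learned w = {w:.3f}, loss = {mse_loss(w * x, y_true):.5f}")
```

This toy loss surface is convex, so gradient descent reaches the global minimum; the point of the paper quoted below is that for large networks the surface has many minima, and settling for a good local one is usually fine.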
From the paper "The Loss Surfaces of Multilayer Networks" (https://arxiv.org/abs/1412.0233):
We empirically verify several hypotheses regarding learning with large-size networks:
• For large-size networks, most local minima are equivalent and yield similar performance on a test set.
• The probability of finding a “bad” (high value) local minimum is non-zero for small-size networks and decreases quickly with network size.
• Struggling to find the global minimum on the training set (as opposed to one of the many good local ones) is not useful in practice and may lead to overfitting.