In SGD with restarts, why should we look for the solution that generalizes best instead of the one with the lowest loss?
Is that only useful when the training and test sets differ a lot from each other?
I mean the figure Jeremy mentioned, where the loss curve shifts a little bit.
The point with the lowest loss on your validation set may not correspond exactly to the same point on your test set (see the figure presented by Jeremy), which results in good performance during training but poor performance during testing.
So, instead, you want to end up in a wider, flatter area where the loss is more robust to small changes in your data, so that performance is comparable during training and testing.
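To make that concrete, here is a minimal toy sketch of the idea. It is my own illustration, not Jeremy's figure or anything from the fastai library: the 1-D loss function, the names `train_loss` and `heldout_loss`, and the shift of 0.2 are all made up for the example. The loss has a sharp, slightly deeper minimum near w = -2 and a wide, slightly shallower one near w = 2, and the held-out loss is simply the same curve moved a little to the right.

```python
import math


def train_loss(w):
    """Toy 1-D loss (hypothetical): sharp deep dip at w=-2, wide shallower dip at w=2."""
    sharp = 1.0 * math.exp(-((w + 2.0) ** 2) / (2 * 0.1 ** 2))  # narrow valley, depth 1.0
    flat = 0.9 * math.exp(-((w - 2.0) ** 2) / (2 * 1.0 ** 2))   # wide valley, depth 0.9
    return -(sharp + flat)


def heldout_loss(w, shift=0.2):
    """Assumed held-out loss: the same curve, shifted slightly along w."""
    return train_loss(w - shift)


W_SHARP, W_FLAT = -2.0, 2.0  # locations of the two minima on the training curve

for name, w in [("sharp", W_SHARP), ("flat", W_FLAT)]:
    print(f"{name:5s} minimum: train={train_loss(w):.3f}  held-out={heldout_loss(w):.3f}")
```

On the training curve the sharp minimum wins (roughly -1.00 versus -0.90), but after the small shift its loss rises to roughly -0.14, while the flat minimum barely moves (roughly -0.88). That is the sense in which the flat area is "more robust", at least in this toy setup.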
So is SGD with restarts only suitable when the test and training sets differ a lot from each other,
and when the data gives a valley (area) that is both very flat and low?
If the flat area isn't also low, SGD with restarts wouldn't give very good results,
and we would be better off using Adam or RMSprop instead!
Is that accurate?