I’ve noticed that when I tune my SGD hyperparams, they don’t transfer to the same dataset once I increase its size.
Basically, what I do is use, say, 10% of the dataset for hyperparameter tuning. Once I have settings that work, I train on the full dataset. Every time, the loss blows up about midway through the full-dataset run, even though the learning rate, cyclical schedule settings, momentum, etc. are identical.
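To make the setup concrete, here’s a minimal sketch (plain Python, made-up dataset sizes and a generic triangular cyclical schedule — not my actual code) of what “identical settings” look like across the two runs. The point it illustrates: with epochs and batch size held fixed, the full run takes ~10x more optimizer steps, so a step-based cyclical schedule tuned on the subset sweeps through many more cycles on the full dataset.

```python
def triangular_lr(step, base_lr, max_lr, step_size):
    """Generic triangular cyclical LR: ramps base_lr -> max_lr -> base_lr
    over 2 * step_size optimizer steps, then repeats."""
    cycle = step // (2 * step_size)
    x = abs(step / step_size - 2 * cycle - 1)  # 1 at cycle edges, 0 at peak
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

batch_size = 32
epochs = 10
step_size = 160  # tuned so the subset run completes exactly one cycle

# Tuning run: 10% subset (hypothetical 1,000 examples)
subset_steps = (1_000 // batch_size + 1) * epochs   # 320 optimizer steps

# Full run: same epochs, same batch size, 10x the data
full_steps = (10_000 // batch_size + 1) * epochs    # 3,130 optimizer steps

subset_cycles = subset_steps / (2 * step_size)      # 1.0 cycle
full_cycles = full_steps / (2 * step_size)          # ~9.8 cycles
```

So even with every number in the config identical, the schedule the optimizer actually experiences is quite different on the full dataset.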
Has anyone observed something similar? Is there a heuristic or rule of thumb for training on larger datasets, where the hyperparams need to be adjusted downwards (most likely a lower learning rate and/or fewer epochs)?