In lesson 2, Jeremy explained some common problems, one of which was "too high learning rate", whose effect was a massively increased validation loss. I understand that when we increase the learning rate too much, gradient descent simply can't converge to the minimum and instead diverges. What I don't really get is: how come our training loss increases only a bit, but the validation loss increases massively? If the model has diverged, isn't it supposed to perform equally poorly on both the training set and the validation set? I would be glad if anyone could help.
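To make the divergence part concrete, here is a toy sketch of my own (not from the lesson): plain gradient descent on f(x) = x**2, where any learning rate above 1.0 makes each step overshoot the minimum by more than it corrects, so the iterate grows instead of shrinking.

```python
def descend(lr, steps=20, x=1.0):
    """Run `steps` gradient descent updates on f(x) = x**2, starting at x."""
    for _ in range(steps):
        x -= lr * 2 * x  # gradient of x**2 is 2x
    return x

# Each update multiplies x by (1 - 2*lr):
print(abs(descend(0.1)))  # |1 - 0.2| = 0.8 per step: converges toward 0
print(abs(descend(1.1)))  # |1 - 2.2| = 1.2 per step: diverges
```

My question is about why this divergence shows up so asymmetrically between the training and validation losses, since the toy picture above would suggest both get worse together.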
Thanks in advance