I am reading the paper by Leslie Smith https://arxiv.org/abs/1803.09820.
In the paper, Leslie mentions:
… the test loss within the black box indicates signs of overfitting at learning rates …
I understand that the training loss being lower than the test loss is a sign of overfitting. But isn't that true for the whole graph? The test loss is higher than the training loss everywhere, so why does this particular box indicate signs of overfitting?
Leslie shows this picture, in which he mentions that the plateau is what we want to achieve for an optimal model.
But, as before, the test loss is higher than the training loss. Doesn't that mean we are already overfitting to the training data?
Leslie also mentions this:
The takeaway message of this Section is that the practitioner’s goal is obtaining the highest performance while minimizing the needed computational time. Unlike the learning rate hyper-parameter where its value doesn’t affect computational time, batch size must be examined in conjunction with the execution time of the training.
I do not understand the part where it is said that the learning rate's value does not affect computational time. Computational time per training run? Per epoch? What is "computational time" referring to here?
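To make my confusion concrete, here is a small sketch of my current understanding (my own illustration, not from the paper): the number of gradient steps per epoch, and hence the per-epoch compute, seems to depend on the batch size but not on the learning rate, since the learning rate only scales the update value. The dataset size of 50,000 is just an assumed example.

```python
# Sketch of my understanding: per-epoch cost is driven by the number of
# gradient steps, which depends on batch size but not on learning rate.

N = 50_000  # assumed dataset size, e.g. a CIFAR-10-sized training set

def steps_per_epoch(batch_size):
    # Each example is visited once per epoch, so the step count
    # depends only on the batch size, never on the learning rate.
    return -(-N // batch_size)  # ceiling division

for lr in (0.01, 0.1, 1.0):             # learning rate varies...
    assert steps_per_epoch(128) == 391  # ...but the step count does not

print(steps_per_epoch(64), steps_per_epoch(128), steps_per_epoch(512))
# → 782 391 98
```

Is this the sense in which learning rate "doesn't affect computational time", or does the paper mean something else, e.g. total wall-clock time to reach a target accuracy (which the learning rate clearly does affect)?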