I have been experimenting with a couple of datasets and noticed that if I change the batch size or the image size, the optimal learning rate stays roughly the same, but changing the architecture does change it…
Have others observed this phenomenon? Is the optimal learning rate dependent on the dataset and architecture but not on the batch size and image size? And how do data transformations (augmentations) affect the optimal learning rate?
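To make the comparison concrete, here is a minimal sketch of the kind of sweep I mean, on a toy logistic-regression task in NumPy. The dataset, learning-rate grid, and `train_logreg`/`best_lr` helpers are all hypothetical stand-ins for my actual setup; the point is just to show how one could compare the best learning rate across batch sizes.

```python
import numpy as np

def train_logreg(lr, batch_size, epochs=20, seed=0):
    """Train logistic regression with minibatch SGD; return final log loss.

    Hypothetical toy task: 512 samples, 10 features, linearly separable
    labels. A stand-in for a real dataset, only to illustrate the sweep.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(512, 10))
    true_w = rng.normal(size=10)
    y = (X @ true_w > 0).astype(float)

    w = np.zeros(10)
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-(X[b] @ w)))  # sigmoid predictions
            grad = X[b].T @ (p - y[b]) / len(b)    # mean log-loss gradient
            w -= lr * grad

    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    eps = 1e-12  # avoid log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def best_lr(batch_size, lrs=(0.01, 0.1, 1.0)):
    """Return the learning rate from the grid with the lowest final loss."""
    losses = {lr: train_logreg(lr, batch_size) for lr in lrs}
    return min(losses, key=losses.get)

if __name__ == "__main__":
    # Compare the winning learning rate at a small and a large batch size.
    print("best lr @ batch 16: ", best_lr(16))
    print("best lr @ batch 128:", best_lr(128))
```

On this toy problem the winning learning rate may or may not match across batch sizes; I am asking whether the rough invariance I see on real datasets (and its breakdown when the architecture changes) is a known, general effect.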