Ad #1 - training a deep learning model can take weeks. Also, the ideal learning rate is very situation dependent: it is a piece of information that can be derived locally with this method, and it has only local significance.
You could potentially test out various training regimes, say starting with some LR and decaying it exponentially or whatnot, but if you are supervising the training manually you would probably be better off using the lr_finder.
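To make the "start with this LR and decay it exponentially" regime concrete, here is a minimal sketch of such a schedule. The function name and the starting values are hypothetical, not from any particular library; in practice you would hand the per-epoch rate to your optimizer.

```python
def exponential_lr(initial_lr, decay_rate, epoch):
    """Learning rate after `epoch` epochs of exponential decay.

    `initial_lr` and `decay_rate` are placeholder values you would
    tune for your own problem; this is just the schedule itself.
    """
    return initial_lr * (decay_rate ** epoch)

# Example: start at 0.01 and halve the rate every epoch.
schedule = [exponential_lr(0.01, 0.5, e) for e in range(4)]
# halves each epoch: 0.01, 0.005, 0.0025, 0.00125
```

The point of the lr_finder is that it saves you from guessing `initial_lr` and `decay_rate` by hand in the first place.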
With regard to 2 and 3 (and somewhat 1), there are some best practices that seem to work, but you are free to try whatever seems to make sense to you. If you feel a grid search is applicable to your problem, give it a go.
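If you do want to try a grid search, it can be as simple as looping over every combination of hyperparameters and keeping the best-scoring one. A minimal sketch, where the grid values and the `evaluate` function are made-up stand-ins (a real one would train and validate your model):

```python
from itertools import product

# Hypothetical hyperparameter grid -- swap in whatever makes sense
# for your own model.
grid = {
    "lr": [1e-3, 1e-2],
    "batch_size": [32, 64],
}

def evaluate(lr, batch_size):
    # Stand-in score; in practice this would train the model and
    # return a validation metric.
    return -abs(lr - 1e-2) - abs(batch_size - 64) / 1000

# Try every combination and keep the highest-scoring one.
best = max(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=lambda params: evaluate(**params),
)
# best is {"lr": 0.01, "batch_size": 64} under this toy score
```

Keep in mind the cost grows multiplicatively with each hyperparameter you add, which is why it is rarely used for long-running deep learning training.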
The way I see it, the information is often presented in coherent chunks to teach us something, but there is no single recipe uniquely applicable. I would venture a guess that we do data augmentation down the road because this just makes it simpler to show us concepts one by one. Also, data augmentation is a bit more computationally expensive, since each epoch we need to run the images through all the layers, versus saving the activations of the conv part of the network once and training only the top layers on them - quite a nice way of driving that point!
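The cost difference can be illustrated with a toy sketch. The split into a frozen conv "body" and a trainable "head" is the real idea; the functions below are hypothetical stand-ins (real layers would be e.g. PyTorch modules):

```python
# Hypothetical stand-in for the expensive, frozen conv layers.
def conv_body(image):
    return [pixel * 2 for pixel in image]

# Hypothetical stand-in for the cheap, trainable top of the network.
def train_head(activations):
    return sum(activations)

images = [[1, 2], [3, 4]]

# Without augmentation the inputs never change, so we can run the
# conv layers ONCE and reuse the saved activations every epoch...
cached = [conv_body(img) for img in images]
for epoch in range(3):
    losses = [train_head(act) for act in cached]

# ...whereas with augmentation each epoch sees different images,
# so the full forward pass through conv_body must be repeated
# every epoch, which is what makes it more expensive.
```

So precomputing activations trades flexibility (no augmentation) for speed, which is exactly the trade-off being made at that point in the course.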
As to at what point to use data augmentation, I do not know. I came across some people training a model up to some level and only then starting to use data augmentation. Training with it from the beginning would probably be okay as well. But as you progress in the course you will discover there are other considerations to take into account. For instance, whether you are overfitting or not - this will be the driving factor in many regards.