Tips & Tricks for Transformers

I’m training Transformer-architecture models for NLP tasks, and I wanted to ask whether you have any tips on selecting hyperparameters.
It would be great to get insights from your own experiments or from academic papers.

  1. How do you choose an optimizer?
  2. What is the preferred batch size?
  3. How many epochs are recommended?
  4. Do you recommend any architecture changes (compared to BERT)?
  5. How do you choose the learning-rate schedule?
  6. What vocabulary size?
  7. What hidden dimension size?
  8. How many layers?
  9. Any other tricks/tips?
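To make question 5 concrete, here is a minimal sketch of the warmup-then-inverse-square-root learning-rate schedule from the original Transformer paper ("Attention Is All You Need", Vaswani et al., 2017). The function name and the defaults (`d_model=512`, `warmup_steps=4000`) are just the paper's values used for illustration, not a recommendation for any particular task.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at training step `step` (step >= 1).

    Linear warmup for the first `warmup_steps` steps, then
    inverse-square-root decay, scaled by d_model**-0.5.
    """
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks at step == warmup_steps, rising before and decaying after:
peak = transformer_lr(4000)
print(transformer_lr(1000) < peak)   # still warming up
print(transformer_lr(20000) < peak)  # decaying
```

In practice, libraries such as Hugging Face Transformers expose similar schedules (e.g. linear or cosine decay with warmup), so you rarely need to implement this by hand.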