I’m training Transformer-based models for NLP tasks, and I was wondering if you have any tips on how to select hyperparameters.
It would be great to get insights from your own experiments or from academic papers.
- How do you choose an optimizer?
- What is the preferred batch_size?
- How many epochs are recommended?
- Do you recommend any architecture changes (compared to BERT)?
- How do you choose the learning-rate schedule (warmup, decay, cycles)? (My current setup is sketched below.)
- Vocabulary size?
- Hidden dimension size?
- Number of layers?
- Any other tricks/tips?
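
For context, here is a minimal sketch of how I'm currently setting things up, assuming PyTorch and the Hugging Face Transformers library; the config values, learning rate, and step count are placeholders from my own pipeline, not recommendations:

```python
import torch
from transformers import BertConfig, BertForMaskedLM, get_linear_schedule_with_warmup

# BERT-base-like configuration; these are the hyperparameters I'm asking about
config = BertConfig(
    vocab_size=30522,        # vocabulary size
    hidden_size=768,         # hidden dimension size
    num_hidden_layers=12,    # number of layers
    num_attention_heads=12,
)
model = BertForMaskedLM(config)

num_training_steps = 100_000  # placeholder; depends on dataset size, batch_size, and epochs

# AdamW plus linear warmup/decay -- the optimizer and LR-schedule choices I'm asking about
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # 10% warmup, chosen arbitrarily
    num_training_steps=num_training_steps,
)
```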