NLP Course: Transformer model hyperparameter tuning

Hi,

I checked out the Transformer model from the Fast.ai NLP course to understand the concepts behind attention. I was working through the TensorFlow Transformer tutorial when I came across Rachel’s video. I ran the notebook and it gave me decent results using the parameters given in the tutorial:

num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1
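
For anyone less familiar with these knobs, here is a minimal sketch of where d_model, num_heads, dff and the dropout rate show up inside one encoder-style block. It uses standard Keras layers rather than the classes defined in the notebook, and the shapes are illustrative assumptions, not the tutorial's exact code:

```python
import tensorflow as tf

d_model = 128
num_heads = 8
dff = 512
dropout_rate = 0.1

# Self-attention sub-layer: num_heads heads, each of width d_model // num_heads.
mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                         key_dim=d_model // num_heads)

# Feed-forward sub-layer: expand to dff, then project back to d_model.
ffn = tf.keras.Sequential([
    tf.keras.layers.Dense(dff, activation='relu'),
    tf.keras.layers.Dense(d_model),
    tf.keras.layers.Dropout(dropout_rate),
])

x = tf.random.normal((2, 10, d_model))   # (batch, seq_len, d_model)
attn_out = mha(query=x, value=x, key=x)  # self-attention over the sequence
print(ffn(attn_out).shape)               # (2, 10, 128)
```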

As per the suggestion in the same docs (the values from the "Attention Is All You Need" paper), I used the following values for 50 epochs and got gibberish results. Has anyone tried hyperparameter tuning for the Transformer model in general? What range of warmup steps would we need to try to achieve good results?

num_layers = 6, d_model = 512, dff = 2048
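
One thing worth noting on the warmup question: the schedule from the paper (which the tutorial also uses) scales the learning rate by d_model**-0.5, so going from d_model = 128 to 512 already halves the peak learning rate, and that peak is only reached after warmup_steps optimizer steps. Here is a minimal sketch of that schedule in case it helps to experiment with warmup_steps (4000 is the paper's value; any other value is just a guess to try, not a known-good setting):

```python
import tensorflow as tf

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """lr = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)"""
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)                 # inverse-sqrt decay after warmup
        arg2 = step * (self.warmup_steps ** -1.5)  # linear warmup
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

# With d_model=512 the peak learning rate is half of the d_model=128 case.
learning_rate = CustomSchedule(d_model=512, warmup_steps=4000)
optimizer = tf.keras.optimizers.Adam(learning_rate,
                                     beta_1=0.9, beta_2=0.98, epsilon=1e-9)
```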

# Results
Input: Hola! cómo estás
Predicted translation: Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi! Hi!

What was the accuracy (%) of your model above?

I tried increasing the number of epochs (to 120) and reducing the number of layers (to 3), and was getting better, more meaningful responses.