I've never seen a nonlinear activation applied to the input projection before it's combined with the recurrent (hidden) term.
I would have expected it to be something like
`tanh(x_input * x_weights + hidden_input * hidden_weights)`
however, it actually looks like
`tanh(relu(x_input * x_weights) + hidden_input * hidden_weights)`
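For concreteness, here's a minimal sketch of the two updates in PyTorch (the dimensions and variable names are my own, purely for illustration):

```python
import torch

# Hypothetical sizes, just for illustration
input_size, hidden_size = 8, 16
x_weights = torch.randn(input_size, hidden_size)
hidden_weights = torch.randn(hidden_size, hidden_size)

x_input = torch.randn(1, input_size)        # current input (e.g. an embedding)
hidden_input = torch.randn(1, hidden_size)  # previous hidden state

# What I expected: the standard Elman-style update
h_expected = torch.tanh(x_input @ x_weights + hidden_input @ hidden_weights)

# What the code does: ReLU on the input projection before
# adding the recurrent term, then tanh over the sum
h_observed = torch.tanh(torch.relu(x_input @ x_weights) + hidden_input @ hidden_weights)
```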
Is this because the input is an embedding, or just a slight tweak that may or may not work better?