Thanks for your reply.
But what I don’t understand is: what is activations[:,1:] - activations[:,:-1] calculating?
Is it calculating how far apart two words are in the vocabulary?
No, it is calculating the difference between the activations for the nth word and the (n-1)th word, not a distance between vocabulary entries.
Remember that we predict the next word for each input word. So if sl = 3, we predict 3 words and therefore have 3 activations (see: Chapter 12 - Creating More Signal).
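To make the slicing concrete, here is a small sketch. It uses NumPy rather than PyTorch only so it runs without extra setup; the slicing behaves the same way on tensors. The activation values and the shape (batch of 1, sl = 3, 2 hidden units) are made up for illustration.

```python
import numpy as np

# Hypothetical activations: batch size 1, sl = 3 time steps, 2 hidden units.
activations = np.array([[[1.0, 2.0],
                         [1.5, 2.5],
                         [4.0, 0.0]]])  # shape (1, 3, 2)

later = activations[:, 1:]     # time steps 2 and 3
earlier = activations[:, :-1]  # time steps 1 and 2

# Element-wise difference between each step and the one before it.
diffs = later - earlier
print(diffs.shape)  # (1, 2, 2): 3 time steps give 2 consecutive pairs
print(diffs)
```

Note that the second axis is the sequence position, so with 3 activations you get only 2 differences.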
When using TAR, our goal is to keep the difference between two consecutive activations as small as possible.
To achieve this, we take the difference between each pair of consecutive activations. If the difference is large, a larger penalty is added to the loss, which pushes the model toward outputting activations that are closer to each other.
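As a rough sketch of how that penalty could be computed (not necessarily the exact library code): square the consecutive differences, average them, and scale by a weight. The function name tar_penalty, the weight beta = 2.0, the base loss of 1.0, and the example values are all assumptions for illustration. NumPy is used here in place of PyTorch.

```python
import numpy as np

def tar_penalty(activations, beta=2.0):
    """Mean squared difference between consecutive time steps, scaled by beta.

    beta is an assumed weighting factor; activations has shape (bs, sl, n_hid).
    """
    diffs = activations[:, 1:] - activations[:, :-1]
    return beta * np.mean(diffs ** 2)

# Activations that change slowly from step to step.
smooth = np.array([[[1.0, 2.0], [1.1, 2.1], [1.2, 2.2]]])
# Activations that jump around between steps.
jumpy = np.array([[[1.0, 2.0], [4.0, -1.0], [0.0, 3.0]]])

base_loss = 1.0  # stand-in for the usual cross-entropy loss
print(base_loss + tar_penalty(smooth))  # small extra penalty
print(base_loss + tar_penalty(jumpy))   # much larger extra penalty
```

The jumpy sequence ends up with a far higher total loss, so gradient descent is nudged toward the smoother behaviour.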