Understanding Temporal Activation Regularization - Chapter 12 Language Model

Our activations tensor has a shape of bs x sl x n_hid, and we read consecutive activations along the sequence-length axis (the dimension in the middle).

With this, TAR can be expressed as follows:

loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()

Can someone explain in more detail how TAR is being calculated?

I think the point of TAR is to push activations[:,1:] and activations[:,:-1] towards being similar, which minimizes this term of the loss.

Similar activations suggest that consecutive tokens should not have very different “meanings”.
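
Here is a quick numerical sanity check of that idea (a toy example of mine with made-up sizes bs=1, sl=3, n_hid=2): when consecutive activations are close, the TAR term is tiny; when they jump around, it gets large.

import torch

smooth = torch.tensor([[[1.0, 1.0], [1.1, 0.9], [1.0, 1.1]]])   # consecutive steps are close
jumpy  = torch.tensor([[[1.0, 1.0], [5.0, -3.0], [0.0, 4.0]]])  # consecutive steps differ a lot

def tar(acts):
    # same expression as the loss line above
    return (acts[:, 1:] - acts[:, :-1]).pow(2).mean()

print(tar(smooth))   # ~0.0175 -> small extra loss
print(tar(jumpy))    # 26.5 -> large extra loss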

If indexing is what you are confused about, take a look below to see how
the third dimension is kept automatically when only the first two are sliced:

>>> import torch
>>> foo = torch.zeros([5,5,5])
>>> foo.size()
torch.Size([5, 5, 5])
>>> foo[:, 1:].size()
torch.Size([5, 4, 5])
>>> foo[:, :-1].size()
torch.Size([5, 4, 5])

Then we subtract the two tensors, ignoring the first time step in one tensor and the last time step in the other:

activations[:,1:] - activations[:,:-1]

Then that difference tensor (call it bar) is squared element-wise and mean() is applied, which by default reduces over all dimensions to a single scalar:

bar.pow(2).mean()
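
To see the whole pipeline in one place, here is a similar session with random numbers instead of zeros (so the result is not trivially 0):

>>> import torch
>>> acts = torch.randn(5, 5, 5)          # stand-in for real activations
>>> diff = acts[:, 1:] - acts[:, :-1]    # change between consecutive time steps
>>> diff.size()
torch.Size([5, 4, 5])
>>> diff.pow(2).size()                   # squaring keeps the shape
torch.Size([5, 4, 5])
>>> diff.pow(2).mean().size()            # mean() reduces everything to a single scalar
torch.Size([])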

Thanks for your reply.
But what I don’t understand is what activations[:,1:] - activations[:,:-1] is calculating.
Is it calculating how far apart two words are in the vocabulary?

No, it has nothing to do with distances in the vocabulary. It is calculating the difference between the activations of the nth word and the (n-1)th word.

Remember that we are predicting the next word for each input word. So if sl = 3, we will predict 3 words and hence have 3 activations (see Chapter 12 - Creating More Signal).
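
For example (a throwaway tensor with made-up sizes bs=1, sl=3, n_hid=4), with 3 activations there are only 2 consecutive pairs to compare:

>>> import torch
>>> acts = torch.randn(1, 3, 4)
>>> (acts[:, 1:] - acts[:, :-1]).size()
torch.Size([1, 2, 4])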

When using TAR, our goal is to keep the difference between any two consecutive activations as small as possible.

So to achieve this, we take the difference between each pair of consecutive activations; the bigger the difference, the larger the penalty added to the loss, which pushes the model to eventually output activations that are closer to each other.
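
To make that concrete, here is a minimal sketch of a single training step with TAR added on top of the usual cross-entropy loss. All sizes, names and the beta value are made up for illustration; this is plain PyTorch, not the exact code from the book or the fastai library.

import torch
import torch.nn.functional as F

bs, sl, n_hid, vocab_sz = 2, 3, 4, 10
activations = torch.randn(bs, sl, n_hid, requires_grad=True)  # pretend these are the LSTM outputs
decoder = torch.nn.Linear(n_hid, vocab_sz)                    # maps each activation to word scores
targets = torch.randint(vocab_sz, (bs, sl))                   # one target word per position

preds = decoder(activations)                                  # shape (bs, sl, vocab_sz)
ce_loss = F.cross_entropy(preds.reshape(-1, vocab_sz), targets.reshape(-1))

beta = 2.0                                                    # TAR strength (hyperparameter)
tar = (activations[:, 1:] - activations[:, :-1]).pow(2).mean()
loss = ce_loss + beta * tar                                   # big jumps between time steps raise the loss
loss.backward()                                               # gradients now also push consecutive activations closer together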


Thank you. That makes sense.