Language model: efficient way of getting hidden representation of document


I’ve trained a language model using fastai’s Learner class. From there I’ve extracted the encoder, and with the encoder I can use the final hidden state as a vector representation of a whole document (to use as features for clustering, ranking, classification etc. downstream).

I can loop through each document and get the hidden state one by one without any issues. However, I’m having trouble constructing minibatches so that I can get the hidden representations of a set of documents efficiently. To get the activations for a minibatch in PyTorch, the input sequences have to be padded to the length of the longest sequence in the batch. However, after inspecting the values and the cosine similarities of the vectors, it looks like these sentences:

encoder("this is a sentence .")
encoder("this is a sentence . <pad> <pad> <pad>")

give very different hidden representations, i.e. the extra padding significantly alters the hidden state. My guess is that this happens because the learner never actually sees the <pad> token while the language model is being trained, so its embedding is poorly defined. (The data for the language model is just a concatenation of all documents.)

What’s the solution here? Iterating through the documents one by one is really inefficient, but I don’t see a way of batching them up correctly. Perhaps during training I could add random sequences of <pad> so the language model learns that those tokens shouldn’t affect the hidden representation? What’s the best way to get hidden states of documents as numpy arrays?

Thanks again Jeremy for your wonderful MOOC!


Did you ever figure this out? I noticed the same thing: whenever I run predictions with minibatches, I use the padding token (index 1 by default) to make all tokenized sentences in the batch the same length. The predictions for the padded sentences are slightly different from the ones I get when I make a single prediction with the unpadded sentence. I tried putting the padding before the sentence, after it, and between the xbos token sequence and the sentence.

  1. Unpadded : ['xbos', 'xfld', '1', 'i', 'like', 'cats']
  2. ['_pad_', 'xbos', 'xfld', '1', 'i', 'like', 'cats']
  3. ['xbos', 'xfld', '1', 'i', 'like', 'cats', '_pad_']
  4. ['xbos', 'xfld', '1', '_pad_', 'i', 'like', 'cats']

As you noted, the hidden state in all cases is different and so are the predictions. Putting the padding in front (line 2) seems to be closest to the unpadded version. Putting the padding at the end (line 3) had the worst predictions in my experiments.

(dsa cryax) #3

Did you guys find out how to deal with this correctly? @JensF @nigel


Nope, I never figured this out. However, GitHub recently published a blog post about their semantic search machine learning model, and it seems like they use fastai’s language model library.

This Medium post has more details on their process, so it might shed some light here. I haven’t had time to look into this properly.

@hamelsmu I saw that you liked my initial post and that you wrote the actual GitHub post. Would you mind sharing some tips? The work you’ve done is really cool!

(Hamel Husain) #5

I would not use padding, and would just retrieve hidden states one example at a time to start with. When you apply padding during inference you can destroy the hidden states, especially because training doesn’t really use padding.

If you apply padding before the text, it will definitely mess up the hidden states. If you apply padding after the text, you have to pull the hidden state from the last timestep before padding is introduced, which will be different for each example, so you will have to do some accounting. Either way, if you want to do this, verify that your hidden states are consistent when you compare doing one example at a time vs. in a batch.
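
To make the "accounting" for post-padding concrete, here is a minimal sketch. It assumes a plain unidirectional PyTorch LSTM as a stand-in for the real encoder (the names emb, rnn, encode_one and encode_batch are made up for illustration, and fastai's encoder is multi-layer and stateful, so details will differ). The idea is to record each document's true length, run the padded batch once, and then index the output at position length - 1 for every row. The last line performs the one-at-a-time vs. batch comparison.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PAD_IDX = 1  # padding index mentioned earlier in the thread

# Hypothetical stand-in encoder: embedding + unidirectional LSTM.
emb = nn.Embedding(50, 8, padding_idx=PAD_IDX)
rnn = nn.LSTM(8, 16, batch_first=True)
emb.eval()
rnn.eval()

def encode_one(ids):
    """Final hidden state for a single unpadded document."""
    with torch.no_grad():
        out, _ = rnn(emb(torch.tensor([ids])))
    return out[0, -1]

def encode_batch(docs):
    """Post-pad, run once, then pick each document's last real timestep."""
    lengths = torch.tensor([len(d) for d in docs])
    batch = torch.full((len(docs), int(lengths.max())), PAD_IDX, dtype=torch.long)
    for i, d in enumerate(docs):
        batch[i, :len(d)] = torch.tensor(d)
    with torch.no_grad():
        out, _ = rnn(emb(batch))  # (batch, max_len, hidden)
    # Timestep length-1 has only seen real tokens, never padding.
    return out[torch.arange(len(docs)), lengths - 1]

docs = [[5, 6, 7, 8, 9], [10, 11, 12], [20, 21, 22, 23]]
batched = encode_batch(docs)
single = torch.stack([encode_one(d) for d in docs])
consistent = torch.allclose(batched, single, atol=1e-5)
```

This only works because the LSTM is unidirectional, so the output at a given timestep does not depend on anything that comes after it. Taking out[:, -1] on the same padded batch would instead return states that have already consumed the padding.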

In the blog post I have a for loop that extracts the embeddings one example at a time because I didn’t have time at that moment to do it in batch.


(Hamel Husain) #6

Also check your assumption about using the last hidden state: do you really think it encodes the whole text, or just the last portion? Compare with averaging the hidden states. It’s important to experiment and think about this carefully.
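
As a rough sketch of the averaging alternative, assuming you already have the per-timestep hidden states for a post-padded batch plus the true lengths (masked_mean is a hypothetical helper name, not a fastai function), the mean can be restricted to real tokens like this:

```python
import torch

def masked_mean(hidden, lengths):
    """Average hidden states over real (non-padded) timesteps only.

    hidden:  (batch, max_len, dim) encoder outputs for post-padded input
    lengths: (batch,) number of real tokens in each document
    """
    steps = torch.arange(hidden.size(1), device=hidden.device)
    # True where the timestep belongs to the document, False on padding.
    mask = (steps[None, :] < lengths[:, None]).unsqueeze(-1)
    summed = (hidden * mask).sum(dim=1)
    return summed / lengths[:, None].to(hidden.dtype)

hidden = torch.randn(2, 4, 3)       # toy activations
lengths = torch.tensor([4, 2])      # second document has 2 padded steps
pooled = masked_mean(hidden, lengths)
```

Whether the mean or the last state works better for clustering or ranking is an empirical question, so it is worth comparing both on your downstream task.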


I see, so you used a for loop too.

“Also check your assumption of using the last hidden state- do you really think it encodes the whole text or just the last portion?”

Good call! I see that you averaged across all hidden states and also mentioned concat pooling in your paper. I’ll keep that in mind, thanks for your response :)
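
For reference, concat pooling is usually described as concatenating the last hidden state with the max-pooled and mean-pooled hidden states. A minimal sketch for a single unpadded document, assuming hidden holds the per-timestep outputs of the encoder (concat_pool is an illustrative name, not a library call):

```python
import torch

def concat_pool(hidden):
    """hidden: (seq_len, dim) for one unpadded document -> (3 * dim,) vector."""
    last = hidden[-1]                       # final timestep
    max_pooled = hidden.max(dim=0).values   # element-wise max over time
    mean_pooled = hidden.mean(dim=0)        # element-wise mean over time
    return torch.cat([last, max_pooled, mean_pooled])

hidden = torch.randn(6, 4)   # toy activations: 6 timesteps, 4 dimensions
doc_vector = concat_pool(hidden)
```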

(dsa cryax) #8

I also used a for loop for batch prediction, but I ran into a memory issue when the batch size is greater than 20 (memory usage kept increasing and then it crashed).
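
One common cause of memory growing inside an inference loop is that the outputs being collected still hold on to their autograd graphs (or stay on the GPU). I can't tell whether that is what happened here, but a minimal sketch of a loop that avoids it looks like this. The emb and rnn modules are toy stand-ins for the real encoder, and embed_documents is a made-up helper name:

```python
import numpy as np
import torch
import torch.nn as nn

# Toy stand-in for the trained encoder.
emb = nn.Embedding(50, 8)
rnn = nn.LSTM(8, 16, batch_first=True)
emb.eval()
rnn.eval()

def embed_documents(docs):
    """Return one vector per document as a (n_docs, hidden) NumPy array."""
    vectors = []
    with torch.no_grad():  # do not build or keep autograd graphs
        for ids in docs:
            out, _ = rnn(emb(torch.tensor([ids])))
            # Store only a small CPU NumPy copy of the final state.
            vectors.append(out[0, -1].cpu().numpy())
    return np.stack(vectors)

features = embed_documents([[5, 6, 7], [8, 9], [10, 11, 12, 13]])
```

If your encoder carries hidden state from one call to the next, it is also worth resetting it between documents so earlier documents don't leak into later ones.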

(Jeremy Howard (Admin)) #9

Also you might want to use pre-padding instead of post-padding, and maybe even insert a little padding into your LM data so your model learns to handle it.
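
Both suggestions can be sketched in a few lines. This assumes token ids are plain Python lists and that the padding index is 1, as mentioned earlier in the thread. The names pre_pad and sprinkle_padding, and the 5% insertion rate, are placeholders for illustration rather than anything from fastai:

```python
import random
import torch

PAD_IDX = 1  # padding index mentioned earlier in the thread

def pre_pad(docs, pad_idx=PAD_IDX):
    """Left-pad so every document ends at the last timestep of the batch."""
    max_len = max(len(d) for d in docs)
    batch = torch.full((len(docs), max_len), pad_idx, dtype=torch.long)
    for i, d in enumerate(docs):
        batch[i, max_len - len(d):] = torch.tensor(d)
    return batch

def sprinkle_padding(token_ids, prob=0.05, pad_idx=PAD_IDX, seed=0):
    """Randomly insert pad tokens into the LM training stream."""
    rng = random.Random(seed)
    out = []
    for tok in token_ids:
        if rng.random() < prob:
            out.append(pad_idx)
        out.append(tok)
    return out

batch = pre_pad([[5, 6, 7], [8, 9]])
stream = list(range(10, 40))
noisy_stream = sprinkle_padding(stream)
```

With pre-padding, the final state for every document sits at the last timestep, so no per-example indexing is needed. The pad tokens are still fed through the model first, though, which is why training the language model on some padding may help.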