Language model: efficient way of getting hidden representation of document

I’ve trained a language model using fastai’s Learner class. From there I’ve extracted the encoder, and with the encoder I can use the final hidden state as a vector representation of a whole document (to use as features for clustering, ranking, classification etc. downstream).

I can loop through each document and get the hidden state one by one without any issues. However, I’m having problems constructing minibatches to get the hidden representations of a set of documents efficiently. To get the activations of a minibatch, PyTorch needs the input sequences padded so they’re all the same length as the longest sequence. However, after inspecting the values and the cosine similarities of the vectors, it looks like these sentences:

encoder("this is a sentence .")
encoder("this is a sentence . <pad> <pad> <pad>")

give very different hidden representations, i.e. the extra padding in a sentence significantly alters the hidden state. My guess is that during language model training the learner never actually sees the <pad> token, so its embedding is poorly defined. (The data for the language model is just a concatenation of all documents.)

What’s the solution here? Iterating through the documents one by one is really inefficient, but I don’t see a way of batching them up correctly. Perhaps during training I could add random sequences of <pad> so the language model learns that those tokens shouldn’t affect the hidden representation? What’s the best way to get hidden states of documents as numpy arrays?

Thanks again Jeremy for your wonderful MOOC!

Did you ever figure this out? I noticed the same thing: whenever I run predictions with minibatches, I use the padding token (index 1 by default) to make all tokenized sentences in the batch the same length. The predictions for the padded sentences are slightly different from those I get when I make a single prediction with the unpadded sentence. I tried putting the padding before the sentence, after it, and between the xbos token sequence and the sentence.

  1. Unpadded : ['xbos', 'xfld', '1', 'i', 'like', 'cats']
  2. ['_pad_', 'xbos', 'xfld', '1', 'i', 'like', 'cats']
  3. ['xbos', 'xfld', '1', 'i', 'like', 'cats', '_pad_']
  4. ['xbos', 'xfld', '1', '_pad_', 'i', 'like', 'cats']

As you noted, the hidden state in all cases is different and so are the predictions. Putting the padding in front (line 2) seems to be closest to the unpadded version. Putting the padding at the end (line 3) had the worst predictions in my experiments.

Did you guys find out how to deal with this correctly? @JensF @nigel

Nope, I never figured this out. However, GitHub recently published a blog post about their semantic search machine learning model, and it seems like they use fastai’s language model library.

This Medium post has more details on their process, so it might shed some light here. I haven’t had time to look into this properly.

@hamelsmu I saw that you liked my initial post and you wrote the actual github post. Would you mind sharing some tips? The work you’ve done is really cool!

I would not use padding and would just retrieve hidden states one example at a time to start with. When you apply padding during inference you can destroy the hidden states, especially because training doesn’t really use padding.

If you apply padding before, it will definitely mess up the hidden states. If you apply padding after, you have to pull the hidden state at the last timestep before padding is introduced, which will be different for each example, so you will have to do some accounting. Either way, if you want to do this, verify that your hidden states are consistent when you compare doing one example at a time vs. in a batch.
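The accounting described above can be sketched roughly as follows. This is only an illustration: a toy tanh RNN in NumPy with random weights stands in for the real fastai encoder, and every name here (`encode`, `E`, `W_x`, `W_h`, the sizes) is hypothetical. It relies on the encoder being unidirectional, so the state at a real token is unaffected by padding that comes after it.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HID, PAD = 20, 8, 16, 1  # PAD = 1 mirrors the default pad index mentioned above

# Toy tanh RNN standing in for the real encoder (hypothetical random weights).
E = rng.normal(size=(VOCAB, EMB))
W_x = rng.normal(size=(EMB, HID)) * 0.3
W_h = rng.normal(size=(HID, HID)) * 0.3

def encode(batch):
    """batch: (bs, seq_len) int array -> hidden states of shape (bs, seq_len, HID)."""
    bs, seq_len = batch.shape
    h = np.zeros((bs, HID))
    states = []
    for t in range(seq_len):
        h = np.tanh(E[batch[:, t]] @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states, axis=1)

docs = [[5, 6, 7, 8, 9], [3, 4], [10, 11, 12]]
lengths = np.array([len(d) for d in docs])

# Post-pad every document to the longest length in the batch.
batch = np.full((len(docs), lengths.max()), PAD)
for i, d in enumerate(docs):
    batch[i, :len(d)] = d

states = encode(batch)
# The accounting step: pick the state at each document's last real token.
last_real = states[np.arange(len(docs)), lengths - 1]
# Naive alternative: the state at the final timestep, after any padding.
last_step = states[:, -1]

# Reference: one document at a time, no padding.
one_by_one = np.stack([encode(np.array([d]))[0, -1] for d in docs])

# The consistency check suggested above: batch vs. one example at a time.
consistent = np.allclose(last_real, one_by_one)
```

With the real encoder, the same comparison (`np.allclose` between the batched and one-at-a-time vectors) is the check worth running before trusting any batched features.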

In the blog post I have a for loop that extracts the embeddings one example at a time because I didn’t have time at that moment to do it in batch.


Also check your assumption of using the last hidden state: do you really think it encodes the whole text, or just the last portion? Compare with averaging the hidden states. It’s important to experiment and think about this carefully.


I see, so you used a for loop too.

Also check your assumption of using the last hidden state: do you really think it encodes the whole text, or just the last portion?

Good call! I see that you averaged across all hidden states and also mentioned concat pooling in your paper. I’ll keep that in mind, thanks for your response :slight_smile:

I also used a for loop for batch prediction, but ran into a memory issue when the batch size is greater than 20 (memory usage kept increasing until it crashed).

Also you might want to use pre-padding instead of post-padding, and maybe even insert a little padding into your LM data so your model learns to handle it.
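For reference, the difference between the two padding schemes can be sketched with a small helper. `pad_batch` is a hypothetical function written for this illustration, not part of fastai; the pad index of 1 is the default mentioned earlier in the thread. With pre-padding the last column of the batch is always a real token, so no per-example index is needed, but the pad tokens are fed through the encoder first, so this presumably only helps if the model has learned to ignore them.

```python
PAD_IDX = 1  # default pad index mentioned earlier in the thread

def pad_batch(docs, pad_idx=PAD_IDX, pad_first=True):
    """Pad lists of token ids to a common length; return (padded, lengths)."""
    max_len = max(len(d) for d in docs)
    padded = []
    for d in docs:
        fill = [pad_idx] * (max_len - len(d))
        padded.append(fill + list(d) if pad_first else list(d) + fill)
    return padded, [len(d) for d in docs]

docs = [[5, 6, 7, 8], [3, 4]]
pre, lengths = pad_batch(docs, pad_first=True)
post, _ = pad_batch(docs, pad_first=False)
# pre  -> [[5, 6, 7, 8], [1, 1, 3, 4]]
# post -> [[5, 6, 7, 8], [3, 4, 1, 1]]
```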


Please let me know if this solves the problem. I am still wrestling with hidden states, but I propose the following:

Let's assume you have constructed a tensor called tensor_variable of shape (batch_size, max_length). Here, the text will be padded.

hidden_states = []
for i in range(max_length):
    # column i holds the i-th token of every document in the batch
    xb = tensor_variable[:, i]
    output = learner.model(xb)
    # hidden state of the last layer after this timestep
    hidden_states.append(learner.model[0].hidden[2][0])

If my understanding is correct, this loops over dimension 1 of tensor_variable, grabs column i, and passes it through the forward function. You then grab the hidden state after that forward pass and append to a list. You now have grabbed all of the hidden states generated after passing in each numericalized token in your batch.

When you want to average the hidden states, keep track of how many elements of hidden_states to average over for each row by comparing against the exact length of the sentence in row i. This way, you avoid averaging over any padding.

Note, the above is constructed to monitor only the hidden state of the last layer ( hidden[2][0] ). I am not sure if this is what you are looking for.
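The length-aware averaging described above might look like the sketch below. It assumes post-padding and that the per-timestep states have already been stacked into one array of shape (bs, seq_len, hid), for example via `np.stack(..., axis=1)` on a list of per-timestep arrays. `masked_mean` is a hypothetical helper name, and plain NumPy arrays stand in for the real hidden states.

```python
import numpy as np

def masked_mean(states, lengths):
    """states: (bs, seq_len, hid) from a post-padded batch; lengths: (bs,) real token counts."""
    states = np.asarray(states, dtype=float)
    lengths = np.asarray(lengths)
    # mask[i, t] is True where position t holds a real token of row i
    mask = np.arange(states.shape[1])[None, :] < lengths[:, None]
    # zero out padded positions, then divide by the true length of each row
    summed = (states * mask[:, :, None]).sum(axis=1)
    return summed / lengths[:, None]

# Two documents, three timesteps, hidden size 2; the second document has one real token.
states = np.arange(12, dtype=float).reshape(2, 3, 2)
avg = masked_mean(states, [3, 1])
# avg -> [[2., 3.], [6., 7.]]
```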

I wanted to follow up with this approach I thought of over the weekend. You can use the functions present in a text_classifier_learner directly, as Jeremy has written the functions to do batch featurization for text classification already.

Let's assume you have a df of text (corpus_train) you would like to featurize. You could do:

corpus_valid = corpus_train.copy()
corpus_test  = corpus_train.copy()

#---you could make bs equal to as many docs as you want to featurize, if it fits in your gpu
databunch = TextClasDataBunch.from_df(path=path, train_df=corpus_train, valid_df=corpus_valid, test_df=corpus_test, bs = 64, tokenizer=tok, include_bos=False, include_eos=False,text_cols='smiles', vocab=vocab)
new_learner = text_classifier_learner(databunch, AWD_LSTM, drop_mult=0.5, pretrained=False)
new_learner.load_encoder('path to your saved encoder from your language model')
enc = new_learner.model[0]

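#---grab the batch you want to featurize.
#---assumption: the source of the batch was lost from the original line; here it is taken from the test dataloader
x = next(iter(databunch.test_dl))[0].to(device='cuda')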
raw_outputs, outputs, mask = enc(x)
#---i simply copied and pasted the function masked_concat_pool from fastai as a new function in my script called custom_masked_concat_pool so as not to overwrite anything
features = custom_masked_concat_pool(outputs, mask)

I still do not fully understand how Jeremy handles the padding tokens here. I think the mask keeps them from contributing to the features, but I need to study this more thoroughly.
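To make the idea of the mask concrete, here is a rough NumPy sketch of a masked concat pool. It is not fastai's code: it assumes the mask is True at padding positions and that the batch is post-padded (fastai's own function may rely on a different padding convention), and `masked_concat_pool_sketch` is a hypothetical name. The point is only that padded positions are set to 0 before averaging and to -inf before max-pooling, so they cannot influence either pooled vector.

```python
import numpy as np

def masked_concat_pool_sketch(states, pad_mask):
    """states: (bs, seq_len, hid); pad_mask: (bs, seq_len), True at padding positions.
    Assumes post-padding and at least one real token per row."""
    states = np.asarray(states, dtype=float)
    pad_mask = np.asarray(pad_mask, dtype=bool)
    lengths = (~pad_mask).sum(axis=1)
    rows = np.arange(states.shape[0])
    # state at the last real token of each row
    last = states[rows, lengths - 1]
    # padded positions contribute 0 to the sum; divide by the real length
    avg = np.where(pad_mask[:, :, None], 0.0, states).sum(axis=1) / lengths[:, None]
    # padded positions are -inf, so they never win the max
    mx = np.where(pad_mask[:, :, None], -np.inf, states).max(axis=1)
    return np.concatenate([last, mx, avg], axis=1)  # (bs, 3 * hid)

states = np.arange(12, dtype=float).reshape(2, 3, 2)
pad_mask = np.array([[False, False, False],
                     [False, True, True]])
feats = masked_concat_pool_sketch(states, pad_mask)
# feats -> [[4., 5., 4., 5., 2., 3.], [6., 7., 6., 7., 6., 7.]]
```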