I’ve trained a language model using fastai’s
Learner class. From there I’ve extracted the encoder, and I can use the encoder’s final hidden state as a vector representation of a whole document (to use as features for clustering, ranking, classification, etc. downstream).
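
For reference, here’s a rough sketch of the single-document version that works fine for me (`encoder` is the module I pulled out of the Learner and `numericalize` turns a tokenized document into a LongTensor of token ids — both names are placeholders for my own setup, and I’m assuming the encoder returns activations of shape (seq_len, batch, n_hid)):

```python
import torch

def doc_vector(encoder, token_ids):
    encoder.eval()
    with torch.no_grad():
        # batch of a single document: (seq_len, 1)
        outputs = encoder(token_ids.unsqueeze(1))   # (seq_len, 1, n_hid)
        return outputs[-1, 0].cpu().numpy()         # final hidden state as the doc vector
```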
I can loop through each document and get the hidden state one by one without any issues. However, I’m having trouble constructing minibatches so that I can get the hidden representations of a set of documents efficiently. To get the activations for a minibatch, PyTorch has to pad the input sequences so they’re all the same length as the longest sequence in the batch. But upon inspecting the values and the cosine similarities of the vectors, it looks like these two inputs:
encoder("this is a sentence .")
encoder("this is a sentence . <pad> <pad> <pad>")
give very different hidden representations, i.e. the extra padding significantly alters the hidden state. My guess is that this happens because during training of the language model the learner never actually sees the
<pad> token, so its embedding is poorly defined. (The data for the language model is just a concatenation of all the documents.)
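
Here’s roughly how I’m checking this (again, `encoder`, `numericalize` and `PAD_ID` are placeholders for my own setup, and I’m assuming the encoder output has shape (seq_len, batch, n_hid)):

```python
import torch
import torch.nn.functional as F

with torch.no_grad():
    ids = numericalize("this is a sentence .")                          # LongTensor, shape (5,)
    padded = torch.cat([ids, torch.full((3,), PAD_ID, dtype=torch.long)])

    v1 = encoder(ids.unsqueeze(1))[-1, 0]       # final hidden state, no padding
    v2 = encoder(padded.unsqueeze(1))[-1, 0]    # final hidden state, with trailing <pad>s

    print(F.cosine_similarity(v1, v2, dim=0))   # comes out far from 1.0 for me
```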
What’s the solution here? Iterating through each document one by one is really inefficient, but I don’t see a way of batching them up correctly. Perhaps during training I could add random sequences of
<pad> so that the language model learns that those tokens shouldn’t affect the hidden representation? What’s the best way to get the hidden states of documents as numpy arrays?
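
For concreteness, this is more or less what I’d like the batched version to look like (same assumptions as above about `encoder` and its output shape), except that the trailing <pad>s distort the vectors of the shorter documents as described:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def doc_vectors(encoder, docs, pad_id):
    """docs: list of 1-D LongTensors of token ids -> (n_docs, n_hid) numpy array."""
    encoder.eval()
    with torch.no_grad():
        batch = pad_sequence(docs, padding_value=pad_id)   # (max_len, n_docs)
        outputs = encoder(batch)                           # (max_len, n_docs, n_hid)
        # final time step for every doc -- which, for shorter docs, falls on a
        # <pad> position, which is exactly the problem described above
        return outputs[-1].cpu().numpy()
```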
Thanks again Jeremy for your wonderful MOOC!