I’m trying to make sentence vectors that encode semantic similarity. The basic idea is that sentences with the same meaning should have vectors that are close together.
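To be concrete about what “close together” means here, the metric I have in mind is cosine similarity (my choice, not something fixed by the approach). A toy example with made-up vectors:

```python
import torch
import torch.nn.functional as F

# Two hypothetical sentence vectors that should be similar,
# and one unrelated vector (all values invented for illustration).
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([1.1, 1.9, 3.2])    # near-duplicate of a
c = torch.tensor([-3.0, 0.5, -1.0])  # unrelated

sim_ab = F.cosine_similarity(a, b, dim=0)  # high: vectors point the same way
sim_ac = F.cosine_similarity(a, c, dim=0)  # low/negative: different directions
print(sim_ab.item(), sim_ac.item())
```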
Here’s what I’ve done so far.
Retrained the language model from the lesson 10 IMDB notebook on text from the Stanford Natural Language Inference (SNLI) corpus.
Verified that it trained properly by inspecting the sentences it produces: new sentences sound like ones from the corpus, and training achieved a perplexity of 14.80.
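As a sanity check on that number: perplexity is just the exponential of the per-token cross-entropy loss, so 14.80 corresponds to a validation loss of about 2.695.

```python
import math

# Perplexity = exp(mean cross-entropy loss per token),
# so the loss implied by a perplexity of 14.80 is:
loss = math.log(14.80)
print(round(loss, 3))  # 2.695
```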
Added a new head to the model that pools the hidden states into a single vector
Here is the new head class, based on the PoolingLinearClassifier:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingVector(nn.Module):
    def __init__(self):
        super().__init__()

    def pool(self, x, bs, is_max):
        # Pool over the sequence dimension: (sl, bs, nh) -> (bs, nh)
        f = F.adaptive_max_pool1d if is_max else F.adaptive_avg_pool1d
        return f(x.permute(1, 2, 0), (1,)).view(bs, -1)

    def forward(self, input):
        raw_outputs, outputs = input
        output = outputs[-1]  # hidden states of the last RNN layer
        sl, bs, _ = output.size()
        avgpool = self.pool(output, bs, False)
        mxpool = self.pool(output, bs, True)
        # Concatenate last time step, max pool, and average pool
        x = torch.cat([output[-1], mxpool, avgpool], 1)
        return x, raw_outputs, outputs
```
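To double-check my understanding of the shapes: with hidden size 400 (the lesson 10 default, an assumption on my part), the head should produce a 1200-dimensional vector per sentence (last time step + max pool + avg pool). A minimal shape check using random data as a stand-in for the encoder output:

```python
import torch
import torch.nn.functional as F

sl, bs, nh = 5, 2, 400  # seq_len, batch size, hidden size (assumed lesson-10 default)
output = torch.randn(sl, bs, nh)  # stand-in for the encoder's last-layer hidden states

def pool(x, bs, is_max):
    # Same pooling as the head: (sl, bs, nh) -> (bs, nh)
    f = F.adaptive_max_pool1d if is_max else F.adaptive_avg_pool1d
    return f(x.permute(1, 2, 0), (1,)).view(bs, -1)

vec = torch.cat([output[-1], pool(output, bs, True), pool(output, bs, False)], 1)
print(vec.shape)  # torch.Size([2, 1200])
```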
When I compare similar sentence pairs from the corpus, their vectors are not much closer to each other than those of unrelated pairs.
Here’s a histogram of the entailed and non-entailed sentence pair distances:
I was expecting to see little overlap between the two distance distributions, but as you can see there is a large overlap, far too much for the vectors to be useful.
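In case it helps to quantify the overlap rather than eyeball the histogram: given the two arrays of distances (hypothetical values below, not my actual data), an AUC-style score gives the probability that a random entailed pair is closer than a random non-entailed one (1.0 = perfect separation, 0.5 = useless).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical distance samples for illustration; in my data the
# two distributions overlap heavily.
entailed = rng.normal(0.9, 0.2, 1000)      # distances for entailed pairs
non_entailed = rng.normal(1.0, 0.2, 1000)  # distances for unrelated pairs

# P(entailed distance < non-entailed distance), an AUC-style score
score = (entailed[:, None] < non_entailed[None, :]).mean()
print(round(score, 2))
```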
Is this what I should expect?
Thanks for any insight, I’m baffled and frustrated.