I’m doing some clustering work on sentences. If I use spaCy vectors (averaged over the sentence’s tokens) or InferSent vectors, I can get some separation into at least 2 (maybe 3) meaningful clusters.
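For reference, the spaCy baseline is just mean-pooled token vectors. A minimal sketch of that pooling step, with random vectors standing in for real spaCy embeddings so it runs without a model download:

```python
import numpy as np

def sentence_vector(token_vectors):
    """Average per-token vectors into a single sentence vector (mean pooling)."""
    return np.mean(token_vectors, axis=0)

# Toy stand-in: 5 tokens, 300-d vectors (the size spaCy's medium models use).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 300))
vec = sentence_vector(tokens)
print(vec.shape)  # (300,)
```

With real spaCy this is what `doc.vector` gives you for free when the model has word vectors.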
I was hoping to improve on that using ULMFiT, but it isn’t working at all (i.e., no separation into any clusters, similar sentences don’t land in similar regions after dimensionality reduction, etc.).
My process is this: run the text through a fine-tuned ULMFiT model, take the final hidden state of the last LSTM layer (400 dimensions), and cluster on that.
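Roughly, the extraction step I mean is the following. This is a generic PyTorch sketch, not the actual fastai internals: the real AWD-LSTM layers have different hidden sizes and the tokenization/model loading is elided, but it shows which tensor I’m taking:

```python
import torch
import torch.nn as nn

# Simplified stand-in for the encoder: 3 stacked LSTM layers, 400-d hidden state.
torch.manual_seed(0)
emb_dim, hidden_dim, n_layers = 400, 400, 3
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=n_layers, batch_first=True)

# Fake embedded input: a batch of 1 sentence with 7 tokens.
x = torch.randn(1, 7, emb_dim)
output, (h_n, c_n) = lstm(x)

# Final hidden state of the last layer -> the 400-d vector I cluster on.
sentence_vec = h_n[-1].squeeze(0)  # shape: (400,)

# For a uni-directional LSTM this equals the output at the last timestep:
assert torch.allclose(sentence_vec, output[0, -1])
```

If the real extraction is grabbing some other tensor (e.g. an intermediate layer’s state, or a padded timestep), that could explain the poor clustering.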
I’ve tried PCA before clustering and tried without it. I’ve also tried concatenating the states of all 3 LSTM layers.
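The PCA step is the standard one; a numpy-only sketch of what I’m doing (scikit-learn’s `PCA` would do the same), with random data standing in for the sentence vectors:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)  # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Stand-in data: 200 sentence vectors of 400 dimensions.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 400))
reduced = pca_reduce(vectors, 50)
print(reduced.shape)  # (200, 50)
```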
It’s possible my process for extracting the final LSTM state is wrong. I can’t find example code for it anywhere, and it looks correct to me (and feeding in different text produces different numbers, so… good?).
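A slightly stronger sanity check than “different text gives different numbers” would be verifying that near-paraphrases score higher cosine similarity than unrelated sentences. A sketch with placeholder vectors (in practice these would come from the encoder):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
base = rng.normal(size=400)                      # "sentence A"
paraphrase = base + 0.1 * rng.normal(size=400)   # near-duplicate of A
unrelated = rng.normal(size=400)                 # independent sentence

# A sane embedding should rank the paraphrase above the unrelated sentence.
assert cosine(base, paraphrase) > cosine(base, unrelated)
```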
Classification on the same kind of text using the same model works well, so the model can clearly separate this data when supervised.
Is there any reason why I should expect this to work worse than word or sentence embeddings?