What’s the best approach to get semantic sentence similarity (two sentences with different wording but the same meaning) in Fastai? I looked at Google’s Universal Sentence Encoder (512-dimensional vector), which works well, but it is only available in a few languages and can’t be trained outside Google.
I have an idea, inspired by DeViSE: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41473.pdf
You can train a classification model (or maybe some seq2seq model?) and take only the output of the encoder, which will be some D-dimensional vector. Suppose we are talking about a seq2seq translation model: intuitively, two English sentences with the same meaning should produce similar encoder outputs (otherwise the translation would never be accurate). You can then compute cosine similarities between the encoder activations of different sentences, and sentences with similar meaning should have ‘similar vectors’ too.
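To make the cosine-similarity step concrete, here is a minimal sketch using only NumPy. It assumes you have already pulled one D-dimensional vector per sentence out of your encoder; the array `sentence_vectors` and the helper `cosine_similarity_matrix` are made up for illustration (they are not fastai APIs), and the numbers are toy values, not real encoder activations.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for an (n_sentences, D) array."""
    # Scale each row to unit length; the clip guards against division by zero.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    # Dot products of unit vectors are cosine similarities.
    return unit @ unit.T

# Hypothetical encoder outputs (D = 4), one row per sentence.
sentence_vectors = np.array([
    [0.9, 0.1, 0.0, 0.3],  # e.g. "How old are you?"
    [0.8, 0.2, 0.1, 0.4],  # e.g. "What is your age?"
    [0.0, 0.9, 0.7, 0.0],  # e.g. "The train leaves at noon."
])

sims = cosine_similarity_matrix(sentence_vectors)

# For each sentence, find the most similar *other* sentence.
masked = sims.copy()
np.fill_diagonal(masked, -np.inf)  # ignore self-similarity
nearest = masked.argmax(axis=1)
```

With real encoder outputs you would stack the per-sentence activations into the same `(n_sentences, D)` shape. How you pool a sequence of hidden states into a single vector (last state, mean, max) is a design choice this sketch leaves open.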
Take a look at @lesscomfortable’s article on “duplicate image finder”: maybe you can use a similar approach…
Thanks @dreambeats and @ste for your suggestions.
I thought FastAI would have an easy way of doing this (similar to np.allclose(encoder weights1,2) as in @rachel’s NLP course notebook https://github.com/fastai/course-nlp/blob/master/4-nn-imdb.ipynb ), but that works for words only.
If you missed it, there is additional discussion on this issue (semantic sentence similarity) in this thread: