No - there is no training. All I’m doing is taking the average difference across pairs of word vectors. It’s the same idea as the word vector maths introduced in the original Mikolov word2vec paper. In that paper they give the example that vector(‘King’) - vector(‘Man’) + vector(‘Woman’) results in a vector that is closest (in cosine distance) to the vector for ‘Queen’.
You could re-interpret that equation as a ‘royalty’ transformation vector = vector(‘King’) - vector(‘Man’), which is then applied to the word ‘Woman’. Presumably you could also apply it to ‘boy’ to get ‘prince’, or to ‘girl’ to get ‘princess’.
All I have done is replace ‘Man’ and ‘King’ with ‘misspelled word’ and ‘correctly spelled word’.
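For anyone who wants to try it, here is a minimal sketch of the idea in Python. The seed pairs and the probe word are just placeholder examples, and it assumes you have a GloVe text file locally (loading the full glove.840B.300d.txt needs a few GB of RAM; one of the smaller releases is fine for experimenting):

```python
import numpy as np

def load_glove(path, dim=300):
    """Parse a GloVe text file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            # The 840B file has a few multi-token keys, so split from the right.
            word, values = ' '.join(parts[:-dim]), parts[-dim:]
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

vec = load_glove('glove.840B.300d.txt')

# Hypothetical seed pairs of (misspelled, correctly spelled) words.
pairs = [('becuase', 'because'), ('recieve', 'receive'), ('untill', 'until')]

# The spelling transformation: the average of the (correct - misspelled)
# difference vectors, exactly as with vector('King') - vector('Man').
transform = np.mean([vec[good] - vec[bad] for bad, good in pairs], axis=0)

def nearest(target, vectors, exclude=(), topn=5):
    """Words whose vectors have the highest cosine similarity to target."""
    target = target / np.linalg.norm(target)
    sims = {w: float(v @ target / np.linalg.norm(v))
            for w, v in vectors.items() if w not in exclude}
    return sorted(sims, key=sims.get, reverse=True)[:topn]

# Apply the transformation to a new misspelling; the top hit should be
# (or be near) the correct spelling.
print(nearest(vec['definately'] + transform, vec, exclude={'definately'}))
```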
A few other interesting points I’ve found after messing around with things a bit more:
- You can also find examples of incorrect spellings by applying the transformation in reverse, i.e. subtracting it from a correctly spelled word. This lets you build a broader set of correct/incorrect spelling pairs, and in turn a more accurate transformation vector (there is a short sketch of this after the list).
- Assuming my earlier thoughts are correct that this is caused by the existence of spelling correction software, then you might not find the same pattern in all the pre-trained word vectors. I am using the GloVe 840B vectors, which are trained on Common Crawl data, i.e. a very wide distribution of text found online, including ‘natural’ non-spell-checked language. The pre-trained word2vec vectors are trained on a Google News dataset, which is presumably much more likely to have been spell-checked, so there might not be such a clear distinction between correct and incorrect spellings.
- The GloVe vectors are cased (i.e. they include separate vectors for words IN CAPITALS). You can create similar transformation vectors to either capitalise a whole word or just its initial letter (also sketched below). Perhaps, if you could find enough examples, there might also be a transformation for abbreviations to account for the ‘because’ => ‘b/c’ case you mentioned in an earlier comment.
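Continuing the sketch above (reusing vec, nearest, and transform from it), the reverse and capitalisation versions look roughly like this; again, every word list here is just an illustrative placeholder:

```python
# Reverse direction: subtracting the spelling transform from a correct
# spelling should land near its common misspellings.
print(nearest(vec['because'] - transform, vec, exclude={'because'}))

# Capitalisation transform, averaged over a few (lowercase, CAPITALS) pairs.
caps_pairs = [('because', 'BECAUSE'), ('hello', 'HELLO'), ('please', 'PLEASE')]
caps_transform = np.mean([vec[up] - vec[low] for low, up in caps_pairs], axis=0)
print(nearest(vec['thanks'] + caps_transform, vec, exclude={'thanks'}))

# The initial-letter version works the same way with ('word', 'Word') pairs.
```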