I’ve done the same thing as you, although I’ve created a huge training set by using the transformation vector in reverse to corrupt correct spellings. You can basically iterate: start with a few examples to create an initial vector, use this to generate a larger set of misspellings, then use those to build a better transformation vector.
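A minimal sketch of that idea, with made-up toy vectors standing in for a real embedding model (in practice you’d load word2vec/GloVe vectors; all words and values below are illustrative assumptions):

```python
import numpy as np

# Toy word vectors -- in practice these come from a trained embedding
# model (e.g. word2vec / GloVe). Values here are invented so the
# 'spelling' direction is roughly the second axis.
vecs = {
    "because":    np.array([1.0,  0.2]),
    "becuase":    np.array([1.0, -0.8]),
    "definitely": np.array([0.9,  0.3]),
    "definately": np.array([0.9, -0.7]),
    "separate":   np.array([0.8,  0.25]),
    "seperate":   np.array([0.8, -0.75]),
}

# Transformation vector: the average offset from misspelled -> correct,
# estimated from a few seed pairs.
pairs = [("becuase", "because"), ("definately", "definitely")]
transform = np.mean([vecs[c] - vecs[m] for m, c in pairs], axis=0)

def nearest(query, exclude=()):
    """Nearest vocab word to `query` by cosine similarity."""
    best, best_sim = None, -np.inf
    for w, v in vecs.items():
        if w in exclude:
            continue
        sim = query @ v / (np.linalg.norm(query) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Apply the vector in reverse to 'corrupt' a correct spelling:
corrupted = nearest(vecs["separate"] - transform, exclude={"separate"})
print(corrupted)  # with these toy vectors, lands on "seperate"
```

The new misspellings found this way can be paired with their correct forms and fed back in as extra `pairs`, giving the iterative refinement described above.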
Another interesting thing you can do is to ‘score’ a word for how likely it is to be misspelled. You just take the dot product of the word vector with the transformation vector to see how far it points in the spelling direction. This gives a bit more of a clue as to why we have this pattern in the vectors. If you score all the vocab in this way there is a wide distribution of scores. At one end you find lots of business and government related words, and at the other you get very informal language such as ‘plz’ and ‘thnkx’ (and spelling mistakes + typos). So, I think the score is measuring something like the formality of the text: carefully edited and proofed texts (as you are likely to find consistently in business news stories) sit at one end of the distribution, which extends through to the informal language used in unedited comments in forums / emails / twitter.
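The scoring step is just a dot product per word. A hedged sketch (the `transform` vector and word vectors are placeholders for whatever your model produced; the ordering below only reflects these toy values):

```python
import numpy as np

# Assume `transform` is the misspelled -> correct direction computed
# earlier; here it is hard-coded to match the toy setup.
transform = np.array([0.0, 1.0])

# Illustrative vectors: formal words point along `transform`,
# informal tokens point against it.
vecs = {
    "government": np.array([0.5,  0.9]),
    "business":   np.array([0.6,  0.8]),
    "thnx":       np.array([0.3, -0.8]),
    "plz":        np.array([0.4, -0.9]),
}

# Score each word by how far it points in the spelling/formality direction.
scores = {w: float(v @ transform) for w, v in vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # formal vocabulary first, informal tokens last
```

Sorting the whole vocabulary by this score is what produces the distribution described above, with business/government words at one end and tokens like ‘plz’ at the other.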
It then starts to make sense why the model which built the vectors has behaved in this way: when it sees words like ‘plz’ and ‘thnx’ it knows it is more likely to encounter incorrectly spelled versions of words, and the reverse is true when it sees mention of ‘businesses’.
Regarding a full write-up, I’m in the middle of it. I’ve got a job interview tomorrow, but will hopefully get something published - along with the code - on Wednesday.