As an alternative, I was wondering about using word vector similarities to find corrections for common spelling errors. Pre-trained vectors like GloVe contain lots of spelling mistakes, and I thought these would be close (ideally closest) to their correct counterparts.
It turns out that this isn’t true, but the reality is more interesting: all the spelling mistakes are clustered together. So, for example, if you search for the nearest neighbours of “relieable” you get:
['relieable', 'relyable', 'realible', 'relable', 'reliabe', 'realiable', 'relaiable', 'relaible', 'trustworth', 'trustfull', 'consitant', 'stabel', 'accuarate', 'acurrate', 'accruate']
Note that the correct spelling isn’t anywhere to be seen. In fact, it’s miles away - there are 424,816 words closer (using cosine distance) to “relieable” than “reliable” is!
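Here is a minimal sketch of how a neighbour list and a "words closer" count like the ones above could be computed. The three-dimensional toy vectors and the helper names (`cosine_similarity`, `nearest_neighbours`, `words_closer_than`) are made up for illustration. With real pre-trained vectors you would load the published file and almost certainly use NumPy rather than plain lists.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(query_vec, vectors, k=5):
    """Return the k words whose vectors are most similar to query_vec."""
    ranked = sorted(
        vectors,
        key=lambda w: cosine_similarity(query_vec, vectors[w]),
        reverse=True,
    )
    return ranked[:k]

def words_closer_than(query, target, vectors):
    """Count words (other than query and target) closer to query than target is."""
    q = vectors[query]
    target_sim = cosine_similarity(q, vectors[target])
    return sum(
        1
        for w, v in vectors.items()
        if w not in (query, target) and cosine_similarity(q, v) > target_sim
    )

# Toy data (assumption): the third dimension stands in for "misspelled-ness".
toy_vectors = {
    "relieable": [0.9, 0.1, 1.0],
    "relyable": [0.8, 0.2, 1.0],
    "trustworth": [0.7, 0.3, 0.9],
    "reliable": [0.9, 0.1, -1.0],
    "accurate": [0.7, 0.3, -0.9],
}

print(nearest_neighbours(toy_vectors["relieable"], toy_vectors, k=3))
print(words_closer_than("relieable", "reliable", toy_vectors))
```

On this toy data the neighbours of "relieable" are the other misspellings, and two words sit closer to it than "reliable" does. Whether the query word itself is counted is a choice; this sketch leaves it out.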
So, I wondered if you could ‘correct’ a spelling by applying a transformation to move us from the spelling mistakes area of vector space to the correctly spelled words area. It turns out you can.
I’ve taken the average difference between the vectors of the first 8 misspellings in the list above and the vector for “reliable”. This gives the transformation vector. We simply subtract it from the vector of a misspelled word to shift us into the correctly spelled words area, and then look for nearest neighbours.
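The transformation can be sketched roughly as follows. This is an illustration on toy three-dimensional vectors, not the exact code behind the results here. The names `build_transform` and `corrected_neighbours` are hypothetical, and only two misspellings are averaged rather than eight.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_neighbours(query_vec, vectors, k=5):
    ranked = sorted(
        vectors,
        key=lambda w: cosine_similarity(query_vec, vectors[w]),
        reverse=True,
    )
    return ranked[:k]

def build_transform(misspellings, correct, vectors):
    """Average of (misspelling vector - correct vector) over the misspellings."""
    target = vectors[correct]
    diffs = [[m - c for m, c in zip(vectors[w], target)] for w in misspellings]
    return [sum(column) / len(diffs) for column in zip(*diffs)]

def corrected_neighbours(word, transform, vectors, k=5):
    """Subtract the transform from the word's vector, then look up neighbours."""
    shifted = [x - t for x, t in zip(vectors[word], transform)]
    return nearest_neighbours(shifted, vectors, k)

# Toy data (assumption): the third dimension stands in for "misspelled-ness".
toy_vectors = {
    "relieable": [0.9, 0.1, 1.0],
    "relyable": [0.8, 0.2, 1.0],
    "reliable": [0.9, 0.1, -1.0],
    "becuase": [0.1, 0.9, 1.0],
    "becasue": [0.2, 0.8, 1.0],
    "because": [0.1, 0.9, -1.0],
}

transform = build_transform(["relieable", "relyable"], "reliable", toy_vectors)
print(nearest_neighbours(toy_vectors["becuase"], toy_vectors, k=2))
print(corrected_neighbours("becuase", transform, toy_vectors, k=2))
```

On the toy data the untransformed neighbours of "becuase" are itself and "becasue". After the shift the top neighbour is "because", followed by "reliable", since the transform was built only from misspellings of that word.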
The table below shows a few examples of the nearest neighbours, both of the incorrectly spelled word and of the transformed word.
| misspelled word | neighbours of misspelled word | neighbours of transformed word |
| --- | --- | --- |
| becuase | becuase; becasue; beacuse; b/c; becouse | **because**; even; fact; sure; though |
| definately | definately; definetly; definatly; definitly; definitely | **definitely**; sure; certainly; well; really |
| consistant | consistant; consistantly; inconsistant; consistent; consitant | **consistent**; reliable; consistant; consistently; accurate |
| pakage | pakage; packge; pacage; pacakge; packege | **package**; packages; pakage; reliable; offer |
| basicly | basicly; basicaly; jsut; actualy; bascially | **basically**; simply; just; actually; only |
| ocur | ocur; occour; occurr; occure; happpen | ocur; **occur**; arise; happen; reliably |
In every case except the last, the closest neighbour to the transformed word is the correct spelling (in bold). Note also that the transformed results are skewed towards ‘reliable’. I’m sure you could get better results by building a transformation vector based on a wider sample than just misspellings of reliable.
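One way a wider sample might be used is to average the difference over (misspelling, correct) pairs drawn from many different words, so that no single word dominates the direction. The sketch below is hypothetical: the function name `transform_from_pairs` and the toy vectors are made up, and I haven't tested whether this actually improves results on real vectors.

```python
def transform_from_pairs(pairs, vectors):
    """Average (misspelling - correct) over pairs taken from different words."""
    diffs = [
        [w - r for w, r in zip(vectors[wrong], vectors[right])]
        for wrong, right in pairs
    ]
    return [sum(column) / len(diffs) for column in zip(*diffs)]

# Toy data (assumption): the third dimension stands in for "misspelled-ness".
toy_vectors = {
    "relieable": [0.8, 0.2, 1.0],
    "reliable": [0.9, 0.1, -1.0],
    "becuase": [0.2, 0.8, 1.0],
    "because": [0.1, 0.9, -1.0],
}

pairs = [("relieable", "reliable"), ("becuase", "because")]
wide_transform = transform_from_pairs(pairs, toy_vectors)
print([round(x, 6) for x in wide_transform])
```

In this toy setup the word-specific parts of the two differences cancel out, leaving only the shared "misspelled-ness" component.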
Interesting that this is such a consistent result. It apears to sugest that speling erors and typoes strongly co-ocur, rather than appearing in isolation?