NLP: Any libraries/dictionaries out there for fixing common spelling errors?

Ah I see! I had not seen that you were using the GloVe word vectors.

As far as I can tell from their paper, they are not using regularization to train the embedding, so I share your surprise about the sparseness of the “correctly spelled word” vector.

Interesting, I am of the completely opposite opinion. My feeling is that this could be a fundamental flaw in embeddings, and that fixing it could help to improve learning substantially. It is like how the human mind works when seeing spelling errors: first we map a misspelled word onto the correct semantic representation, and then we connect the different semantic clusters into a meaningful sentence or paragraph. Having parallel semantic clusters just adds extra confusion and parameters to the model. I just can’t see how this kind of representation would give an edge over a more efficient one. Of course, I am not a qualified person to make an argument here; I’m just trying to think about what seems to make more sense.

1 Like

Yo @er214, which glove embedding are you looking at?

I just pulled the glove.6B.300d but I don’t see the misspellings you are seeing.

1 Like

It was Common Crawl, so it might be the 42B-token embedding.

This has really interesting implications for chatbots and fake profiles. If you can probabilistically apply this vector to generated text, you can easily create a more varied user, one that’s more believably human.

2 Likes

Damn, you’re right.

Or turn it around: you could use this for detection by looking for sentence construction that is a little too good. I could see this even being used in a captcha-like manner.

It actually shouldn’t be a surprise that typos co-occur. This is primarily because of the strong link between vowel/consonant identities and the small set of simple transformations that produce typos (e.g. substitutions, transpositions, etc.). There is quite a lot of literature around Graphemic Buffer Disorder that explains this in detail, but even those without an acquired disorder exhibit similar behaviour due to fatigue or attention lapses.
How are you transforming the word in order to then find its neighbours? I.e., what is the transformed state of ‘becuase’, ‘pakage’, ‘ocur’, etc.?

Hi Jon,

It’s really simple - all you do is find a ‘direction’ that moves you from incorrect spellings to correct spellings by subtracting their respective word vectors. This direction is the transformation vector. To apply it and transform a misspelling you just add (or subtract depending on which way round you took the difference) this vector to the misspelled word vector.
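
In code it looks something like this (just a sketch, not my exact code; `vecs` stands in for a dict of loaded GloVe vectors, and the pair is one of the examples from this thread):

```python
import numpy as np

# Assume `vecs` maps each word to its 300-d GloVe vector (a numpy array).
# The spelling 'direction' is the difference between a correct spelling
# and one of its misspellings.
direction = vecs['reliable'] - vecs['relieable']

def correct_spelling(word, vecs, direction, k=10):
    """Shift a (possibly misspelled) word vector along the spelling
    direction and return the k nearest words by cosine similarity."""
    target = vecs[word] + direction
    target = target / np.linalg.norm(target)
    sims = {w: np.dot(v, target) / np.linalg.norm(v) for w, v in vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]
```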

Also, I think the reason why there are such distinct distributions of correctly spelled and misspelled words in the word vectors probably has something to do with spell checkers. As you say, there are probably good reasons why typos and misspelling co-occur, but I’m guessing that a big factor is that there is a huge corpus of texts that don’t contain any typos/spelling errors because they have been carefully edited and spell-checked. i.e. mistakes were made when the text was entered, but then corrected before it was published.

Out of interest, I tried running a similar analysis using Word2Vec (trained on Google News data) and also on the FastText pretrained vectors. Word2Vec just about worked, but has far fewer examples of misspellings. FastText, since it is trained on subword units, doesn’t work at all. For example, the nearest neighbours of ‘relyable’ are:

 'cuddlyable',
 'copyable',
 '@cuddlyable',
 'unflyable',
 'queryable', etc

Thanks Ed. That clarifies things well. It’s interesting that the typos are mathematically close, rather than graphemically close the way they are in the psych literature. An interesting distinction, but one to look at more closely a little later (for me).

I was able to independently confirm these results against the glove.42B.300d.txt dataset. @er214’s findings here are indeed amazing!

And then I thought, could we improve the transformation vector by averaging the transformation vectors for multiple misspelled words (not just “relieable”)? The answer is yes.

I was able to match roughly 600 of the 700 misspelled words I discovered in the corpus I’m using with those in the glove dataset. Following the same process @er214 outlined above, I created 600 transformation vectors (one based on each misspelled word), and then took their average for a final transformation vector.

Using this final transformation vector, I went back through my dictionary of misspelled/correctly spelled words and achieved significantly better results with the ensembled vector than with one based on a single misspelling (98% found vs. 89%, respectively).
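
The averaging step is trivial (a sketch; `pairs` stands in for my matched list of word pairs and `vecs` for the loaded GloVe vectors):

```python
import numpy as np

# `pairs` is the list of (misspelled, correct) tuples matched against
# the GloVe vocabulary; `vecs` maps words to their vectors.
deltas = [vecs[right] - vecs[wrong] for wrong, right in pairs]

# Averaging the ~600 per-pair directions gives a much less noisy
# transformation vector than any single pair.
transformation = np.mean(deltas, axis=0)
```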

@er214, if you end up writing that paper or blog post and are interested in using my experiments, shoot me a message and I’d be glad to help.

4 Likes

I’ve done the same thing as you, although I’ve created a huge training set by using the transformation vector in reverse to corrupt correct spellings. You can basically iterate: start with a few examples to create an initial vector, use this to create a larger set of misspellings, then use that to create a better transformation vector.
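
Sketched out (the helper names here are placeholders for things you’d write yourself, not my actual code):

```python
import numpy as np

# `nearest` is a k-NN lookup over the vocabulary; `plausible_misspelling`
# is an edit-distance filter; `seed_direction` comes from a handful of
# hand-picked example pairs. All are placeholders.
direction = seed_direction
for _ in range(3):
    pairs = []
    for word in common_correct_words:
        # Reverse the direction to jump from a correct spelling into
        # the region of vector space where its misspellings live.
        for cand in nearest(vecs[word] - direction, k=10):
            if plausible_misspelling(cand, word):
                pairs.append((cand, word))
    # Re-estimate a better direction from the larger set of pairs.
    direction = np.mean([vecs[r] - vecs[w] for w, r in pairs], axis=0)
```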

Another interesting thing you can do is to ‘score’ a word for how likely it is to be misspelled. You just take the dot product of the word vector with the transformation vector to see how far it points in the spelling direction. This gives a bit more of a clue as to why we have this pattern in the vectors. If you score all the vocab in this way there is a wide distribution of scores. At one end you find lots of business and government related words, and at the other you get very informal language such as ‘plz’ and ‘thnkx’ (and spelling mistakes + typos). So, I think the score is measuring something like the formality of the text: carefully edited and proofed texts (as you are likely to find consistently in business news stories) sit at one end of the distribution, which extends through to the informal language used in unedited comments in forums / emails / twitter.
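
The scoring itself is just a dot product per word (same assumed `vecs` and `direction` as in the sketches above):

```python
import numpy as np

# Score every word by how far its vector points along the spelling
# direction (which here points from misspellings towards corrections).
scores = {w: np.dot(v, direction) for w, v in vecs.items()}
ranked = sorted(scores, key=scores.get)

print(ranked[:20])   # informal end: 'plz', 'thnkx', typos, ...
print(ranked[-20:])  # formal end: business / government vocabulary
```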

It then starts to make sense why the model which built the vectors has behaved in this way: when it sees words like ‘plz’ and ‘thnx’ it will know it is more likely to encounter incorrectly spelled versions of words; and the reverse is true when it finds mention of ‘businesses’.

Regarding a full write up, I’m in the middle of it. I’ve got a job interview tomorrow, but will hopefully get something published - along with the code - on Wednesday.

2 Likes

One more thing @wgpubs - I’m using the glove.840B.300d vectors. Not sure how different they are to those you used, but as mentioned above I’ve also tried Word2Vec and FastText but you don’t get the same results.

Good luck with the interview! I’m guessing you’re already planning to, but if I were you I’d try to find a way to bring this project up.

As someone who’s interviewed a lot of data scientists/ML engineers, something like this is very impressive.

Agreed - my tweet of this post has more favorites than (IIRC) anything else I’ve posted!

Hi all,

Here’s a link to my write up.

It’s a bit long, but hopefully at least makes sense. I’ve also uploaded a jupyter notebook which walks through the code on github (https://github.com/er214/spellchecker). I’ve neither blogged nor put code on github before, so would welcome any comments / corrections / suggestions you might have.

The github repo also includes a pre-calculated dictionary mapping from 37,000+ common misspellings to (what I hope is) their corrections. I don’t have a good way of testing, so any feedback on what this looks like would also be great.

One thing to note is that it takes forever to calculate nearest neighbours using scipy. I’ve used faiss from facebook, which is great, but not the easiest to install.
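
For anyone who wants to try it, the basic faiss usage is along these lines (a sketch; `matrix`, `words` and `queries` are assumed to be your GloVe matrix, token list and query vectors):

```python
import numpy as np
import faiss  # pip install faiss-cpu

# `matrix` is the (n_words, 300) array of GloVe vectors and `words` the
# matching token list. Faiss wants float32; normalising rows makes the
# inner product equivalent to cosine similarity.
matrix = matrix.astype('float32')
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

index = faiss.IndexFlatIP(matrix.shape[1])  # exact inner-product search
index.add(matrix)

# One call searches a whole batch of query vectors (also float32 and
# row-normalised) and returns similarities plus vocabulary indices.
sims, idxs = index.search(queries, 10)
neighbours = [[words[j] for j in row] for row in idxs]
```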

Thanks
Ed

12 Likes

Hi, excellent! Is there a typo in your formula for Germany,

vector(Paris) - vector(France) + vector(Berlin)

Shouldn’t it be as follows?

vector(France) - vector(Paris) + vector(Berlin)

I think the blog post is great, congrats! I find the idea of the dot product really clever.

Here are a couple of suggestions, several easy resources that you could use as tools:

  1. dictionaries
  2. string distance

Concerning point 1: you assume that the n most frequent words are likely spelled correctly, which is a reasonable assumption; however, you could just use the subset of the most frequent words which appear in a dictionary.

Concerning point 2: when gathering misspellings, you assume that the 10 closest items to the vector sum are misspellings. However, you could easily treat those as candidates, and check whether they are actually misspellings by using some string distance between the candidate and the initial word (such as Levenshtein edit distance, or similar). That way you would also not need to limit yourself to 10, but could get a large number of candidates and only retain the most likely.
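
For example (a rough sketch, with `neighbours` and `word` standing in for the candidate list and the original word):

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Treat the nearest neighbours as candidates and keep only those within
# a couple of edits of the original word.
candidates = [c for c in neighbours if 0 < levenshtein(c, word) <= 2]
```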

Well spotted! So much for my careful proof read…

Thanks!

On the second point, that is roughly what I’ve done in the code: look at edit distance (but only across the 10 nearest neighbours) and reject anything that requires more than 2 edits. You also need to reject plurals and different endings for verbs, so I’ve written a bit of hacky code to get this done. I’ll add a few words to clarify in the blog.
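
My rough approximation of that kind of filter looks like this (illustrative suffix list, not the actual code):

```python
# Reject candidate pairs that are really just inflections of one
# another rather than genuine misspellings.
COMMON_ENDINGS = ('s', 'es', 'd', 'ed', 'ing')

def is_inflection(a: str, b: str) -> bool:
    """True if one word is just the other plus a common ending."""
    return any(a == b + suf or b == a + suf for suf in COMMON_ENDINGS)
```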

On the first point, I completely agree, using an existing word list would have been more sensible. With hindsight I’m not sure why I didn’t.

@er214 this is great! Minor things:

  • 2nd paragraph: know–>known
  • Add your twitter handle to Medium so that sharing it automatically at-mentions you.