NLP: Any libraries/dictionaries out there for fixing common spelling errors?

@er214, I’m unfortunately not aware of a corpus that people have been using for benchmarking recently. Grammatical error correction is a closely related task that has received more attention lately, but it focuses (as the name suggests) on grammatical errors rather than misspellings.

@urmas.pitsi, modifying an existing vector space to better represent misspellings would be a cool thing to try!

I think at the moment the impression in the community is that we don’t really need to deal with misspellings as neural models are somewhat robust to them. A counterpoint might be a recent ICLR 2018 paper that shows that NMT in particular is susceptible to noisy data such as misspellings (https://openreview.net/forum?id=BJ8vJebC-). An evaluation of how susceptible typical word-based and character-based models are to misspellings would therefore be interesting and could serve as a motivation for this kind of work.

My guess is that it wouldn’t make much difference. But it’s definitely worth trying - it might give a valuable boost in the context of a Kaggle competition!

The transformation vector itself is quite interesting. It’s close to zero on all but about three dimensions. That is, the area, or perhaps that should be direction, within the vector space for spelling mistakes / informal language is very specific. As such, I would guess that neural nets learn to ignore these dimensions.

Ultimately, preprocessing to correct the spelling mistakes would only make a difference if these dimensions of the word vectors are also used to store other useful bits of information.
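
A quick way to check that sparsity yourself, as a rough sketch (here I’m assuming `vecs` is a dict mapping words to their 300-d GloVe vectors as numpy arrays, and the word pair is just an illustrative example):

```python
import numpy as np

# `vecs` is assumed: a dict of word -> 300-d numpy array of GloVe vectors.
# The pair below is only one example of a misspelled/correct pair.
t = vecs['because'] - vecs['becuase']

# Indices and values of the largest-magnitude dimensions.
dominant = np.argsort(np.abs(t))[::-1][:5]
print(dominant, t[dominant])

# How many dimensions are meaningfully non-zero? (0.05 is an arbitrary cutoff.)
print(int((np.abs(t) > 0.05).sum()))
```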

I was also thinking along the lines of seq2seq models.
We train a model on sentences made up of correctly spelled words; the task of the seq2seq model here is to predict the next word vector.
Then, when observing a new/test sentence, we can look at the suggested vector for the following word. If the suggested vector is close to that of the current word, we go ahead; otherwise, it could be an indicator of a spelling error. We can then apply the transformation mentioned above.
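
As a rough sketch of that check (`predict_next_vec` is a hypothetical stand-in for whatever trained seq2seq model produces the expected next vector, and the threshold is arbitrary):

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# `predict_next_vec` is hypothetical: a trained seq2seq model that, given the
# word vectors of the sentence so far, returns the expected next word vector.
def flag_suspects(sentence_vecs, predict_next_vec, threshold=0.4):
    suspects = []
    for i in range(1, len(sentence_vecs)):
        predicted = predict_next_vec(sentence_vecs[:i])
        if cosine(predicted, sentence_vecs[i]) < threshold:
            suspects.append(i)  # position of a possible spelling error
    return suspects
```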

One concern I have with this model, though: what if the model’s understanding of the context differs from the author’s intended context, leading us to think that a correctly spelled word is misspelled?

One way to reduce the false positives, I think, would be a simple edit-distance comparison with nearby words after the transformation. If any candidate passes a threshold we can accept the suggested word; otherwise, we keep the current spelling.
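
A sketch of that filter, using the standard library’s `difflib` as a cheap stand-in for edit distance (the threshold is arbitrary):

```python
from difflib import SequenceMatcher

def accept_suggestion(original, candidates, threshold=0.75):
    # Accept the first suggested word that is orthographically close to
    # the original; otherwise keep the current spelling.
    for cand in candidates:
        if SequenceMatcher(None, original, cand).ratio() >= threshold:
            return cand
    return original
```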

Very interesting. Are you using some kind of regularization in the training?

No - there is no training. All I’m doing is taking the average difference between two word vectors. It’s the same idea as the word vector maths introduced in the original Mikolov word2vec paper. In that paper they give the example that vector(‘King’) - vector(‘Man’) + vector(‘Woman’) results in a vector that is closest (in cosine distance) to the vector for ‘Queen’.

You could re-interpret the equation as consisting of a royalty transformation vector = vector(‘King’) - vector(‘Man’), which is then applied to the word ‘Woman’. Presumably you could also apply it to ‘boy’ to get ‘prince’, or ‘girl’ to get ‘princess’.

All I have done is replace ‘Man’ and ‘King’ with ‘misspelled word’ and ‘correctly spelled word’.
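
If anyone wants to play along, here’s a minimal sketch of the whole thing (the file path and helper names are my own, and brute-force nearest-neighbour search over the full vocabulary is slow, so treat this as illustrative):

```python
import numpy as np

def load_glove(path):
    # GloVe's plain-text format: "word v1 v2 ... v300" per line. Splitting
    # from the right copes with the few tokens that contain spaces.
    vecs = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word = ' '.join(parts[:-300])
            vecs[word] = np.asarray(parts[-300:], dtype=np.float32)
    return vecs

def nearest(vecs, query, exclude=(), k=5):
    # Brute-force cosine nearest neighbours.
    qn = query / np.linalg.norm(query)
    sims = {w: (v / np.linalg.norm(v)) @ qn
            for w, v in vecs.items() if w not in exclude}
    return sorted(sims, key=sims.get, reverse=True)[:k]

vecs = load_glove('glove.840B.300d.txt')

# The classic analogy: king - man + woman should land near 'queen'.
royalty = vecs['king'] - vecs['man']
print(nearest(vecs, vecs['woman'] + royalty, exclude={'king', 'man', 'woman'}))
```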

A few other interesting points I’ve found after messing around with things a bit more:

  1. You can also find examples of incorrect spellings by applying the transformation in reverse (see the sketch after this list). This allows you to build a broader set of correct/incorrect spelling pairs with which to build a more accurate transformation vector.
  2. Assuming my earlier thoughts are correct that this is caused by the existence of spell-checking software, you might not find the same pattern in all the pre-trained word vectors. I am using the GloVe 840B vectors, which are trained on Common Crawl data, i.e. a very wide distribution of stuff found online, including ‘natural’, non-spell-checked language. Word2Vec is trained on a Google News dataset, so it is presumably much more likely to have been spell-checked, and therefore there might not be such a clear distinction between correct and incorrect spellings.
  3. The GloVe vectors are cased (i.e. they include separate vectors for words IN CAPITALS). You can create similar transformation vectors to either capitalise a word, or to capitalise the initial letter. Perhaps, if you could find enough examples, there might be a transformation for abbreviations to account for the ‘because’=>‘b/c’ case you mentioned in an earlier comment.
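
On point 1, the reverse mining might look something like this (a sketch reusing `vecs` and `nearest` from the snippet above; the seed pairs and target words are just examples):

```python
import numpy as np

# Average the differences over a few known pairs for an initial direction.
seed_pairs = [('becuase', 'because'), ('pakage', 'package'), ('ocur', 'occur')]
t = np.mean([vecs[right] - vecs[wrong] for wrong, right in seed_pairs], axis=0)

# Apply the transformation in reverse to mine candidate misspellings of
# correctly spelled words, growing the set of training pairs.
mined = {}
for word in ['reliable', 'necessary', 'definitely']:
    for cand in nearest(vecs, vecs[word] - t, exclude={word}, k=3):
        mined[cand] = word
print(mined)
```
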
Ah I see! I had not seen that you were using the GloVe word vectors.

As far as I can tell from their paper, they are not using regularization to train the embedding, so I share your surprise about the sparseness of the “correctly spelled word” vector.

Interesting, I am of the completely opposite opinion. My feeling is that this could be a fundamental flaw in embeddings, and that fixing it could help to improve learning substantially. It is like how the human mind works when seeing spelling errors: first we map a misspelled word onto its correct semantic representation, and then we connect different semantic clusters into a meaningful sentence or paragraph. Having parallel semantic clusters just adds extra confusion and parameters to the model. I just can’t see how this kind of representation would give an edge over a more efficient one. Of course, I am not a qualified person to make the argument here; just trying to think about what seems to make more sense.

Yo @er214, which GloVe embedding are you looking at?

I just pulled the glove.6B.300d but I don’t see the misspellings you are seeing.

It was Common Crawl, so it might be the 42B-token embedding.

This has really interesting implications for chatbots and fake profiles. If you can probabilistically apply this vector to generated text you can easily create a more varied user and one that’s more believably human.
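
Something like this, perhaps (purely illustrative; `vecs`, the transformation vector `t`, and `nearest` are as in the sketches earlier in the thread):

```python
import random

def humanize(words, vecs, t, nearest, p=0.1):
    # With probability p, nudge a word toward the 'misspelled' region of
    # the space and swap in its nearest neighbour there.
    out = []
    for w in words:
        if w in vecs and random.random() < p:
            out.append(nearest(vecs, vecs[w] - t, exclude={w}, k=1)[0])
        else:
            out.append(w)
    return out
```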

Damn, you’re right.

Or turn it around: you could use this for detection by looking for sentence construction that is a little too good. I could see this even being used in a captcha-like manner.

It actually shouldn’t be a surprise that typos co-occur. This is primarily because of a strong link between vowel/consonant identities and the simple transformations that underlie typos (e.g. substitutions, transpositions, etc.). There is quite a lot of literature around Graphemic Buffer Disorder that explains this in detail, but even those without an acquired disorder exhibit similar behaviour due to fatigue or attention lapses.
How are you transforming the word in order to then find its neighbours? I.e. what is the transformed state of ‘becuase’, ‘pakage’, ‘ocur’, etc.?

Hi Jon,

It’s really simple - all you do is find a ‘direction’ that moves you from incorrect spellings to correct spellings by subtracting their respective word vectors. This direction is the transformation vector. To apply it and transform a misspelling, you just add this vector to the misspelled word’s vector (or subtract it, depending on which way round you took the difference).
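
In code it really is just a couple of lines (again reusing `vecs` and `nearest` from the earlier sketch; the word pairs are examples):

```python
# Direction from misspelled to correct, taken from one example pair...
t = vecs['because'] - vecs['becuase']

# ...then added to another misspelled word's vector to correct it.
print(nearest(vecs, vecs['recieve'] + t, exclude={'recieve'}))
# hopefully 'receive' appears near the top
```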

Also, I think the reason why there are such distinct distributions of correctly spelled and misspelled words in the word vectors probably has something to do with spell checkers. As you say, there are probably good reasons why typos and misspellings co-occur, but I’m guessing that a big factor is that there is a huge corpus of texts that don’t contain any typos/spelling errors because they have been carefully edited and spell-checked, i.e. mistakes were made when the text was entered, but were corrected before it was published.

Out of interest, I tried to run a similar analysis using Word2Vec (trained on Google News data), and also on the FastText pretrained vectors. Word2Vec just about worked, but has far fewer examples of misspellings. FastText, since it is trained on subword units, doesn’t work at all. For example, the nearest neighbours of ‘relyable’ are:

 'cuddlyable',
 'copyable',
 '@cuddlyable',
 'unflyable',
 'queryable', etc

Thanks Ed. That clarifies things well. It’s interesting that the typos are mathematically close, and not graphemically close the way they are in the psych literature. An interesting distinction, but one to look at more closely a little later (for me).

I was able to independently confirm these results myself against the glove.42B.300d.txt dataset. @er214’s findings here are indeed amazing!

And then I thought, could we improve the transformation vector by averaging the transformation vectors for multiple misspelled words (not just “relieable”)? The answer is yes.

I was able to match roughly 600 of the 700 misspelled words I discovered in the corpus I’m using with those in the GloVe dataset. Following the same process @er214 outlined above, I created 600 transformation vectors (one based on each misspelled word), and then took their average as a final transformation vector.

Using this final transformation vector, I went back through my dictionary of misspelled/correctly spelled words and achieved significantly better results using my ensembled vector as compared to that based on a single misspelling (98% found vs. 89% respectively).
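
For reference, the averaging and the check are only a few lines (a sketch: `pairs` stands for the ~600 matched misspelled/correct word pairs, and `vecs`/`nearest` are as in the earlier sketches):

```python
import numpy as np

# Average the per-pair differences into a single transformation vector.
t_avg = np.mean([vecs[right] - vecs[wrong] for wrong, right in pairs], axis=0)

# Count how many misspellings the averaged vector now corrects.
hits = sum(nearest(vecs, vecs[wrong] + t_avg, exclude={wrong}, k=1)[0] == right
           for wrong, right in pairs)
print(hits / len(pairs))
```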

@er214, if you end up writing that paper or blog post and are interested in using my experiments, shoot me a message and I’d be glad to help.

I’ve done the same thing as you, although I’ve created a huge training set by using the transformation vector in reverse to corrupt correct spellings. You can basically iterate: start with a few examples to create an initial vector, use this to create a larger set of misspellings, then use that set to create a better transformation vector.

Another interesting thing you can do is to ‘score’ a word for how likely it is to be misspelled. You just take the dot product of the word vector with the transformation vector to see how far it points in the spelling direction. This gives a bit more of a clue as to why we have this pattern in the vectors. If you score all the vocab in this way there is a wide distribution of scores. At one end you find lots of business and government related words, and at the other you get very informal language such as ‘plz’ and ‘thnkx’ (and spelling mistakes + typos). So, I think the score is measuring something like the formality of the text: carefully edited and proofed texts (as you are likely to find consistently in business news stories) sit at one end of the distribution, which extends through to the informal language used in unedited comments in forums / emails / twitter.
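
The scoring itself is a one-liner per word (a sketch, with `t` the transformation vector and `vecs` as before; which end of the ranking is the ‘formal’ one depends on the sign convention you used for `t`):

```python
import numpy as np

t_unit = t / np.linalg.norm(t)

# Score every word by how far its vector points along the spelling direction.
scores = {w: float(v @ t_unit) for w, v in vecs.items()}
ranked = sorted(scores, key=scores.get)

print(ranked[:10])   # one end: informal language, typos, misspellings
print(ranked[-10:])  # other end: formal, carefully edited vocabulary
```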

It then starts to make sense why the model which built the vectors has behaved in this way: when it sees words like ‘plz’ and ‘thnx’, it will know it is more likely to encounter the incorrectly spelled versions of words; and the reverse is true when it finds mention of ‘businesses’.

Regarding a full write up, I’m in the middle of it. I’ve got a job interview tomorrow, but will hopefully get something published - along with the code - on Wednesday.

One more thing @wgpubs - I’m using the glove.840B.300d vectors. I’m not sure how different they are from those you used, but as mentioned above, I’ve also tried Word2Vec and FastText and you don’t get the same results.

Good luck with the interview! I’m guessing you’re already planning to, but if I were you I’d try to find a way to bring this project up.

As someone who’s interviewed a lot of data scientists/ML engineers, I can say that something like this is very impressive.

Agreed - my tweet of this post has more favorites than (IIRC) anything else I’ve posted!

Hi all,

Here’s a link to my write up.

It’s a bit long, but hopefully it at least makes sense. I’ve also uploaded a Jupyter notebook that walks through the code on GitHub (https://github.com/er214/spellchecker). I’ve neither blogged nor put code on GitHub before, so I would welcome any comments / corrections / suggestions you might have.

The GitHub repo also includes a pre-calculated dictionary mapping from 37,000+ common misspellings to (what I hope are) their corrections. I don’t have a good way of testing, so any feedback on what this looks like would also be great.

One thing to note is that it takes forever to calculate nearest neighbours using scipy. I’ve used faiss from Facebook instead, which is great, but not the easiest to install.
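
For anyone hitting the same wall, a minimal faiss version of the lookup might look like this (a sketch; the flat inner-product index is my choice here, not necessarily what’s in the notebook, and `vecs`/`t` are as in the earlier sketches):

```python
import faiss
import numpy as np

words = list(vecs.keys())
xb = np.stack([vecs[w] for w in words]).astype('float32')
faiss.normalize_L2(xb)                   # normalise so inner product = cosine
index = faiss.IndexFlatIP(xb.shape[1])
index.add(xb)

def nearest_faiss(query, k=5):
    q = np.asarray(query, dtype='float32').reshape(1, -1).copy()
    faiss.normalize_L2(q)
    _, idx = index.search(q, k)
    return [words[i] for i in idx[0]]

print(nearest_faiss(vecs['becuase'] + t))  # t: the transformation vector
```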

Thanks
Ed
