NLP: Any libraries/dictionaries out there for fixing common spelling errors?

This has really interesting implications for chatbots and fake profiles. If you can probabilistically apply this vector to generated text, you can easily create a more varied user, one that’s more believably human.

2 Likes

Damn, you’re right.

Or turn it around: you could use this for detection by looking for sentence construction that is a little too good. I could see this even being used in a CAPTCHA-like manner.

It actually shouldn’t be a surprise that typos co-occur. This is primarily because of a strong link between vowel/consonant identities and the simple transformations that produce typos (e.g. substitutions, transpositions, etc.). There is quite a lot of literature around Graphemic Buffer Disorder that explains this in detail, but even people without an acquired disorder exhibit similar behaviour due to fatigue or attention lapses.
How are you transforming the word in order to then find its neighbours? I.e. what is the transformed state of ‘becuase’, ‘pakage’, ‘ocur’, etc.?

Hi Jon,

It’s really simple - all you do is find a ‘direction’ that moves you from incorrect spellings to correct spellings by subtracting their respective word vectors. This direction is the transformation vector. To apply it and transform a misspelling you just add (or subtract depending on which way round you took the difference) this vector to the misspelled word vector.
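
To make that concrete, here’s a minimal numpy sketch of the idea, assuming the GloVe vectors have been loaded into a plain {word: vector} dict (the loader, variable names and example words are just for illustration, not polished code):

```python
import numpy as np

def load_glove(path='glove.840B.300d.txt'):
    """Load GloVe vectors into a {word: np.array} dict."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            # a few GloVe tokens contain spaces, so take the last 300 fields as the vector
            vectors[' '.join(parts[:-300])] = np.array(parts[-300:], dtype=np.float32)
    return vectors

vec = load_glove()
words = list(vec)
mat = np.stack([vec[w] for w in words])

# The 'spelling direction': correct vector minus misspelled vector.
direction = vec['reliable'] - vec['relieable']

# To correct another misspelling, add the direction and look at nearest neighbours.
query = vec['becuase'] + direction
sims = mat @ query / (np.linalg.norm(mat, axis=1) * np.linalg.norm(query))
print([words[i] for i in np.argsort(-sims)[:10]])  # 'because' should appear near the top
```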

Also, I think the reason why there are such distinct distributions of correctly spelled and misspelled words in the word vectors probably has something to do with spell checkers. As you say, there are probably good reasons why typos and misspelling co-occur, but I’m guessing that a big factor is that there is a huge corpus of texts that don’t contain any typos/spelling errors because they have been carefully edited and spell-checked. i.e. mistakes were made when the text was entered, but then corrected before it was published.

Out of interest, I tried to run a similar analysis using Word2Vec (trained on Google News data), and also on the FastText pretrained vectors. Word2Vec just about worked, but has far fewer examples of misspellings. FastText, since it is trained on subword units, doesn’t work at all. For example, the nearest neighbours of ‘relyable’ are:

 'cuddlyable',
 'copyable',
 '@cuddlyable',
 'unflyable',
 'queryable', etc

Thanks Ed. That clarifies things well. It’s interesting that the typos are mathematically close, and not graphemically close the way they are in the psych literature. An interesting distinction, but one to look at more closely a little later (for me).

I was able to independently confirm these results against the glove.42B.300d.txt dataset. @er214’s findings here are indeed amazing!

And then I thought, could we improve the transformation vector by averaging the transformation vectors for multiple misspelled words (not just “relieable”)? The answer is yes.

I was able to match roughly 600 of the 700 misspelled words I discovered in the corpus I’m using with those in the glove dataset. Following the same process @er214 outlined above, I created 600 transformation vectors (one based on each misspelled word), and then took their average for a final transformation vector.

Using this final transformation vector, I went back through my dictionary of misspelled/correctly spelled words and achieved significantly better results using my ensembled vector as compared to that based on a single misspelling (98% found vs. 89% respectively).
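
For anyone following along, the averaging step is roughly this (a sketch only, reusing the hypothetical `vec` dict of GloVe vectors from the earlier snippet; `pairs` stands in for my ~600 matched misspelled/correct pairs):

```python
import numpy as np

# (misspelled, correct) pairs; in practice around 600 of them
pairs = [('becuase', 'because'), ('pakage', 'package'), ('ocur', 'occur')]

# One transformation vector per pair, averaged into a single, more robust direction.
directions = [vec[right] - vec[wrong] for wrong, right in pairs
              if wrong in vec and right in vec]
avg_direction = np.mean(directions, axis=0)
```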

@er214, if you end up writing that paper or blog post and are interested in using my experiments, shoot me a message and I’d be glad to help.

4 Likes

I’ve done the same thing as you, although I’ve created a huge training set by using the transformation vector in reverse to corrupt correct spellings. You can basically iterate: start with a few examples to create an initial vector, use this to create a larger set of misspellings, then use this to create a better transformation vector.
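
Roughly, the corruption step looks like this (just a sketch; `vec`, `words`, `mat` and `avg_direction` are the hypothetical objects from the earlier snippets):

```python
import numpy as np

def corrupt(word, direction, k=10):
    """Move a correctly spelled word *against* the spelling direction and
    return nearby vocabulary items, which tend to be misspellings of it."""
    query = vec[word] - direction
    sims = mat @ query / (np.linalg.norm(mat, axis=1) * np.linalg.norm(query))
    return [words[i] for i in np.argsort(-sims)[:k] if words[i] != word]

# Iterate: seed pairs -> direction -> more (misspelled, correct) pairs -> better direction.
new_pairs = [(m, w) for w in ('because', 'package', 'occur')
             for m in corrupt(w, avg_direction)]
```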

Another interesting thing you can do is to ‘score’ a word for how likely it is to be misspelled. You just take the dot product of the word vector with the transformation vector to see how far it points in the spelling direction. This gives a bit more of a clue as to why we have this pattern in the vectors. If you score all the vocab in this way there is a wide distribution of scores. At one end you find lots of business and government related words, and at the other you get very informal language such as ‘plz’ and ‘thnkx’ (and spelling mistakes + typos). So, I think the score is measuring something like the formality of the text: carefully edited and proofed texts (as you are likely to find consistently in business news stories) sit at one end of the distribution, which extends through to the informal language used in unedited comments in forums / emails / twitter.
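
As a rough sketch (assuming a `vec` dict and a transformation vector `direction` like the ones above), the score is just a projection onto the normalised spelling direction:

```python
import numpy as np

unit = direction / np.linalg.norm(direction)
score = {w: float(v @ unit) for w, v in vec.items()}

ranked = sorted(score, key=score.get)
print(ranked[:20])    # informal end: 'plz', 'thnkx', typos, ...
print(ranked[-20:])   # formal end: business / government vocabulary
```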

It then starts to make sense why the model which built the vectors has behaved in this way: when it sees words like ‘plz’ and ‘thnx’ it will know it is more likely to encounter the incorrectly spelled versions of words; and the reverse is true when it finds mention of ‘businesses’.

Regarding a full write up, I’m in the middle of it. I’ve got a job interview tomorrow, but will hopefully get something published - along with the code - on Wednesday.

2 Likes

One more thing @wgpubs - I’m using the glove.840B.300d vectors. Not sure how different they are to those you used, but as mentioned above I’ve also tried Word2Vec and FastText but you don’t get the same results.

Good luck with the interview! I’m guessing you’re already planning to, but if I were you I’d try to find a way to bring this project up.

As someone who’s interviewed a lot of data scientists/ML engineers, something like this is very impressive.

Agreed - my tweet of this post has more favorites than (IIRC) anything else I’ve posted!

Hi all,

Here’s a link to my write up.

It’s a bit long, but hopefully at least makes sense. I’ve also uploaded a Jupyter notebook which walks through the code on GitHub (https://github.com/er214/spellchecker). I’ve neither blogged nor put code on GitHub before, so would welcome any comments / corrections / suggestions you might have.

The GitHub repo also includes a pre-calculated dictionary mapping from 37,000+ common misspellings to (what I hope are) their corrections. I don’t have a good way of testing, so any feedback on what this looks like would also be great.

One thing to note is that it takes forever to calculate nearest neighbours using scipy. I’ve used faiss from facebook, which is great, but not the easiest to install.
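
For anyone who wants to avoid the scipy slowness, a minimal faiss sketch (exact L2 search over the hypothetical `mat`/`words` arrays from the earlier snippets) looks something like this:

```python
import numpy as np
import faiss

index = faiss.IndexFlatL2(mat.shape[1])       # 300-d GloVe vectors, exact search
index.add(mat.astype(np.float32))

query = (vec['becuase'] + direction).astype(np.float32).reshape(1, -1)
distances, ids = index.search(query, 10)      # 10 nearest neighbours
print([words[i] for i in ids[0]])
```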

Thanks
Ed

12 Likes

Hi, excellent! Is there a typo in your formula for Germany,

vector(Paris) - vector(France) + vector(Berlin)

Shouldn’t it be as follows?

vector(France) - vector(Paris) + vector(Berlin)

I think the blog post is great, congrats! I find the idea of the dot product really clever.

Here are a couple of suggestions, based on some easy resources you could use as tools:

  1. dictionaries
  2. string distance

Concerning point 1: you assume that the n most frequent words are likely spelled correctly, which is a reasonable assumption; however, you could just use the subset of the most frequent words which appear in a dictionary.

Concerning point 2: when gathering misspellings, you assume that the 10 closest items to the vector sum are misspellings. However, you could easily treat those as candidates, and check whether they are actually misspellings by using some string distance between the candidate and the initial word (such as Levenshtein edit distance, or similar). That way you would also not need to limit yourself to 10, but could get a large number of candidates and only retain the most likely.
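
For example, a small sketch of that filtering step (function names are just placeholders):

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def likely_misspellings(word, candidates, max_edits=2):
    """Keep only candidates within a small edit distance of the original word."""
    return [c for c in candidates if 0 < levenshtein(word, c) <= max_edits]
```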

Well spotted! So much for my careful proof read…

Thanks!

On the second point, that is roughly what I’ve done in the code: look at edit distance (but only for the 10 nearest neighbours) and reject anything that requires more than 2 edits. You also need to reject plurals and different endings for verbs, so I’ve written a bit of hacky code to get this done. I’ll add a few words to clarify in the blog.
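
The plural / verb-ending rejection is nothing clever; it is roughly of this shape (a simplified sketch, not the actual code):

```python
# Reject candidates that are just an inflected form of the query word.
SUFFIXES = ('s', 'es', 'd', 'ed', 'ing')

def is_inflection(word, candidate):
    return any(candidate == word + s or word == candidate + s for s in SUFFIXES)
```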

On the first point, I completely agree, using an existing word list would have been more sensible. With hindsight I’m not sure why I didn’t.

@er214 this is great! Minor things:

  • 2nd paragraph: know–>known
  • Add your twitter handle to Medium so that sharing it automatically at-mentions you.

Thanks @jeremy!

Some more fun results.

Given that the misspellings phenomenon looks like it’s driven by differences in the distributions in the source data, I wondered if you might be able to do other transformations. In particular, could you create a transformation vector to turn my turgid, businesslike writing into flowery, poetic prose? Someone responded to Jeremy’s tweet wondering if you could build a transformation to make you sound smarter. Well, in a roundabout sort of way, it looks like you can.

Those who use long, tricky words tend always to use them, whereas people like me with a more limited vocabulary don’t often use ‘resplendent’ or ‘supercilious’ in our writing. If this rarefied vocabulary crops up frequently only in particular sources, then it will probably sit in a different bit of the vector space.

So, I’ve tried building a transformation (a pretentiousness transformation?) by taking the average difference between pairs of words that mean roughly the same thing, but where one of them is souped up. For example,

('impoverished', 'poor'),
('diffident', 'shy'),
('congenial', 'agreeable'),
('droll', 'witty'),
('valiant', 'brave'),
('servile', 'dutiful')

And so on. The resulting transformation vector shares characteristic spikes in 2 dimensions with the spelling vector, but not in the way you might expect. The ‘souping-up’ direction is the same as the misspelling direction. I think this is because the ‘best spelled’ end of the spelling direction is based on business news sources, which tend to stick to fairly straightforward language.

To generate interesting ‘souped-up’ alternatives for a word, I’ve applied the transformation (after chopping off the spikes it shares with the misspelling vector, otherwise we start creating misspellings), taken the nearest neighbours, and then removed from this set the nearest neighbours of the un-transformed query word. This just gives you better results. For example, if we use the word ‘give’,

  • the nearest neighbours are ‘giving’, ‘make’, ‘let’, ‘want’, ‘you’, ‘get’, ‘take’, ‘bring’, ‘need’;
  • the transformed neighbours are ‘furnish’, ‘sufficient’, ‘compel’, ‘provide’, ‘bestow’, ‘obtain’, ‘afforded’, ‘confer’, ‘proffer’.
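
Putting the pieces together, a sketch of the whole lookup might look like this (reusing the hypothetical `vec`, `words` and `mat` objects from the earlier snippets; the spike-chopping step is omitted for brevity):

```python
import numpy as np

pairs = [('impoverished', 'poor'), ('diffident', 'shy'), ('congenial', 'agreeable'),
         ('droll', 'witty'), ('valiant', 'brave'), ('servile', 'dutiful')]
fancy_direction = np.mean([vec[fancy] - vec[plain] for fancy, plain in pairs], axis=0)

def neighbours(query, k=10):
    sims = mat @ query / (np.linalg.norm(mat, axis=1) * np.linalg.norm(query))
    return {words[i] for i in np.argsort(-sims)[:k]}

def souped_up(word, k=10):
    # transformed neighbours, minus the plain neighbours of the original word
    return neighbours(vec[word] + fancy_direction, k) - neighbours(vec[word], k)

print(souped_up('give'))
```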

It doesn’t always work, often giving you rubbish, but here are a few examples of the options this method gives you:

| query | fancy sounding alternatives |
| --- | --- |
| smelly | fusty, foul-smelling, odiferous, odorous, malodorous, fetid, putrid, reeking, mildewy |
| funny | comical, droll, hilariously, humourous, ironical, irreverent, sardonic, witty |
| fat | adipose, carbohydrate, cholesterol, corpulent, rotund, triglyceride |
| clever | adroit, artful, astute, cleverness, deft, erudite, shrewd |
| book | monograph, scholarly, treatise, tome, textbook, two-volume, nonfiction |
| loud | boisterous, cacophonous, clamorous, deafening, ear-splitting, resounded, sonorous, strident |

10 Likes

Oh my, what you have done…

:wink:

Actually perhaps some good can come of this - what if you created a “pretentiousness score” for a document based on this? Then averaged that across articles from writers in (for instance) the NY Times?
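
A back-of-the-envelope version of that score could be as simple as this sketch (purely hypothetical, reusing the `vec` dict and `fancy_direction` from the posts above):

```python
import numpy as np

def document_score(text, vec, direction):
    """Average projection of a document's words onto the 'pretentiousness' direction."""
    unit = direction / np.linalg.norm(direction)
    tokens = [t for t in text.lower().split() if t in vec]
    return float(np.mean([vec[t] @ unit for t in tokens])) if tokens else 0.0
```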

2 Likes

Awesome post @er214. The idea to use an initial transformation vector to find more candidates and then revise the transformation vector looks like a cool example of semi-supervised learning.

I was wondering if looking at the difference between fastText and ULMFiT’s wiki word embeddings would show what changes the language model brings about. Any ideas?