NLP: Any libraries/dictionaries out there for fixing common spelling errors?

One more thing @wgpubs - I’m using the glove.840B.300d vectors. Not sure how different they are from those you used, but as mentioned above I’ve also tried Word2Vec and FastText, and you don’t get the same results.

Good luck with the interview! I’m guessing you’re already planning to, but if I were you I’d try to find a way to bring this project up.

As someone who’s interviewed a lot of data scientists/ML engineers, something like this is very impressive.

Agreed - my tweet of this post has more favorites than (IIRC) anything else I’ve posted!

Hi all,

Here’s a link to my write-up.

It’s a bit long, but hopefully at least makes sense. I’ve also uploaded a Jupyter notebook which walks through the code on GitHub (https://github.com/er214/spellchecker). I’ve neither blogged nor put code on GitHub before, so would welcome any comments / corrections / suggestions you might have.

The GitHub repo also includes a pre-calculated dictionary mapping from 37,000+ common misspellings to (what I hope are) their corrections. I don’t have a good way of testing it, so any feedback on how it looks would also be great.

One thing to note is that it takes forever to calculate nearest neighbours using scipy. I’ve used faiss from Facebook instead, which is great, but not the easiest to install.
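In case it’s useful, here’s roughly what the faiss version of the lookup looks like (a minimal sketch with placeholder data, not the exact code from the notebook):

```python
import numpy as np
import faiss

# Placeholder embedding matrix; in practice this is the (n_words, 300)
# float32 matrix of GloVe vectors.
vectors = np.random.rand(100000, 300).astype('float32')

index = faiss.IndexFlatL2(vectors.shape[1])  # exact (brute-force) L2 search
index.add(vectors)

# Distances and row indices of the 10 nearest neighbours of the first word.
distances, indices = index.search(vectors[:1], 10)
```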

Thanks
Ed


Hi, excellent! Is there a typo in your formula for Germany,

vector(Paris) - vector(France) + vector(Berlin)

Shouldn’t it be as follows?

vector(France) - vector(Paris) + vector(Berlin)

I think the blog post is great, congrats! I find the idea of the dot product really clever.

Here are a couple of suggestions, each based on an easy resource you could use as a tool:

  1. dictionaries
  2. string distance

Concerning point 1: you assume that the n most frequent words are likely spelled correctly, which is a reasonable assumption; however, you could instead use just the subset of the most frequent words that also appear in a dictionary.

Concerning point 2: when gathering misspellings, you assume that the 10 closest items to the vector sum are misspellings. However, you could just as easily treat those as candidates and check whether they are actually misspellings by using some string distance between the candidate and the initial word (such as Levenshtein edit distance, or similar). That way you would not need to limit yourself to 10: you could gather a large number of candidates and retain only the most likely.
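To make that concrete, here’s a rough sketch of the filtering step I have in mind (`plausible_misspellings` is just an illustrative name; in practice a library like python-Levenshtein would compute the distance much faster):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def plausible_misspellings(word, candidates, max_edits=2):
    """Keep only candidates within max_edits edits of the original word."""
    return [c for c in candidates if 0 < levenshtein(word, c) <= max_edits]
```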

Well spotted! So much for my careful proofreading…

Thanks!

On the second point, that is roughly what I’ve done in the code: look at edit distance (but only across the 10 neighbours) and reject anything that requires more than two edits. You also need to reject plurals and different verb endings, so I’ve written a bit of hacky code to get this done. I’ll add a few words to the blog to clarify.
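The hacky bit is not much more than a suffix check along these lines (the suffix list here is illustrative rather than exactly what’s in the repo):

```python
# Reject candidates that are just inflections of the query word:
# 'cats' is a different word from 'cat', not a misspelling of it.
COMMON_SUFFIXES = ('s', 'es', 'd', 'ed', 'ing', 'er')

def is_inflection(word, candidate):
    return any(candidate == word + suffix or word == candidate + suffix
               for suffix in COMMON_SUFFIXES)
```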

On the first point, I completely agree, using an existing word list would have been more sensible. With hindsight I’m not sure why I didn’t.

@er214 this is great! Minor things:

  • 2nd paragraph: know -> known
  • Add your Twitter handle to Medium so that sharing it automatically at-mentions you.

Thanks @jeremy!

Some more fun results.

Given that the misspelling phenomenon looks like it’s driven by differences in the distributions of the source data, I wondered whether you might be able to do other transformations. In particular, could you create a transformation vector to turn my turgid, businesslike writing into flowery, poetic prose? Someone responded to Jeremy’s tweet wondering if you could build a transformation to make you sound smarter. Well, in a roundabout sort of way, it looks like you can.

People who use long, tricky words tend to use them all the time, whereas people like me with a more limited vocabulary don’t often use ‘resplendent’ or ‘supercilious’ in our writing. If this rarefied vocabulary crops up frequently only in particular sources, then it will probably sit in a different bit of the vector space.

So, I’ve tried building a transformation (a pretentiousness transformation?) by averaging the differences between pairs of words that mean roughly the same thing, but where one of them is souped-up. For example,

('impoverished', 'poor'),
('diffident', 'shy'),
('congenial', 'agreeable'),
('droll', 'witty'),
('valiant', 'brave'),
('servile', 'dutiful')

And so on. The resulting transformation vector shares characteristic spikes in 2 dimensions with the spelling vector, but not in the way you might expect. The ‘souping-up’ direction is the same as the misspelling direction. I think this is because the ‘best spelled’ end of the spelling direction is based on business news sources, which tend to stick to fairly straightforward language.

To generate interesting ‘souped-up’ alternatives for a word, I’ve applied the transformation (after chopping off the spikes it shares with the misspelling vector, otherwise we start creating misspellings), taken the nearest neighbours, and then removed from this set the nearest neighbours of the un-transformed query word, which just gives you better results (there’s a rough sketch of the whole recipe after the example below). For example, if we use the word ‘give’:

  • the nearest neighbours are ‘giving’, ‘make’, ‘let’, ‘want’, ‘you’, ‘get’, ‘take’, ‘bring’, ‘need’;
  • the transformed neighbours are ‘furnish’, ‘sufficient’, ‘compel’, ‘provide’, ‘bestow’, ‘obtain’, ‘afforded’, ‘confer’, ‘proffer’
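Here’s the rough sketch of the recipe I promised above. `vecs`, `nearest_neighbours` and `spelling_vector` are stand-ins for whatever embedding dictionary, neighbour lookup and misspelling direction you already have; the real code differs in the details:

```python
import numpy as np

# Assumed to exist already:
#   vecs: dict mapping each word to its 300-d embedding (a numpy array)
#   nearest_neighbours(v, k): the k words whose vectors lie closest to v
#   spelling_vector: the misspelling direction found earlier

pairs = [('impoverished', 'poor'), ('diffident', 'shy'),
         ('congenial', 'agreeable'), ('droll', 'witty'),
         ('valiant', 'brave'), ('servile', 'dutiful')]

# Average the 'fancy minus plain' differences to get a souping-up direction.
soup_vector = np.mean([vecs[fancy] - vecs[plain] for fancy, plain in pairs],
                      axis=0)

# Chop off the two spike dimensions shared with the misspelling vector,
# otherwise we start generating misspellings rather than fancier words.
spike_dims = np.argsort(-np.abs(spelling_vector))[:2]
soup_vector[spike_dims] = 0.0

def souped_up(word, k=20):
    """Fancy alternatives: transformed neighbours minus plain neighbours."""
    plain = set(nearest_neighbours(vecs[word], k))
    fancy = nearest_neighbours(vecs[word] + soup_vector, k)
    return [w for w in fancy if w not in plain and w != word]
```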

It doesn’t always work, often giving you rubbish, but here are a few examples of the options this method gives you:

  • smelly: fusty, foul-smelling, odiferous, odorous, malodorous, fetid, putrid, reeking, mildewy
  • funny: comical, droll, hilariously, humourous, ironical, irreverent, sardonic, witty
  • fat: adipose, carbohydrate, cholesterol, corpulent, rotund, triglyceride
  • clever: adroit, artful, astute, cleverness, deft, erudite, shrewd
  • book: monograph, scholarly, treatise, tome, textbook, two-volume, nonfiction
  • loud: boisterous, cacophonous, clamorous, deafening, ear-splitting, resounded, sonorous, strident

Oh my, what you have done…

:wink:

Actually perhaps some good can come of this - what if you created a “pretentiousness score” for a document based on this? Then averaged that across articles from writers in (for instance) the NY Times?


Awesome post @er214. The idea to use an initial transformation vector to find more candidates and then revise the transformation vector looks like a cool example of semi-supervised learning.

I was wondering whether looking at the difference between the fastText and ULMFiT wiki word embeddings would show what changes the language model brings about. Any ideas?

This is very cool, @er214! Thanks a lot for going the distance and writing this up! :clap: I’ll link to your post in the next NLP newsletter! :slight_smile:


Great write-up.

An example of one of the many reasons I recommend the fast.ai forums as the best online DL community. Where else do original ideas like yours get highlighted and published this quickly, especially when they spawn from what can at best be considered a loosely related question? :slight_smile:

This is AWESOME work. I will cite it next week in a lightning talk on embeddings I’m giving. :slight_smile:

You might also be interested in this paper on formality in language:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.6280&rep=rep1&type=pdf

Among other things, it gives a formula for a formality score of a document:

F = (noun freq. + adjective freq. + preposition freq. + article freq.
     - pronoun freq. - verb freq. - adverb freq. - interjection freq. + 100) / 2

This might be useful to you someday in a loss function. (I know pretty good part-of-speech taggers exist, but I don’t know what the state of the art is.)
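If you want to play with it, here’s a quick sketch using NLTK’s off-the-shelf tagger. The mapping from Penn Treebank tags onto the paper’s categories is my own approximation (DT covers all determiners, not just articles), and I’m taking each frequency as a percentage of all tokens, which is how I believe the paper defines them:

```python
from collections import Counter
import nltk  # assumes the 'punkt' tokenizer and default tagger models are installed

# Rough mapping from Penn Treebank tags to the categories in the formula.
PLUS = {'NN', 'NNS', 'NNP', 'NNPS',                # nouns
        'JJ', 'JJR', 'JJS',                        # adjectives
        'IN',                                      # prepositions
        'DT'}                                      # articles, approximated by determiners
MINUS = {'PRP', 'PRP$', 'WP', 'WP$',               # pronouns
         'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ',  # verbs
         'RB', 'RBR', 'RBS',                       # adverbs
         'UH'}                                     # interjections

def formality(text):
    """Formality score F, with frequencies as percentages of all tokens."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    counts = Counter(tags)
    n = len(tags)
    plus = 100 * sum(counts[t] for t in PLUS) / n
    minus = 100 * sum(counts[t] for t in MINUS) / n
    return (plus - minus + 100) / 2
```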

BTW, my personal belief, based on careful thought and not backed by research, is that formality is directly related to how easy it is for the audience to ask clarifying questions. Academic papers? Hard to ask questions (especially pre-email). Face-to-face speech? Really easy.

The ease of asking clarifying questions can be influenced by social status or rules, not just logistics. An audience with The Queen? Socially awkward to ask her questions. A court appearance? Difficult to ask questions of the judge.


Editorial comments on the post:

  • “We are now in a better position to explain why all the spelling mistakes been shunted into a parallel space?” should be “had been”.
  • “Equally, when we start reading percentage, economics, Government, etc, it’s reasonable to guess the text came from a new source and therefore will be spelled correctly” should be “news source”.
  • I would suggest putting the discussion of the “because” errors in between the “because” errors and the “and” errors. Keep stuff together.

I realized one way you could use the formality score would be to make a better graph of “spelling correctness” vs. formality – your news vs. IMDB graph. While all (or almost all) of the news corpus will be formal, the IMDB corpus is going to be really variable. You will have some commentators carefully – pedantically, even – formulating responses with exquisite attention to precise language, and other readers just hacking sumthin’ 2gether, ykwim?

Check out nmslib. It’s a little easier to install and use, and if you’re not running on a GPU it’s faster than faiss.
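Basic usage looks something like this (placeholder data; check the nmslib docs for the index-time parameters):

```python
import numpy as np
import nmslib

# Placeholder embedding matrix; in practice the (n_words, 300) GloVe vectors.
vectors = np.random.rand(100000, 300).astype('float32')

index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(vectors)
index.createIndex({'post': 2})

# ids and distances of the 10 approximate nearest neighbours of one word.
ids, distances = index.knnQuery(vectors[0], k=10)
```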

You are really creative! :slight_smile:

Could this be used to create more understandable variants of inscrutable texts?

For example, a layman’s version of some legal document, or of slang-heavy lyrics?
