NLP: Any libraries/dictionaries out there for fixing common spelling errors?

Hi, excellent post! Is there a typo in your formula for Germany:

vector(Paris) - vector(France) + vector(Berlin)

Shouldn’t it be as follows?

vector(France) - vector(Paris) + vector(Berlin)

I think the blog post is great, congrats! I find the idea of the dot product really clever.

Here are a couple of suggestions, both easy resources that you could use as tools:

  1. dictionaries
  2. string distance

Concerning point 1: you assume that the n most frequent words are likely spelled correctly, which is a reasonable assumption; however, you could just use the subset of the most frequent words that also appear in a dictionary.
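For instance, something along these lines (just a sketch; freq_ranked_vocab and the word-list path are placeholder names, not anything from your code):

```python
# Sketch of suggestion 1: keep only frequent words that also appear in a dictionary.
def dictionary_filtered_vocab(freq_ranked_vocab, n=50000,
                              wordlist_path="/usr/share/dict/words"):
    with open(wordlist_path) as f:
        dictionary = {line.strip().lower() for line in f}
    # take the n most frequent words, then drop anything not in the dictionary
    return [w for w in freq_ranked_vocab[:n] if w.lower() in dictionary]
```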

Concerning point 2: when gathering misspellings, you assume that the 10 closest items to the vector sum are misspellings. However, you could easily treat those as candidates, and check whether they are actually misspellings by using some string distance between the candidate and the initial word (such as Levenshtein edit distance, or similar). That way you would also not need to limit yourself to 10, but could get a large number of candidates and only retain the most likely.
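For example (again just a sketch, with made-up names; candidates would be whatever neighbours you pull back from the vector sum):

```python
# Sketch of suggestion 2: treat nearest neighbours as candidate misspellings and
# keep only those within a small edit distance of the original word.
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def likely_misspellings(word, candidates, max_edits=2):
    # candidates could be the k nearest neighbours for any k, not just 10
    return [c for c in candidates if 0 < levenshtein(word, c) <= max_edits]
```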

Well spotted! So much for my careful proofreading…

Thanks!

On the second point, that is roughly what I’ve done in the code: look at edit distance (but only from 10 neighbours) and reject anything that requires more than 2 edits. You also need to reject plurals and different endings for verbs, so I’ve written a bit of hacky code to get this done. I’ll add a few words to clarify in the blog.
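Roughly, the inflection filter is in this spirit (a simplified sketch rather than the exact code from the repo):

```python
# Crude filter: drop candidates that look like a plural or a different verb
# ending of the query word rather than a genuine misspelling.
COMMON_SUFFIXES = ("s", "es", "d", "ed", "ing", "er")

def is_inflected_variant(word, candidate):
    for a, b in ((word, candidate), (candidate, word)):
        if any(a == b + suffix for suffix in COMMON_SUFFIXES):
            return True
    return False

def filter_inflections(word, candidates):
    return [c for c in candidates if not is_inflected_variant(word, c)]
```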

On the first point, I completely agree, using an existing word list would have been more sensible. With hindsight I’m not sure why I didn’t.

@er214 this is great! Minor things:

  • 2nd paragraph: know -> known
  • Add your twitter handle to Medium so that sharing it automatically at-mentions you.

Thanks @jeremy!

Some more fun results.

Given that the misspellings phenomenon looks like it’s driven by differences in the distributions in the source data, I wondered if you might be able to do other transformations. In particular, could you create a transformation vector to turn my turgid businesslike writing into flowery, poetic prose? Someone responded to Jeremy’s tweet wondering if you could build a transformation to make you sound smarter. Well, in a roundabout sort of way, it looks like you can.

Those who use long tricky words tend to use them all the time, whereas people like me with a more limited vocabulary don’t often use ‘resplendent’ or ‘supercilious’ in our writing. If this rarefied vocabulary crops up frequently only in particular sources, then it will probably sit in a different bit of the vector space.

So, I’ve tried building a transformation (a pretentiousness transformation?) by taking the average between pairs of words that mean roughly the same thing, but one of them is souped-up. For example,

('impoverished', 'poor'),
('diffident', 'shy'),
('congenial', 'agreeable'),
('droll', 'witty'),
('valiant', 'brave'),
('servile', 'dutiful')

And so on. The resulting transformation vector shares characteristic spikes in 2 dimensions with the spelling vector, but not in the way you might expect. The ‘souping-up’ direction is the same as the misspelling direction. I think this is because the ‘best spelled’ end of the spelling direction is based on business news sources, which tend to stick to fairly straightforward language.
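Concretely, the construction is along these lines (a simplified sketch: I’m taking the mean of the difference vectors, as with the spelling vector, and vectors stands in for whatever embedding lookup you have, e.g. a gensim KeyedVectors):

```python
import numpy as np

# Average the difference between each souped-up word and its plain counterpart.
pairs = [('impoverished', 'poor'), ('diffident', 'shy'),
         ('congenial', 'agreeable'), ('droll', 'witty'),
         ('valiant', 'brave'), ('servile', 'dutiful')]

pretentious_vec = np.mean(
    [vectors[fancy] - vectors[plain] for fancy, plain in pairs], axis=0)
```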

To generate interesting ‘souped-up’ alternatives for words I’ve applied the transformation (after chopping off the spikes it shares with the misspelling vector, otherwise we start creating misspellings), taken the nearest neighbours, and then removed from this set the nearest neighbours of the un-transformed query word. That last filtering step just gives you better results. For example, if we use the word ‘give’,

  • the nearest neighbours are ‘giving’, ‘make’, ‘let’, ‘want’, ‘you’, ‘get’, ‘take’, ‘bring’, ‘need’;
  • the transformed neighbours are ‘furnish’, ‘sufficient’, ‘compel’, ‘provide’, ‘bestow’, ‘obtain’, ‘afforded’, ‘confer’, ‘proffer’
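The neighbour-filtering step looks roughly like this (again a simplified sketch with the same assumed names; the spike-chopping is assumed to have already been applied to pretentious_vec):

```python
def souped_up_alternatives(word, vectors, pretentious_vec, topn=20):
    # neighbours of the untouched query word, to be excluded from the results
    plain_neighbours = {w for w, _ in vectors.most_similar(word, topn=topn)}
    # shift the query word along the 'souping-up' direction and look around it
    shifted = vectors[word] + pretentious_vec
    fancy_neighbours = vectors.most_similar(positive=[shifted], topn=topn)
    return [w for w, _ in fancy_neighbours
            if w != word and w not in plain_neighbours]
```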

It doesn’t always work, often giving you rubbish, but here are a few examples of the fancy-sounding alternatives this method gives you for a query word:

  • smelly: fusty, foul-smelling, odiferous, odorous, malodorous, fetid, putrid, reeking, mildewy
  • funny: comical, droll, hilariously, humourous, ironical, irreverent, sardonic, witty
  • fat: adipose, carbohydrate, cholesterol, corpulent, rotund, triglyceride
  • clever: adroit, artful, astute, cleverness, deft, erudite, shrewd
  • book: monograph, scholarly, treatise, tome, textbook, two-volume, nonfiction
  • loud: boisterous, cacophonous, clamorous, deafening, ear-splitting, resounded, sonorous, strident

Oh my, what you have done…

:wink:

Actually perhaps some good can come of this - what if you created a “pretentiousness score” for a document based on this? Then averaged that across articles from writers in (for instance) the NY Times?


Awesome post @er214. The idea to use an initial transformation vector to find more candidates and then revise the transformation vector looks like a cool example of semi-supervised learning.

I was wondering if looking at the difference between fastText and ULMFiT’s wiki word embeddings would show what changes the language model brings about. Any ideas?

This is very cool, @er214! Thanks a lot for going the distance and writing this up! :clap: I’ll link to your post in the next NLP newsletter! :slight_smile:


Great write-up.

An example of one of the many reasons I recommend the fast.ai forums as the best online DL community. Where else do original ideas like yours get highlighted and published this quickly, especially when they spawn from what can at best be considered a loosely related question? :slight_smile:

This is AWESOME work. I will cite it next week in a lightning talk on embeddings I’m giving. :slight_smile:

You might also be interested in this paper on formality in language:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.6280&rep=rep1&type=pdf

Among other things, it gives a formula for the formality score of a document:

F = (noun freq. + adjective freq. + preposition freq. + article freq.
     - pronoun freq. - verb freq. - adverb freq. - interjection freq. + 100) / 2

This might be useful to you someday in a loss function. (I know pretty good part-of-speech taggers exist, but I don’t know what the state of the art is.)
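If you want to play with it, a rough version using NLTK’s off-the-shelf tagger might look like this (the mapping from Penn Treebank tags to the paper’s categories is my approximation, e.g. DT covers all determiners, not just articles):

```python
import nltk  # assumes the 'punkt' tokenizer and the perceptron tagger are downloaded

FORMAL = ("NN", "JJ", "IN", "DT")     # nouns, adjectives, prepositions, articles
INFORMAL = ("PRP", "VB", "RB", "UH")  # pronouns, verbs, adverbs, interjections

def formality_score(text):
    # tag the document and compute each category's frequency as a percentage
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    freq = lambda prefixes: 100.0 * sum(t.startswith(prefixes) for t in tags) / len(tags)
    return (freq(FORMAL) - freq(INFORMAL) + 100) / 2
```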

BTW, my personal belief, based on careful thought and not backed by research, is that formality is closely tied to how hard it is for the audience to ask clarifying questions. Academic papers? Hard to ask questions (especially pre-email). Face-to-face speech? Really easy.

The ease of asking clarifying questions can be influenced by social status or rules, not just logistics. An audience with The Queen? Socially awkward to ask her questions. A court appearance? Difficult to ask questions of the judge.


Editorial comments on the post:

  • “We are now in a better position to explain why all the spelling mistakes been shunted into a parallel space?” should be “had been”.
  • “Equally, when we start reading percentage, economics, Government, etc, it’s reasonable to guess the text came from a new source and therefore will be spelled correctly” should be “news source”.
  • I would suggest putting the discussion of the “because” errors in between the “because” errors and the “and” errors. Keep stuff together.

I realized one way you could use the formality score would be to make a better graph of “spelling correctness” vs. formality (your news vs. IMDB graph). While all (almost all?) of the news corpus will be formal, the IMDB corpus is going to be really variable. You will have some commentators carefully (pedantically even) formulating responses with exquisite attention to precise language, and other readers just hacking sumthin’ 2gether, ykwim?

Check out nmslib. It’s a little easier to install and use, and if you’re not running on a GPU it’s faster than faiss.
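For example (a sketch; embedding_matrix and vocab are placeholder names for your word-vector matrix and the matching word list):

```python
import nmslib

# Build an HNSW index over the word vectors and query nearest neighbours.
index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(embedding_matrix)            # one row per word
index.createIndex({'post': 2}, print_progress=True)

ids, _ = index.knnQuery(embedding_matrix[vocab.index('give')], k=10)
neighbours = [vocab[i] for i in ids]
```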

You are really creative! :slight_smile:

Could this be used to create more understandable variants of inscrutable texts?

For example, a layman’s version of some legal document, or of slang-heavy lyrics?


A quick update on the Pretentiator:

I’ve tried scoring a few articles from different sources, all relating to Trump’s brilliant handling of the summit with North Korea. Results as follows, in ascending order of linguistic flair!

  • CNN: 0.026314
  • BBC: 0.034686
  • Guardian: 0.042723
  • New York Times: 0.044835
  • Washington Post: 0.047775
  • New Yorker: 0.050230
  • Financial Times: 0.051934
  • Reuters: 0.058019

I wasn’t expecting the FT and Reuters to feature at the top of this leaderboard. I think it probably has something to do with the tiny sample size (1 article), which is then skewed by certain quotes within the text.

As a further test I’ve tried it on a very reliable source of wordy prose - the London Review of Books (something my more learned friends enjoy). A random article from here scored a whopping 0.089, so I guess the measure does work a bit.

By the way, the score is the average per-word score above a certain threshold, i.e. it is just measuring the average frequency of particularly high-scoring words.
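In outline it’s something like this (a simplified sketch rather than the exact code; the threshold value is illustrative, and vectors/pretentious_vec are the same assumed names as before):

```python
import numpy as np

def pretentiousness_score(tokens, vectors, pretentious_vec, threshold=0.2):
    # score each word by how far it points along the 'souping-up' direction,
    # keep only scores above the threshold, and average over the document
    direction = pretentious_vec / np.linalg.norm(pretentious_vec)
    scores = []
    for w in tokens:
        if w not in vectors:
            continue
        v = vectors[w]
        s = float(np.dot(v / np.linalg.norm(v), direction))
        scores.append(s if s > threshold else 0.0)
    return float(np.mean(scores)) if scores else 0.0
```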


This is how we dealt with spelling errors in the Toxic Comment competition: see def unify_tokens(comment) in https://github.com/sermakarevich/jigsaw-toxic-comment-classification-challenge/blob/master/modelling.ipynb


Good to hear from you @sermakarevich. I was just looking over your toxic notebook a few days ago … good stuff.

@sermakarevich Regarding toxic comments competition: I had the impression that good end result was heavily dependent on preprocessing and not so much on the actual nn model. Can you comment on that? Did you experiment of how big of an impact was made by prep step vs model? Thank you for sharing you work! Really appreciate, I was struggling with this competition myself also :slight_smile: