NLP: Any libraries/dictionaries out there for fixing common spelling errors?

So I'm actually applying what I've learned to a real-world corpus filled with spelling errors.

Most of these became apparent after looking at which vocabulary items weren't already in the wiki103 vocab (close to 5,000 missing tokens).

For now, I’ve simply created a dictionary of almost 700 items (key = a regex that identifies the misspelled word, value = the replacement), but it is taking a long time to clean up my dataset of close to 500k documents.

Is there a Python library or dictionary already out there that does a pretty good job of fixing misspellings? And is there an approach that would be faster than mine (which takes about 25 minutes to process 24,000 documents)?

5 Likes

Not sure if this quite answers your question, but I think using a language model as the foundation of a spelling checker would be an interesting idea for a product.

3 Likes

There’s SymSpell, which has a Python implementation linked from its README: https://github.com/wolfgarbe/SymSpell/blob/master/README.md

I don’t know how well the Python port performs, but it has a reputation for being quick.
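For what it’s worth, usage of the symspellpy port looks roughly like this (untested on my side; the frequency dictionary filename is the one that ships with the package, so treat the path as an assumption):

```python
from symspellpy import SymSpell, Verbosity

# max edit distance 2 with a prefix length of 7 is the commonly used setting
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
sym_spell.load_dictionary("frequency_dictionary_en_82_765.txt",
                          term_index=0, count_index=1)

# ask for the single closest suggestion within edit distance 2
suggestions = sym_spell.lookup("relieable", Verbosity.CLOSEST, max_edit_distance=2)
for suggestion in suggestions:
    print(suggestion.term, suggestion.distance, suggestion.count)
```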

On the idea of using a language model - it’s an excellent one, especially because it could potentially help with detecting real-word errors that elude more basic approaches.

Yah, not quite an answer … BUT an excellent idea. I’m going to think on this for a bit.

Thanks for the link. Will check it out and see how it works with my dataset.

Also not a direct answer, but it relates to the suggestion of a language model. Assuming you are using a deep learning model of some sort with an embedding layer, you could try to create new vectors for the missing/misspelt words rather than correcting them.

The trouble I have found with spelling corrections is that you get quite a lot of false positives, i.e. inappropriate corrections get thrown in. The risk is that you might change the meaning of a sentence by changing the spelling of an unknown - rather than misspelt - word. (For example, proper nouns, i.e. names, often get corrected in this way.)

What I’m currently trying is a relatively simple approach: calculating the average of the word vectors of known words within a specified window either side of each unknown word. I’m using the pre-trained GloVe vectors.
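Roughly, the averaging step looks something like this (a minimal sketch, assuming `glove` is a dict-like mapping from word to numpy vector; window size and names are illustrative):

```python
import numpy as np

def init_unknown_vectors(tokenized_docs, glove, dim=300, window=5):
    """Average the GloVe vectors of known words within `window` tokens
    either side of every occurrence of each unknown word."""
    sums, counts = {}, {}
    for doc in tokenized_docs:
        for i, tok in enumerate(doc):
            if tok in glove:
                continue  # only build vectors for unknown words
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            context = [glove[w] for w in doc[lo:hi] if w in glove]
            if context:
                sums[tok] = sums.get(tok, np.zeros(dim)) + np.sum(context, axis=0)
                counts[tok] = counts.get(tok, 0) + len(context)
    return {tok: sums[tok] / counts[tok] for tok in sums}
```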

A more complex option that I’m also trying (with limited success so far …) is to train a language model with a mask on the embedding layer that only allows updates to the unknown word vectors. So, initialise the unknown vectors using the averaging method mentioned above, and then train the language model with masked updating of the embedding layer in order to learn more appropriate embeddings for the missing/misspelt words.
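In PyTorch terms, the masked updating could be done with a gradient hook on the embedding weights - a minimal sketch, with the vocabulary size and `unknown_ids` purely illustrative:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 40000, 300
embedding = nn.Embedding(vocab_size, emb_dim)
# embedding.weight would be initialised from GloVe, plus the averaged
# context vectors for the unknown words described above

unknown_ids = [37000, 37001, 37002]  # indices of the unknown-word rows (illustrative)
grad_mask = torch.zeros(vocab_size, 1)
grad_mask[unknown_ids] = 1.0

# zero the gradient for every known-word row on each backward pass,
# so only the unknown-word vectors get updated during training
embedding.weight.register_hook(lambda grad: grad * grad_mask)
```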

My guess is that the problem with either approach is that you will often only have a few examples of the missing words, and therefore little to learn from.

Yup. That is why I’m burning a few hours manually going through the tokens in my vocab that wiki103 doesn’t know about, and for high-frequency known misspellings, replacing those values during pre-processing. For example, a common misspelling for “reliable” is “relieable” … and my intuition is that my models will perform better if I fix the spelling error rather than have the LM try to learn the misspelling token in addition to the correct spelling. I’ll try both and report back.

And let me know how your experimentation goes, both approaches sound interesting and we’re all in so much new territory here.

As an alternative, I was wondering about using word vector similarities to find corrections for common spelling errors. Pre-trained vectors like GloVe contain lots of spelling mistakes, and I thought these would be close (ideally closest) to their correct counterparts.

It turns out that this isn’t true, but the reality is more interesting - all the spelling mistakes are clustered together. So, for example, if you search for the nearest neighbours of “relieable” you get:
['relieable', 'relyable', 'realible', 'relable', 'reliabe', 'realiable', 'relaiable', 'relaible', 'trustworth', 'trustfull', 'consitant', 'stabel', 'accuarate', 'acurrate', 'accruate']

Note that the correct spelling isn’t anywhere to be seen. In fact, it’s miles away - there are 424,816 words closer (using cosine distance) to “relieable” than “reliable”!

So, I wondered if you could ‘correct’ a spelling by applying a transformation to move us from the spelling mistakes area of vector space to the correctly spelled words area. It turns out you can.

I’ve taken the average difference between the first 8 misspellings in the list above and “reliable”. This creates the transformation vector. We simply subtract this from the vector of a misspelled word to shift us into the correctly spelled words area, and then look for nearest neighbours.
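In code the whole thing is only a few lines. A rough sketch using gensim’s KeyedVectors (assuming the GloVe file has already been converted to word2vec format; paths are illustrative):

```python
import numpy as np
from gensim.models import KeyedVectors

# assumes the GloVe text file was converted beforehand, e.g. with
# `python -m gensim.scripts.glove2word2vec`
vectors = KeyedVectors.load_word2vec_format("glove.42B.300d.w2v.txt")

misspellings = ["relieable", "relyable", "realible", "relable",
                "reliabe", "realiable", "relaiable", "relaible"]

# average offset from "reliable" to each of its misspellings
offset = np.mean([vectors[w] - vectors["reliable"] for w in misspellings], axis=0)

def correct(word, topn=5):
    """Shift a misspelled word out of the 'misspellings' region of the
    space and return the nearest neighbours of the shifted vector."""
    shifted = vectors[word] - offset
    return vectors.similar_by_vector(shifted, topn=topn)

print(correct("becuase"))
print(correct("definately"))
```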

The table below shows a few examples of the nearest neighbour, both of the incorrectly spelled word, and then of the transformed word.

| misspelled word | neighbours of misspelled word | neighbours of transformed word |
|---|---|---|
| becuase | becuase; becasue; beacuse; b/c; becouse | **because**; even; fact; sure; though |
| definately | definately; definetly; definatly; definitly; definitely | **definitely**; sure; certainly; well; really |
| consistant | consistant; consistantly; inconsistant; consistent; consitant | **consistent**; reliable; consistant; consistently; accurate |
| pakage | pakage; packge; pacage; pacakge; packege | **package**; packages; pakage; reliable; offer |
| basicly | basicly; basicaly; jsut; actualy; bascially | **basically**; simply; just; actually; only |
| ocur | ocur; occour; occurr; occure; happpen | ocur; **occur**; arise; happen; reliably |

In every case except the last, the closest neighbour to the transformed word is the correct spelling (in bold). Note also that the transformed results are skewed towards ‘reliable’. I’m sure you could get better results by building a transformation vector based on a wider sample than just misspellings of reliable.

Interesting that this is such a consistent result. It apears to sugest that speling erors and typoes strongly co-ocur, rather than appearing in isolation?

55 Likes

I’m also not sure if there is an easy, open-source way to do that - however I just came across https://www.perfecttense.com/ on the front page of https://news.ycombinator.com/shownew - they seem to have a free trial, so might be worth checking out!

Ed,

Building on what you are saying, I had been looking into this for a while and here is what I have found so far:

https://norvig.com/spell-correct.html - This is basically Python spellcheck 101, which introduces some basic concepts in this space.

Underarmor Spell Check - Harry Xue at Underarmor suggests that spelling is context-sensitive (and I tend to agree with this line of thinking). Words that are considered correct at one company/university/etc. could be considered incorrect at another organization. This is why I believe we haven’t seen a universal spell checker introduced (just more generic tools that are prone to error).

• My thought is to perhaps perturb Wikipedia text with spelling errors on the input side and match it against the correct Wikipedia text on the output side. In essence, I am thinking of an encoder-decoder approach that answers the following (see the sketch after this list):

  1. Is this word actually misspelled?
  • NASA is not incorrectly spelled, and neither is USA, even though they are not typical words (they are abbreviations).
  • This is where I get hung up in my thinking. An encoder/decoder seems like the right first approach, but somehow I tend to believe that an attention model would increase performance.
  2. If the word is misspelled, what is the correct replacement word?
  • Somehow this would involve iterating through your personal documents (or a company’s or university’s) and comparing them against the most likely word from Wikipedia.
  • I get a little hung up in my thinking here too, because I somehow believe that sense embeddings might be related to this second part of the problem.
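Here is the sketch mentioned above - a deliberately simplistic illustration of generating (noisy, clean) sentence pairs with random transpositions, which could serve as encoder input and decoder target; all names are illustrative:

```python
import random

def transpose_noise(word, p=0.1):
    """Randomly swap one pair of adjacent characters with probability p."""
    if len(word) > 3 and random.random() < p:
        i = random.randrange(len(word) - 1)
        word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word

def make_pair(sentence):
    """Return (noisy, clean) - encoder input and decoder target."""
    noisy = " ".join(transpose_noise(tok) for tok in sentence.split())
    return noisy, sentence

print(make_pair("the parliament was reliable and consistent in its decisions"))
```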

I would be more than happy to start a github on this if there is sufficient interest in working on this.

Best,

Ralph

4 Likes

I think your comment got cut off @RAB

Ok - Fixed my original post.

On the speed issue, you don’t necessarily have to fix the spelling errors as a one-off preprocessing step. You could leave the source data as is, and then just have an extended version of the dictionary that converts from word/token to integer ID. You then correct the spellings by mapping from the incorrect spelling to the ID of the correctly spelled word.

Alternatively, if you do need to preprocess, then working with tokenized text would be quicker as it would allow you to avoid regular expressions. Build a dictionary that maps from original spelling => correct spelling for all tokens in the source text. Then you can convert each token without using any regex.
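Concretely, something like this sketch - a single dictionary lookup per token instead of ~700 regex passes over each document (the `corrections` map stands in for the hand-built misspelling dictionary mentioned earlier):

```python
# stand-in for the full ~700-entry misspelling -> correct-spelling map
corrections = {"relieable": "reliable", "definately": "definitely"}

def fix_tokens(tokens, corrections):
    """Replace any token found in the corrections map, leave the rest alone."""
    return [corrections.get(tok, tok) for tok in tokens]

doc = "this product is very relieable and definately worth it".split()
print(" ".join(fix_tokens(doc, corrections)))
```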

1 Like

This is mind-blowing.

6 Likes

BTW @er214 I really hope you consider writing this up in a little medium post or similar - it’s so fascinating and cool.

5 Likes

“I’ve taken the average difference between the first 8 misspellings in the list above and “reliable”. This creates the transformation vector. We simply subtract this from the vector of a misspelled word to shift us into the correctly spelled words area, and then look for nearest neighbours.”
This is a great idea; it would work if the spelling mistakes are frequent, i.e. enough examples exist to obtain context. A way of generating enough examples is to apply meaningful transformations to the raw data and then use the approach above (in the case of word2vec, transform the words in the context window?). Some ways of generating these transformations are transpositions, keyboard-proximity-based substitutions, etc. Using these, there should be a way to make the language model “robust” to spelling errors.
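For instance, a tiny sketch of keyboard-proximity substitution (the adjacency map only covers a handful of keys and is purely illustrative):

```python
import random

# adjacency on a QWERTY layout; only a few keys covered here
KEYBOARD_NEIGHBOURS = {
    "a": "qwsz", "e": "wrsd", "i": "ujko", "o": "iklp",
    "r": "etdf", "s": "awedxz", "t": "ryfg", "n": "bhjm",
}

def keyboard_noise(word, p=0.15):
    """Replace one character with an adjacent key with probability p."""
    if word and random.random() < p:
        i = random.randrange(len(word))
        ch = word[i]
        if ch in KEYBOARD_NEIGHBOURS:
            word = word[:i] + random.choice(KEYBOARD_NEIGHBOURS[ch]) + word[i + 1:]
    return word

print([keyboard_noise("reliable") for _ in range(5)])
```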

2 Likes

The point of the post, as I understand it, is that this “spelling correction vector” turns out to be the same across all words. So it’s already done - no more words are required! (I’m sure there’s some opportunity to fine-tune it a bit, but the results already look amazingly accurate.)

2 Likes

This is really interesting!

I’m going to start doing some experiments on other word vectors (e.g., w2v, even wiki103, though I’m not sure how many, if any, misspellings exist there) and see if the results are similar. Lmk if you might be up for a summer project incorporating this into an LM-backed model that functions as a spell checker. I have some ideas.

-wg

That’s amazing! You could easily write a well-cited academic paper with this result.

By the way, what’s interesting is that the cluster seems to contain not only misspellings specifically, but more generally, variations of the word. For example, “b/c” is a commonly used and arguably correct substitute for “because”.
In order to single out misspellings, perhaps the model needs access to the character-level representation of the text as well.

1 Like

Some cool ideas in this thread! :slight_smile: @er214, as mentioned before, please keep track of your experimental results and consider writing this up, either as a blog post or academic paper. Let me know if you’d like any feedback.

Some ideas around work on zero-shot learning of ‘nonce’ and OOV words might also be relevant (I’ve written a paragraph about that here).

Another thing I would recommend: identify a dataset that has previously been used for evaluating spelling correction methods and run your experiments on it, to show that your model meaningfully improves upon previous work.

1 Like