Yup. That is why I’m burning a few hours manually going through the tokens in my vocab that wiki103 doesn’t know about and, for high-frequency known misspellings, replacing them during pre-processing. For example, a common misspelling of “reliable” is “relieable”, and my intuition is that my models will perform better if I fix the spelling error rather than have the LM try to learn the misspelled token in addition to the correct spelling. I’ll try both and report back.
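A rough sketch of what that pre-processing step could look like, in case it helps. Everything here is illustrative: `CORRECTIONS`, `oov_counts`, and `fix_spelling` are made-up names, the regex tokenizer is a stand-in for whatever tokenizer you actually use, and `known_vocab` is assumed to be a plain set of the tokens the pretrained wiki103 model knows.

```python
import re
from collections import Counter

# Hypothetical hand-built table: misspelling -> correct spelling.
# In practice this would be filled in after reviewing the frequent OOV tokens.
CORRECTIONS = {"relieable": "reliable"}

# Simplistic word matcher; a stand-in for the real tokenizer.
TOKEN_RE = re.compile(r"[A-Za-z']+")

def oov_counts(texts, known_vocab):
    """Count lowercased tokens that are missing from the pretrained vocab."""
    counts = Counter()
    for text in texts:
        for tok in TOKEN_RE.findall(text.lower()):
            if tok not in known_vocab:
                counts[tok] += 1
    return counts

def fix_spelling(text, corrections=CORRECTIONS):
    """Replace whole-word misspellings, keeping a leading capital if present."""
    def repl(match):
        word = match.group(0)
        fixed = corrections.get(word.lower())
        if fixed is None:
            return word  # not a known misspelling, leave untouched
        return fixed.capitalize() if word[0].isupper() else fixed
    return TOKEN_RE.sub(repl, text)
```

Calling `oov_counts(texts, known_vocab).most_common(200)` would give a list to review by hand, and `fix_spelling` would then run over each document before tokenization. Matching whole words rather than substrings avoids accidentally rewriting pieces of longer tokens.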
And let me know how your experimentation goes. Both approaches sound interesting, and we’re all in such new territory here.