Pre-training a language model on medical slang text for research? (slightly advanced)

I was curious whether pre-training a language model would work on text with different types of slang or shorthand language.

For example, medical text and notes, while written in English, are FAMOUS for using shorthand notations for diseases, events, etc. Doing NER (named entity recognition) research on medical notes has been hard because of this. I’ve worked with the MIMIC-III dataset before, which has a wonderful corpus of 55,000 medical notes. NER on that corpus performs poorly because of the differing terminology - for example, GloVe word embeddings match only 33% of the words found there.
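A coverage check like the 33% figure above can be computed by intersecting the corpus vocabulary with the embedding vocabulary. Here is a minimal sketch of that idea; the vocabulary and note tokens below are made-up stand-ins, not real GloVe or MIMIC-III data:

```python
# Toy sketch: what fraction of unique tokens in a notes corpus
# appear in a pretrained embedding vocabulary?
# Both inputs below are hypothetical examples for illustration.

def vocab_coverage(note_tokens, embedding_vocab):
    """Return the fraction of unique note tokens found in the embedding vocab."""
    unique = set(note_tokens)
    matched = unique & embedding_vocab
    return len(matched) / len(unique)

# Hypothetical embedding vocabulary (in practice, the keys of a loaded GloVe file)
embedding_vocab = {"patient", "presented", "with", "chest", "pain", "history"}

# Hypothetical clinical note tokens, mixing shorthand and full words
note_tokens = "pt presented w cp hx of mi patient with chest pain".split()

print(f"coverage: {vocab_coverage(note_tokens, embedding_vocab):.0%}")
```

With real GloVe vectors you would build `embedding_vocab` from the first column of the `glove.*.txt` file; the shorthand tokens (`pt`, `hx`, `cp`) are exactly the ones that fall outside it.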

I’m curious whether anyone has thoughts on which would be the better approach: training the language model on the medical notes themselves and then doing NER on the fine-tuned model, or training on more “traditional” medical English (PubMed, for example) and then fine-tuning on the medical notes?
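A complementary (and much cheaper) preprocessing option, whichever corpus the language model is trained on, is to expand known clinical abbreviations before running NER, so downstream models see familiar tokens. The dictionary below is a tiny illustrative sample, not a real clinical abbreviation resource:

```python
# Toy sketch: normalize clinical shorthand by dictionary lookup before NER.
# The abbreviation table here is a small hypothetical example; a real pipeline
# would use a curated clinical abbreviation list.

ABBREVIATIONS = {
    "pt": "patient",
    "hx": "history",
    "sob": "shortness of breath",
    "mi": "myocardial infarction",
}

def expand_shorthand(text: str) -> str:
    """Replace known abbreviations with their expansions (case-insensitive)."""
    out = []
    for token in text.split():
        out.append(ABBREVIATIONS.get(token.lower(), token))
    return " ".join(out)

print(expand_shorthand("pt with hx of MI presenting with sob"))
```

The obvious caveat is ambiguity (e.g. the same abbreviation can mean different things in different specialties), which is part of why learning the shorthand directly with a language model is attractive.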

Being able to work with medical notes is the next “frontier” of research, I believe, since (for example) cancer treatment studies rely on only 1-3% of the entire disease population. This is a big area of research and application, so I was very happy to see the language models in Lesson 4!


I think you should train on as many things as possible, starting with the least specific (Wikipedia) and ending with the most specific (doctors’ notes). BTW, if you’re not following my tweets, you may have missed this new medical NER paper from yesterday:

I am following! But I missed it, thanks for sharing!!!

Thanks for the tips on training, that’s interesting and helpful - I’ll give it a shot. The shorthand spelling of doctors’ notes provides an interesting challenge.

y, shnd splg dr’s nts is int chlng!