I was curious whether pre-training a language model would work on text with different types of slang or shorthand language.
For example, medical text and notes, while written in English, are famous for using shorthand notations for diseases, events, etc. Doing NER (named entity recognition) research on medical notes has been hard because of this. I’ve worked with the MIMIC-III dataset before, which has a wonderful corpus of 55,000 medical notes. NER on that corpus performs poorly because of the differing terminology - for example, GLoVe word embeddings only match 33% of the words found there.
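For anyone who wants to check this on their own corpus, here is a minimal sketch of how one might measure embedding vocabulary coverage. The tokenization, the toy note, and the stand-in `glove_vocab` set are all assumptions for illustration - in practice you'd load the real GLoVe vocabulary and tokenize the same way your NER pipeline does:

```python
from collections import Counter
import re

def coverage(corpus_tokens, embedding_vocab):
    """Fraction of unique corpus tokens (type coverage) and of total
    token occurrences (token coverage) found in the embedding vocab."""
    counts = Counter(corpus_tokens)
    hit_types = [t for t in counts if t in embedding_vocab]
    type_cov = len(hit_types) / len(counts)
    token_cov = sum(counts[t] for t in hit_types) / sum(counts.values())
    return type_cov, token_cov

# Toy example standing in for real data: "pt" and "s/p" are the kind of
# clinical shorthand that a general-purpose vocabulary tends to miss.
note = "pt s/p surgery . pt stable , continue medication"
tokens = re.findall(r"\S+", note.lower())
glove_vocab = {"patient", "surgery", "stable", "continue", "medication", ",", "."}

type_cov, token_cov = coverage(tokens, glove_vocab)
print(f"type coverage:  {type_cov:.0%}")   # 6 of 8 unique tokens
print(f"token coverage: {token_cov:.0%}")  # 6 of 9 occurrences
```

Reporting both numbers is useful: shorthand tokens like "pt" are often high-frequency, so token coverage can be much worse than type coverage suggests.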
I’m curious if anyone has thoughts on which would be the better approach: training the language model on the medical notes themselves and then doing NER on the fine-tuned model, or training on more “traditional” medical English (PubMed, for example) and then fine-tuning on the medical notes?
Being able to work with medical notes is the next “frontier” of research, I believe, since (for example) cancer treatment studies rely on only 1-3% of the entire disease population. This is a big area of research and application, so I was very happy to see the language models in Lesson 4!
Thanks!