Hi all -
As a personal project I’m trying to build a language model (LM) for clinical notes/patient notes. Whilst the text itself is in English, there’s a lot of medical jargon, shorthand, and typos that may make word-level models less than ideal.
After reading about ULMFiT I was considering doing the following, I’d love to get your feedback on whether it seems reasonable:
- Use SentencePiece with the clinical text data to generate a tokenizer.
- Rerun the ULMFiT code to tokenize and train an LM on Wikipedia (with the only major change being the tokenization).
- Use that new LM and finetune on the actual clinical text data.
I’m not sure whether step 2 is superfluous or whether it’d help; intuitively it seems that “learning” English first would be useful, even if the domains don’t map one to one.
One of my follow-up experiments will then be to take the encodings generated by this model and see whether they help augment patient predictions for conditions/readmissions. I’m excited to try out everything I’ve learned from Part 2!