As a personal project I’m trying to create an LM for clinical notes/patient notes. Whilst the text itself is in English, there’s a lot of medical jargon, shorthand, and typos that may make word-level models less than ideal.
After reading about ULMFiT I was considering doing the following; I’d love to get your feedback on whether it seems reasonable:
1. Use SentencePiece with the clinical text data to generate a tokenizer (rough sketch below the list).
2. Rerun the ULMFiT code to tokenize and train an LM on Wikipedia (with the only major change being the tokenization).
3. Use that new LM and fine-tune on the actual clinical text data.
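Here’s a rough sketch of what I have in mind for step 1, and how the same tokenizer would be reused for steps 2–3. The file names, vocab size, and example sentences are just placeholders, and it assumes a recent sentencepiece release that accepts keyword arguments:

```python
import sentencepiece as spm

# Step 1: train a subword tokenizer on the raw clinical notes
# (one sentence per line in clinical_notes.txt; names/values are placeholders)
spm.SentencePieceTrainer.train(
    input='clinical_notes.txt',
    model_prefix='clinical_sp',
    vocab_size=32000,
    character_coverage=1.0,
    model_type='unigram',
)

# Steps 2-3 would reuse this same model to tokenize both Wikipedia and the notes
sp = spm.SentencePieceProcessor()
sp.load('clinical_sp.model')

print(sp.encode_as_pieces('The patient was discharged in stable condition.'))
print(sp.encode_as_pieces('Pt c/o SOB, hx of CHF, r/o MI.'))  # shorthand-heavy note
```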
I’m not sure whether step 2 is superfluous or whether it’d help - intuitively it seems that “learning” English would be helpful, even if the domains don’t map one-to-one.
One of my follow-up experiments will then be to take the encodings generated via this model and see if they help augment patient predictions for conditions/readmissions. I’m excited to try out everything I’ve learned from Part 2.
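For that follow-up experiment, the rough plan is something like the following; the encoding and label files are hypothetical placeholders for whatever the fine-tuned encoder ends up producing, so this is just a sketch rather than anything tied to the fastai pipeline yet:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Pooled note encodings from the fine-tuned LM encoder, one row per patient,
# plus binary readmission labels (both files are hypothetical placeholders).
X = np.load('note_encodings.npy')
y = np.load('readmission_labels.npy')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print('held-out accuracy:', clf.score(X_test, y_test))
```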
Yes this is exactly the right approach. I’ve tried something similar and it worked great. Do let us know how you get along! BTW a SentencePiece vocab size of about 30k works quite well.
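If you want to sanity-check the vocabulary after training, something like this should do it (the model file name is a placeholder):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('clinical_sp.model')                   # the model trained on your clinical notes
print(sp.get_piece_size())                     # should be ~30000 if trained with vocab_size=30000
print([sp.id_to_piece(i) for i in range(10)])  # peek at the first few pieces
```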
If I understand correctly, your approach was to first create your SentencePiece model using your clinical text data … and afterwards, you applied that trained SP model to tokenize wiki-103 (which you then trained the LM on)?