SentencePiece + ULMFiT

(Binal Patel) #1

Hi all -

As a personal project I’m trying to create an LM for clinical/patient notes. Whilst the text itself is in English, there’s a lot of medical jargon, shorthand, and typos that may make word-level models less than ideal.

After reading about ULMFiT I was considering doing the following, I’d love to get your feedback on whether it seems reasonable:

  1. Use SentencePiece with the clinical text data to generate a tokenizer.
  2. Rerun the ULMFiT code to tokenize and train a LM on Wikipedia (with the only major change being the tokenization).
  3. Use that new LM and finetune on the actual clinical text data.

I’m not sure whether step 2 is superfluous or whether it’d help - intuitively it seems that “learning” English first would be helpful, even if the domains don’t map 1 to 1.
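For step 1, this is roughly what I have in mind (a minimal sketch using the plain sentencepiece Python package; the file names and vocab size are just placeholders I’d tune later):

```python
import sentencepiece as spm

# Train a subword model on the raw clinical text (one note/sentence per line).
spm.SentencePieceTrainer.Train(
    '--input=clinical_notes.txt '
    '--model_prefix=clinical_sp '
    '--vocab_size=30000 '
    '--model_type=unigram'
)

# Load the trained model and tokenize text into subword pieces.
sp = spm.SentencePieceProcessor()
sp.Load('clinical_sp.model')
print(sp.EncodeAsPieces('Pt c/o SOB and chest pain x2 days.'))
```

The same loaded model would then be reused to tokenize both the Wikipedia corpus in step 2 and the clinical text in step 3, so the vocab stays consistent across pretraining and fine-tuning.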

One of my follow up experiments will then be to take the encodings generated via this model, and see if they help augment patient predictions for conditions/readmissions. I’m excited to try everything I’ve learned from Part 2 out :slight_smile:

8 Likes

(Jeremy Howard (Admin)) #2

Yes this is exactly the right approach. I’ve tried something similar and it worked great. :slight_smile: Do let us know how you get along! BTW a SentencePiece vocab size of about 30k works quite well.

3 Likes

(Binal Patel) #3

Of course! I’m going to kick it all off tomorrow! I’ll post updates on this forum as I start getting results.

1 Like

(Anton) #4

Hi. How can I use SentencePiece in the current version of the library? Can I use a BPE embedder? Are there code examples somewhere, or do I have to hard-code it myself? :slight_smile:

0 Likes

(Julian Eisenschlos) #5

FYI, I started a thread on the dev channel with a possible implementation: Adding SentencePieceTokenizer to fastai/text.data
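For context, the rough shape of it looks something like this (an untested sketch assuming fastai v1’s BaseTokenizer interface; the model path is a placeholder):

```python
import sentencepiece as spm
from fastai.text import BaseTokenizer, Tokenizer

class SentencePieceTokenizer(BaseTokenizer):
    "Wrap a trained SentencePiece model as a fastai tok_func."
    def __init__(self, lang:str='en', model_path:str='clinical_sp.model'):
        self.lang = lang
        self.sp = spm.SentencePieceProcessor()
        self.sp.Load(model_path)

    def tokenizer(self, t:str):
        # Split one string of text into subword pieces.
        return self.sp.EncodeAsPieces(t)

    def add_special_cases(self, toks):
        # SentencePiece manages its own vocab, so nothing to add here.
        pass

# Pass the class as tok_func to fastai's Tokenizer wrapper.
tokenizer = Tokenizer(tok_func=SentencePieceTokenizer, lang='en')
```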

1 Like

(WG) #6

How did this work for you?

If I understand correctly, your approach was to first create your SentencePiece model using your clinical text data … and afterwards, you applied that trained SP model to tokenize wiki-103 (which you then trained your LM on)?

0 Likes

(Kerem Turgutlu) #7

I am not able to see SentencePiece in fastai; was it never added?

0 Likes