SentencePiece + ULMFit

Hi all -

As a personal project I’m trying to create a LM for clinical notes/patient notes. Whilst the text itself is in English, there’s a lot of medical jargon, shorthand, and typos that may make word-level models less than ideal.

After reading about ULMFiT I was considering doing the following; I’d love to get your feedback on whether it seems reasonable:

  1. Use SentencePiece with the clinical text data to generate a tokenizer.
  2. Rerun the ULMFiT code to tokenize and train a LM on Wikipedia (with the only major change being the tokenization).
  3. Use that new LM and finetune on the actual clinical text data.

I’m not sure whether step 2 is superfluous or whether it’d help - intuitively it seems that “learning” English would be helpful, even if the domains don’t map one to one.

One of my follow up experiments will then be to take the encodings generated via this model, and see if they help augment patient predictions for conditions/readmissions. I’m excited to try everything I’ve learned from Part 2 out :slight_smile:


Yes this is exactly the right approach. I’ve tried something similar and it worked great. :slight_smile: Do let us know how you get along! BTW a SentencePiece vocab size of about 30k works quite well.


Of course! I’m going to kick it all off tomorrow! I’ll post updates on this forum as I start getting results.


How can I use SentencePiece in the current version of the library? Can I use a BPE embedder? Are there code examples somewhere, or does it require hand-coding? :slight_smile:

FYI I started a thread on the dev channel with a possible implementation: Adding SentencePieceTokenizer to fastai/


How did this work for you?

If I understand correctly, your approach was to first create your SentencePiece model using your clinical text data … and afterwards, you applied that trained SP model to tokenize wiki-103 (which you then trained our LM on)?

I am not able to find SentencePiece in fastai. Was it never added?

I’ve added it to a DataBlock below:

dls_lm = DataBlock(blocks=TextBlock.from_df('text', is_lm=True, tok=SentencePieceTokenizer()),
                  get_x=ColReader('text'),  # from_df puts the tokenized text in a 'text' column
                  splitter=RandomSplitter(0.1)).dataloaders(df, bs=BS, seq_len=72, verbose=True)