SentencePiece + ULMFit

binalpatel · September 5, 2018, 3:41am

Hi all -

As a personal project I’m trying to create a LM for clinical notes/patient notes. Whilst the text itself is in English there’s a lot of medical jargon/shorthand/typos that may make word-level models less than ideal.
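To make the word-level concern concrete, here is a toy sketch in plain Python. Both vocabularies are invented for illustration, and the greedy longest-match splitter is only a stand-in for a learned subword model; it is not the algorithm SentencePiece uses.

```python
def word_tokenize(text, vocab, unk="<unk>"):
    """Word-level tokenization: any word outside the vocabulary becomes <unk>."""
    return [w if w in vocab else unk for w in text.lower().split()]


def subword_tokenize(text, pieces, unk="<unk>"):
    """Greedy longest-match split of each word into known pieces.

    Toy stand-in for a learned subword model (assumption: not SentencePiece's
    actual algorithm). Characters that match no piece are emitted as <unk>.
    """
    out = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            # Try the longest remaining substring first, then shrink.
            for j in range(len(word), i, -1):
                if word[i:j] in pieces:
                    out.append(word[i:j])
                    i = j
                    break
            else:
                out.append(unk)
                i += 1
    return out


# Hypothetical vocabularies, invented for this illustration.
word_vocab = {"patient", "has", "hypertension", "and", "tachycardia"}
pieces = {"patient", "pt", "has", "and", "hyper", "hypo",
          "tension", "tachy", "brady", "cardia"}

note = "pt has hypotension and bradycardia"
print(word_tokenize(note, word_vocab))
print(subword_tokenize(note, pieces))
```

In this toy case the word-level tokenizer loses the shorthand and both unseen terms to `<unk>`, while the subword splitter covers them with pieces it already knows.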

After reading about ULMFiT I was considering the following, and I’d love your feedback on whether it seems reasonable:

1. Use SentencePiece with the clinical text data to generate a tokenizer.
2. Rerun the ULMFiT code to tokenize and train a LM on Wikipedia (with the only major change being the tokenization).
3. Use that new LM and finetune on the actual clinical text data.

I’m not sure whether step 2 is superfluous or whether it’d help. Intuitively, “learning” English first seems helpful, even if the domains don’t map one to one.

One of my follow-up experiments will then be to take the encodings generated by this model and see if they help augment patient predictions for conditions/readmissions. I’m excited to try out everything I’ve learned from Part 2.

jeremy · September 6, 2018, 4:35am

Yes this is exactly the right approach. I’ve tried something similar and it worked great. Do let us know how you get along! BTW a SentencePiece vocab size of about 30k works quite well.

binalpatel · September 7, 2018, 1:38am

Of course! I’m going to kick it all off tomorrow! I’ll post updates on this forum as I start getting results.

king-mein · November 7, 2018, 4:31pm

Hi.
How can I use SentencePiece in the current version of the library? Can I use a BPE embedder? Are there code examples somewhere, or do I need to hardcode it?

eisenjulian · November 10, 2018, 1:11am

FYI, I started a thread on the dev channel with a possible implementation: “Adding SentencePieceTokenizer to fastai.text.data”.

wgpubs · May 26, 2019, 7:50pm

How did this work for you?

If I understand correctly, your approach was to first create your SentencePiece model using your clinical text data … and afterwards, you applied that trained SP model to tokenize wiki-103 (which you then trained your LM on)?

kcturgutlu · June 6, 2019, 6:00am

I can’t find SentencePiece in fastai. Was it never added?

DG11 · November 18, 2020, 5:31pm

I’ve added it to a DataBlock below:

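from fastai.text.all import *  # DataBlock, TextBlock, SentencePieceTokenizer, ColReader, RandomSplitter
BS = 64  # assumed batch size (BS is not defined in the original post); df is a DataFrame with a 'text' column
dls_lm = DataBlock(blocks=TextBlock.from_df('text', is_lm=True, tok=SentencePieceTokenizer()),
                  get_x=ColReader('text'),
                  splitter=RandomSplitter(0.1)).dataloaders(df, bs=BS, seq_len=72, verbose=True)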