ULMFiT - Medical/Clinical Text

(Binal Patel) #1

Hi all -

Splitting off from the Language Model Zoo thread as Piotr suggested.

I’m training a LM on a publicly available medical dataset named MIMIC-III. I’d love to collaborate if anyone is interested. While MIMIC-III is publicly available, you have to go through a certification and training process before you can gain access, since it’s sensitive patient information.

My goal is to create a LM that performs well enough, and generalizes well enough, that it can be used as a base for other medical classification/prediction tasks without too much additional work.

There is a recent paper from Peter Liu of Google Brain (https://arxiv.org/abs/1808.02622) that explores creating a LM on the data, though we can’t directly compare any results we achieve here against theirs, since it’s not clear if we can match their tokenization strategy.

I’ve committed my work so far to GitHub and will continue to do so as I try different experiments. I’ll try and write up some blog posts as well in the upcoming weeks.

Initial results have been promising, with a perplexity of 8.94 and accuracy of 57%. The model does also seem to be able to generate text that’s reasonably structured (even if clinically nonsensical). That’s after 10 epochs using use_clr_beta, which took ~13 hours to train from scratch.

s = 'blood loss '
sample_model(m, s)
…of bilateral leg veins that was ruled out by inline ) it . had at thistime obtained w / o sign of carotid stenosis which represents anterior ventricular serosanguinous values , no obvious source of infection seen . t_up cxr suggestive of normal superior months.disp:*60 toxins . 2057 and active vs , pulses normal , no focal deficits . per team thought to be due to bloodculture , that is not consistent with an ischemic event and considering slow outpatient monitoring was felt to be secondary to her recent 4:42 ( right sided thigh hypotonic ) . iv access was

For now I’m taking samples of the data to generate the training/validation sets so I can train them in a reasonable amount of time. Eventually when I find something that works really well I’ll retrain using all the data.
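A minimal sketch of that kind of sampling, assuming the notes are already loaded as a list of strings. The function name, the 10% fractions, and the seed are illustrative assumptions, not the settings used in these experiments.

```python
import random

def sample_split(notes, sample_frac=0.1, valid_frac=0.1, seed=42):
    """Take a random subset of the notes and split it into train/validation lists.

    sample_frac, valid_frac and seed are hypothetical defaults for illustration.
    """
    rng = random.Random(seed)
    n_sample = min(len(notes), max(2, int(len(notes) * sample_frac)))
    subset = rng.sample(notes, n_sample)          # sampling without replacement
    n_valid = max(1, int(n_sample * valid_frac))
    return subset[n_valid:], subset[:n_valid]     # (train, valid)

# Stand-in data; the real input would be the MIMIC-III note texts.
notes = [f"note {i}" for i in range(1000)]
train, valid = sample_split(notes)
print(len(train), len(valid))
```

Fixing the seed keeps the subset stable between runs, so different tokenization experiments are trained and validated on the same notes.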


(Sudarshan) #2

I am very interested. I’m working on my PhD with MIMIC. My idea is to use the pre-trained LM and see how a non-medical pre-trained corpus helps with medical data. I’m also looking at other methods of representing MIMIC data. Would love to chat!

I’m actually writing the LM code from scratch for a couple of reasons: to use the latest PyTorch version, and just to learn the whole thing. If you’re interested, maybe we can chat about the code and approach and help each other out.

(Binal Patel) #3

I’d love to chat and learn more about your approach, and what you’re working on. One thing that I do plan on eventually trying is using SentencePiece to learn a tokenization strategy on MIMIC -> tokenize Wikipedia and pretrain the LM -> apply that back to MIMIC (much like the original ULMFiT paper).

I’ll shoot you a message.


(Binal Patel) #4

Ran another experiment on Paperspace last night with the same underlying dataset, this time lower-casing all the text and adding a BOS and an EOS token before and after every piece of text (though in hindsight I realize having both is likely redundant).
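A rough sketch of that preprocessing step. The marker strings `xbos` and `xeos` and the function name are assumptions for illustration; the actual token strings used in the run are not stated here.

```python
# Hypothetical marker strings; any token that never occurs in the notes would do.
BOS, EOS = "xbos", "xeos"

def mark_text(text, add_eos=True):
    """Lower-case a note and wrap it in beginning/end-of-text markers."""
    parts = [BOS, text.lower().strip()]
    if add_eos:
        parts.append(EOS)
    return " ".join(parts)

print(mark_text("Pt admitted with Blood Loss."))
# xbos pt admitted with blood loss. xeos
```

When notes are concatenated into one long stream for language modeling, each EOS is immediately followed by the next BOS, which is why keeping both is likely redundant; `add_eos=False` drops the second marker.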


I did end up getting slightly worse results, with a final validation set log-perplexity of 2.31 and accuracy of 55% (whereas when preserving case information I had a log-perplexity of 2.18 and accuracy of 57%).
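For reference, log-perplexity and perplexity are related by an exponential. The sketch below assumes the log-perplexity is the mean cross-entropy loss in nats (natural log), which is an assumption about how the numbers above were reported.

```python
import math

def perplexity(log_ppl):
    """Convert a log-perplexity (mean cross-entropy in nats) to perplexity."""
    return math.exp(log_ppl)

for name, loss in [("case preserved", 2.18), ("lower-cased", 2.31)]:
    print(f"{name}: log-perplexity {loss:.2f} -> perplexity {perplexity(loss):.2f}")
# case preserved: log-perplexity 2.18 -> perplexity 8.85
# lower-cased: log-perplexity 2.31 -> perplexity 10.07
```

Under that assumption, the perplexity of 8.94 reported in the first post corresponds to a log-perplexity of about 2.19, so it agrees with the 2.18 quoted here to within rounding.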

My next run will be using a SentencePiece tokenizer instead of a word-level tokenizer. Sudarshan also has some great MIMIC-specific code that cleans up redacted data and replaces it with tokens instead; I’ll be trying that out as well. Afterwards I’ll select the best performing of the attempts and train them longer on all the data (and train a backwards model as well). As Jeremy pointed out in the Language Model Zoo thread, it may be advantageous to ensemble different models (word level and sub-word level) to get the best result.
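To illustrate the idea of replacing redacted spans with tokens (this is not Sudarshan’s code, just a sketch): MIMIC-III notes mark de-identified spans as `[** ... **]`, and a regex can map each one to a category token. The token names (`t_date`, `t_name`, `t_hospital`, `t_redacted`) and the keyword rules are hypothetical.

```python
import re

# De-identified spans in MIMIC-III notes look like [**2151-7-16**] or [**Hospital1 18**].
PLACEHOLDER = re.compile(r"\[\*\*(.*?)\*\*\]")

def redaction_token(match):
    """Map one redacted span to a coarse category token (hypothetical token names)."""
    inner = match.group(1).strip().lower()
    if re.fullmatch(r"[\d\-/]+", inner):   # only digits and date separators
        return "t_date"
    if "name" in inner:
        return "t_name"
    if "hospital" in inner:
        return "t_hospital"
    return "t_redacted"                     # fallback for everything else

def replace_redactions(text):
    return PLACEHOLDER.sub(redaction_token, text)

print(replace_redactions(
    "Seen by Dr. [**Last Name (STitle) 123**] at [**Hospital1 18**] on [**2151-7-16**]."
))
# Seen by Dr. t_name at t_hospital on t_date.
```

Collapsing the spans this way keeps the numeric IDs inside the placeholders from inflating the vocabulary with thousands of one-off tokens.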


(Binal Patel) #5

Using SentencePiece’s unigram algorithm with a vocab size of 60k on lower cased text performed pretty well. I’ll have to verify as I use it more - but I did have to use a lower learning rate than the other word-level models I’ve been experimenting with on the same corpus/splits to get good results.
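A sketch of how such a tokenizer could be trained with the `sentencepiece` Python package, assuming the lower-cased notes have been written to a plain-text file with one note (or sentence) per line. The file path and model prefix are placeholders, and only the unigram model type and 60k vocab size come from the run described above; every other trainer option is left at the library default.

```python
def sentencepiece_args(corpus_path, model_prefix, vocab_size=60000):
    """Build trainer options: unigram model, 60k vocab; paths are placeholders."""
    return dict(
        input=corpus_path,          # plain-text file, one note or sentence per line
        model_prefix=model_prefix,  # writes <prefix>.model and <prefix>.vocab
        vocab_size=vocab_size,
        model_type="unigram",
    )

def train_tokenizer(corpus_path, model_prefix, vocab_size=60000):
    """Train a SentencePiece model and return a processor loaded from it."""
    import sentencepiece as spm  # third-party: pip install sentencepiece
    spm.SentencePieceTrainer.train(
        **sentencepiece_args(corpus_path, model_prefix, vocab_size)
    )
    return spm.SentencePieceProcessor(model_file=f"{model_prefix}.model")
```

Usage would look something like `sp = train_tokenizer("mimic_lower.txt", "mimic_unigram")` followed by `sp.encode("blood loss", out_type=str)` to get the sub-word pieces. Note that training fails if the corpus is too small to support the requested vocab size.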

The interesting thing about this one especially was that in generating samples it sometimes tended to “misspell” by concatenating words together, or every once in a while create great sounding but completely made up medical terms and drug names. It also seemed to capture context better, more frequently entering in redacted information correctly within context (at least from the anecdotal sampling I’m doing).

Going to kick off a full data training run within the next week or two, likely using the SentencePiece tokenizer (maybe with a smaller vocab size), with some cleaning up of the text, and preserving case.



(Jeremy Howard (Admin)) #6

Thanks for the great updates! Note that comparing LM metrics across different vocabs isn’t necessarily meaningful - you really need a downstream task (e.g. classification) and to compare accuracy on that.
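One way to see why: per-token perplexity divides the total loss by the number of tokens, and the token count depends on the tokenizer. The numbers in this sketch are made up purely for illustration and are not results from this thread.

```python
import math

def per_token_perplexity(total_nll, n_tokens):
    """Perplexity per token from a total negative log-likelihood in nats."""
    return math.exp(total_nll / n_tokens)

# Hypothetical: two models assign the SAME total probability to the same text,
# but one splits it into 1000 word tokens and the other into 1300 sub-word pieces.
total_nll = 2300.0
word_ppl = per_token_perplexity(total_nll, n_tokens=1000)
subword_ppl = per_token_perplexity(total_nll, n_tokens=1300)

print(f"word-level: {word_ppl:.2f}, sub-word: {subword_ppl:.2f}")
```

The sub-word model reports a lower per-token perplexity even though both models are equally good at predicting the text, so the metric alone cannot rank models that use different vocabularies.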


(Binal Patel) #7

Definitely - a lot of my evaluation thus far has been qualitative on generated text and not too rigorous :).

I’ll be evaluating on the downstream task of predicting readmissions (possibly mortality). I’m hoping to tie a lot of this work back to the healthcare work I do professionally, where some of the most useful information on a patient record could very well be unused free text data.


(Mohamed Abdallah) #8

Hello @binalpatel,

I am very interested in this project. Are you still working on it? Can we have a chat?


(Binal Patel) #9

Hi! I’m planning on picking this back up and would be happy to chat. I’ll post an update soon - I’m aiming to get a pretrained model out on the MIMIC III text within the next two weeks or so, redoing some of the work/logic I did originally using fastai V1.

(Mohamed Abdallah) #10

Hello binalpatel,

I am working on fine-tuning Transformer-XL on a medical dataset, and I’m also planning to implement this paper:
https://arxiv.org/pdf/1808.02622.pdf
I am happy to chat and exchange ideas.


(Binal Patel) #11

That’s great, trying out the Transformer-XL arch is next on my roadmap right after I get something working decently with ULMFiT. Do you already have access to the MIMIC III data? Or are you working with another dataset? I haven’t been able to find a good “publicly” available medical NLP dataset other than MIMIC III’s notes.


(Anish Dalal) #12

I’ve also been experimenting with language modeling for clinical text and found this useful site that has over 5000 example physician dictations (broken down by speciality) that were transcribed to text (https://www.mtsamples.com/). Hope this helps and happy to chat and collaborate.