ULMFiT - Medical/Clinical Text

Hi all -

Splitting off from the Language Model Zoo thread as Piotr suggested.

I’m training a LM on a publicly available medical dataset named MIMIC-III. I’d love to collaborate if anyone is interested. While MIMIC-III is publicly available, you have to go through a certification and training process before you can gain access, since it contains patient information.

My goal is to create a LM that performs well enough, and generalizes well enough, that it can be used as a base for other medical classification/prediction tasks without too much additional work.

There is a recent paper from Peter Liu of Google Brain (https://arxiv.org/abs/1808.02622) that explores creating a LM on this data, though we can’t directly compare any results we achieve here against theirs, since it’s not clear whether we can match their tokenization strategy.

I’ve committed my work so far to GitHub and will continue to do so as I try different experiments. I’ll try and write up some blog posts as well in the upcoming weeks.

Initial results have been promising, with a perplexity of 8.94 and accuracy of 57%. The model also seems able to generate text that’s reasonably structured (even if clinically nonsensical). That’s after 10 epochs using use_clr_beta, which took ~13 hours to train from scratch.

s = 'blood loss '
sample_model(m, s)
…of bilateral leg veins that was ruled out by inline ) it . had at thistime obtained w / o sign of carotid stenosis which represents anterior ventricular serosanguinous values , no obvious source of infection seen . t_up cxr suggestive of normal superior months.disp:*60 toxins . 2057 and active vs , pulses normal , no focal deficits . per team thought to be due to bloodculture , that is not consistent with an ischemic event and considering slow outpatient monitoring was felt to be secondary to her recent 4:42 ( right sided thigh hypotonic ) . iv access was

For now I’m taking samples of the data to generate the training/validation sets so I can train them in a reasonable amount of time. Eventually when I find something that works really well I’ll retrain using all the data.


I am very interested. I’m working on my PhD with MIMIC. My idea is to use the pre-trained LM and see how a non-medical pre-trained corpus helps with medical data. I’m also looking at other methods of representing MIMIC data. Would love to chat!

I’m actually writing the LM code from scratch for a couple of reasons: to use the latest PyTorch version, and to learn the whole thing end to end. If you’re interested, maybe we can chat about the code and approach and help each other out.


I’d love to chat and learn more about your approach, and what you’re working on. One thing that I do plan on eventually trying is using SentencePiece to learn a tokenization strategy on MIMIC -> tokenize Wikipedia and pretrain the LM -> apply that back to MIMIC (much like the original ULMFiT paper).

I’ll shoot you a message.

Ran another experiment on Paperspace last night with the same underlying dataset, this time lower-casing all the text and adding a BOS and EOS token before and after every piece of text (though in hindsight I realize having both is likely redundant).
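The wrapping step might look something like this (a minimal sketch; the token strings and helper name here are illustrative assumptions, not the exact ones used in the run):

```python
# Illustrative preprocessing: lower-case each note and bracket it with
# begin/end-of-text markers. The token names "xbos"/"xeos" are assumptions.
BOS, EOS = "xbos", "xeos"

def preprocess(text):
    """Lower-case a note and wrap it in BOS/EOS markers."""
    return f"{BOS} {text.lower()} {EOS}"

print(preprocess("Blood Loss Noted."))  # -> "xbos blood loss noted. xeos"
```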


I did end up getting slightly worse results, with a final validation-set log-perplexity of 2.31 and accuracy of 55% (whereas preserving case information gave a log-perplexity of 2.18 and accuracy of 57%).
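For anyone comparing these numbers with the earlier post: perplexity is just the exponential of the log-perplexity (the mean cross-entropy), so both sets of figures are two views of the same metric:

```python
import math

def perplexity(log_ppl):
    """Perplexity is exp(mean cross-entropy), i.e. exp(log-perplexity)."""
    return math.exp(log_ppl)

print(round(perplexity(2.18), 2))  # ~8.85, the cased run
print(round(perplexity(2.31), 2))  # ~10.07, the lower-cased run
```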

My next run will use a SentencePiece tokenizer instead of a word-level tokenizer. Sudarshan also has some great MIMIC-specific code that cleans up redacted data and replaces it with tokens instead; I’ll be trying that out as well. Afterwards I’ll select the best-performing of the attempts and train them longer on all the data (as well as train a backwards model). As Jeremy pointed out in the Language Model Zoo thread, it may be advantageous to ensemble different models (word-level and sub-word-level) to get the best result.

Using SentencePiece’s unigram algorithm with a vocab size of 60k on lower cased text performed pretty well. I’ll have to verify as I use it more - but I did have to use a lower learning rate than the other word-level models I’ve been experimenting with on the same corpus/splits to get good results.

The interesting thing about this one especially was that in generating samples it sometimes tended to “misspell” by concatenating words together, or every once in a while create great-sounding but completely made-up medical terms and drug names. It also seemed to capture context better, more frequently filling in redacted information correctly within context (at least from the anecdotal sampling I’m doing).

Going to kick off a full data training run within the next week or two, likely using the SentencePiece tokenizer (maybe with a smaller vocab size), with some cleaning up of the text, and preserving case.


Thanks for the great updates! Note that comparing LM metrics across different vocabs isn’t necessarily meaningful - you really need a downstream task (e.g. classification) and compare that accuracy.


Definitely - a lot of my evaluation thus far has been qualitative on generated text and not too rigorous :).

I’ll be evaluating on the downstream task of predicting readmissions (possibly mortality). I’m hoping to tie a lot of this work back to the healthcare work I do professionally, where some of the most useful information on a patient record could very well be unused free text data.

Hello @binalpatel ,

I am very interested in this project. Are you still working on it? Can we have a chat?

Hi! I’m planning on picking this back up and would be happy to chat. I’ll post an update soon - I’m aiming to get a pretrained model out on the MIMIC III text within the next two weeks or so, redoing some of the work/logic I did originally using fastai V1.


Hello binalpatel,

I am working on fine-tuning Transformer-XL on a medical dataset, and I’m also planning to implement this paper:
https://arxiv.org/pdf/1808.02622.pdf
I am happy to chat and exchange ideas.


That’s great, trying out the Transformer-XL arch is next on my roadmap right after I get something working decently with ULMFiT. Do you already have access to the MIMIC III data? Or are you working with another dataset? I haven’t been able to find a good “publicly” available medical NLP dataset other than MIMIC III’s notes.

I’ve also been experimenting with language modeling for clinical text and found this useful site that has over 5000 example physician dictations (broken down by speciality) that were transcribed to text (https://www.mtsamples.com/). Hope this helps and happy to chat and collaborate.

I’ve met so many people with access to MIMIC III, including myself, that I believe it is not hard to get access. You will be asked to complete a free online course on patient privacy, which is not difficult. MIMIC III could be a great basis for shared projects, or for multiple projects trying to solve a common problem.

Hi all,

I also have a project with clinical notes at my institution, and I want to move to MIMIC III so we can all use the same baseline of data to compare our progress.

In my original work, I was trying to identify cases where the patient had a current illness of “shingles”. My thought was to use the basic ULMFiT approach: train a language model on our larger corpus of notes, then use transfer learning and 100 annotations to detect “shingles”. This problem is of interest to some colleagues in the Rheumatology department who work with immune-compromised patients, and those patients have a high incidence of shingles. So if I could make it work, one clinical research team would be happy.

To make the problem more realistic, I limited my fine-tuned classifier to labels for only those cases where shingles-like substrings were found in the text (via our SQL database). It’s easy to narrow the candidate cases to this subset, because our SQL database is plenty competent at finding substrings. In this case I included any note with one of the following substrings: “shingles”, “zoster”, or “post-herpetic neuralgia”.
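The filtering step amounts to a case-insensitive substring match. In Python (standing in for the SQL LIKE query for illustration, with made-up example notes) it might look like:

```python
# Keep only notes that mention one of the shingles-related substrings.
SHINGLES_TERMS = ("shingles", "zoster", "post-herpetic neuralgia")

def is_candidate(note_text):
    """Case-insensitive check for any shingles-related substring."""
    text = note_text.lower()
    return any(term in text for term in SHINGLES_TERMS)

notes = [
    "PMH: herpes zoster in 2012, hypertension.",
    "Patient denies chest pain or shortness of breath.",
    "Recommended zoster vaccination at next visit.",
]
candidates = [n for n in notes if is_candidate(n)]
print(len(candidates))  # 2 of the 3 example notes match
```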

When I manually created the labels, I found that in the above set of cases, the most common classes were:

  • shingles in the current illness. Most often I knew it was the current illness because it was discussed in the “history of present illness” or the date of “shingles” was within a few days of the date of the clinical note
  • past medical history of shingles. These were in the “past medical history” section and/or had dates far in the past (at least a month or more)
  • Post-Herpetic neuralgia - This is a long-lasting complication of shingles, and is a nightmare in its own right. If shingles strikes when your immunity is weak (especially in older or immune-compromised people), the rash goes away, but not the pain. The pain can be fairly extreme, it can last for months or years, and there is no medication that really makes it go away. But this isn’t what my researchers were looking for, so I counted it as a separate (negative) category since it happens weeks or months later, by definition.
  • Shingles vaccination - many docs chart that they recommend a shingles (or zoster) vaccination for their patient. I also used this class for people who actually got a shingles vaccination, and this was usually included in a section called “immunizations” or something similar, and included a list of other vaccinations that they have had.
  • shingles lab test - some very immunocompromised people had a blood test for evidence of the zoster virus in their blood (antibodies or virus particles).

With these labels on 195 patients, I trained the model on 80% and validated on 20%.
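The 80/20 split can be sketched as follows (patient IDs here are synthetic; splitting by patient ID keeps any one patient out of both sets):

```python
import random

# Shuffle the 195 labeled cases and split 80/20 by patient ID.
random.seed(42)
patient_ids = list(range(195))
random.shuffle(patient_ids)

cut = int(0.8 * len(patient_ids))              # 156 train / 39 validation
train_ids, valid_ids = patient_ids[:cut], patient_ids[cut:]
print(len(train_ids), len(valid_ids))          # 156 39
```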

When I did the basic fine-tuning and classification phase, the model wasn’t too good at finding shingles as the “current illness”. When I looked at the “attention” head, most of the time it wasn’t even focusing on the mention of shingles. The problem is that I just said “learn to predict these labels” without telling the model that I was looking for “shingles”. Each history-and-physical note often mentioned 100-200 additional medical problems, and I gave the model nothing to tell it to focus on the shingles/zoster phrases.

On my second try, I’m going to need to tell the model:

  • Look for mentions of the concept “shingles”
  • If you find the concept, make note of what section of the document you found the concept
  • If possible, find a date for the concept, and compare it with the date of the note, so you will know whether it is a “current” case or a case from the past.
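The date-comparison part of the third step is mechanical once the dates are extracted. A minimal sketch (the 30-day window is an illustrative assumption, not a clinically validated cutoff):

```python
from datetime import date, timedelta

def is_current(concept_date, note_date, window_days=30):
    """Treat a concept as 'current' if it falls within a window of the note date."""
    return (note_date - concept_date) <= timedelta(days=window_days)

note_date = date(2015, 6, 20)
print(is_current(date(2015, 6, 18), note_date))  # two days ago -> True
print(is_current(date(2014, 1, 5), note_date))   # over a year earlier -> False
```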

This made me realize what has already been talked about in the literature. To build a predictive model that understands a clinical note, it should determine the following things in the note:

  • It should identify mentions of clinical concepts (disease, signs, symptoms, lab tests and their values, procedures, etc)
  • It should know in what section of the note those concepts were mentioned. Some typical sections are chief complaint, current illness, past medical history, family history, immunizations (if present), physical exam, assessment/impression, and plan (for care management, testing, treatment). These sections are unreliably labeled, so you need a chunk of the network to match the pattern common to each of these sections
  • is there uncertainty about the observed concept? For instance, if a patient has a story that is consistent with a myocardial infarction, but they don’t have definitive proof, the providers will admit the patient to the ICU and say “Rule Out Myocardial Infarction” or just “R/O MI”. This means they don’t know for sure, but the risk of death if they discharge the patient home with an MI is so great that they admit the patient with the concept that “they might have an MI” and it needs to be “ruled out” with further tests and time before it is safe for them to go home. There is a lot of uncertainty in medicine and it is important for the reader to understand the terms that reflect uncertainty.
  • If possible, it should identify the date associated with that concept, including approximate dates like “last summer”, “2015”, etc.
  • It should flag a “negative” mention of the concept, e.g., “no chest pain”. When the provider affirms that something was not found, it means the question was asked, and the answer was “no”. These are usually asked for clinical concepts that are very important in diagnosing the patients problem or level of severity, so they shouldn’t be viewed as “NULL” or no mention.
  • in the family history, there may be a mention of the father dying of cancer, etc. It is important to note that this concept (cancer) does not apply to the patient but to a family member.
  • If the concept is a lab test, it should identify the value of that lab test (as distinct from other lab tests)

The model should be prepared to deal with 100-200 such concepts in the same clinical note.
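As a rough illustration of the section-identification piece, a rule-based baseline might just track the most recent header seen. The post's point stands, though: real headers are unreliably formatted, and this toy regex (with hypothetical example text) only catches the clean cases:

```python
import re

# Match a few common section headers at the start of a line.
SECTION_RE = re.compile(
    r"^(chief complaint|history of present illness|past medical history|"
    r"family history|immunizations|physical exam|assessment|plan)\s*:",
    re.IGNORECASE,
)

def current_section(lines):
    """Yield (section, line) pairs, tracking the most recent header seen."""
    section = None
    for line in lines:
        m = SECTION_RE.match(line.strip())
        if m:
            section = m.group(1).lower()
        yield section, line

note = [
    "History of Present Illness:",
    "Patient reports a painful rash consistent with shingles.",
    "Past Medical History:",
    "Hypertension. Herpes zoster in 2009.",
]
for section, line in current_section(note):
    print(section, "|", line)
```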

These properties are easy for a provider reading the note to find, but hard for a computer to identify, and it is hard to associate each property with the concept it belongs to. So this is a perfect problem for Deep Learning, and I’m very excited that with the tools we have, we or somebody can build a model that can do this.

The current “state of the art” is that most algorithms use keywords and rules, and they do pretty well at identifying the concepts mentioned in the note (60% to 80% accuracy), but the accuracy on identifying these other properties, including dates, is very low. If a net could understand these things at a human-level, this would unlock the majority of clinical information in the electronic medical record, and make many downstream predictions possible.

For anyone who finds this fun and rewarding, I would like to hand-label a bunch of clinical notes in MIMIC III for each of these issues above, and welcome everyone who can register for access to MIMIC III to work on building a network that can extract these concepts and properties. I am a physician, just barely and not practicing, but I believe I can get support from expert physicians at my institution, to make this a meaningful exercise. My goal is that we would publish anything that the experts find useful, and include the contributors as co-authors. In addition, any of you would be free to publish or blog on your experiences, and this would potentially be useful in advancing your reputation in DL and or clinical DL.

The MIMIC web site has a collection of challenges already listed. I believe that they encourage people to create their own challenges that can be hosted and publicized on the web site. I think it would be cool if we could do that, and participate as either several teams or several individuals, as meets your personal preferences. If we can do this, we can create a performance baseline on MIMIC III, on which anyone can build, both now and in the future, and this can advance the state of the art in this field.


Here is a great step-by-step article that makes it easy to apply for access.


It is indeed a very interesting problem.
I am also working on something similar. Is there a way to explicitly teach the model to look for something specific? Like “shingles”?

@binalpatel do you plan on releasing the trained weights publicly? It will be really helpful

I think there should be, but I don’t have anything working so far. My first try was to use the ULMFiT approach and build a big language model with all the notes, and then manually create about 200 document-level labels to flag “shingles” vs. no shingles. That didn’t work too well, and I think it’s because there were too many other things going on in the note besides shingles.


In the case of weights trained on a language model on MIMIC III, there is a chance that physionet.org would permit that. It’s even more likely that they would give me a place to share the weights among people who are registered to use MIMIC III. I haven’t had time from my day-job to look into that yet.

By the way, physionet.org just published a big batch of X-Rays, along with the free text diagnostic reports on each X-Ray. They are really doing great work!


Looks like an interesting topic of research. I will try to spend some time working on it! Thanks !