Dates and ULMFiT

xraycat · January 4, 2019, 5:31pm

I’m working with ULMFiT and the smoking challenge from i2b2 (medical records). The dataset contains a lot of dates. Does anyone have a good idea of how to deal with them?

Tchotchke · January 4, 2019, 6:59pm

Could you be more specific and maybe provide an example? For example, do you mean that you have dates within a body of unstructured text, such as “On Monday, January 22 the patient walked into the emergency room presenting with…”? Or do you mean that the data has a separate date field that you would like to incorporate in some way?

It would also be helpful to provide a link to the challenge that you reference, assuming it’s publicly available.

xraycat · January 4, 2019, 7:55pm

Yes, of course, sorry for not being clear. The training set and test set consist of approximately 500 discharge summaries (approx. 400-1000 words per summary), and the task is to find the smoking status of each patient. It is a multi-class classification problem where the labels are current smoker, non-smoker, past smoker, or unknown.

The discharge summaries could look like this (the example is made up, but is similar to the real data):

anonymized patient jmhvbukc454
5/28/1993 12:00:00 AM

ADMISSION DATE :

5-28-97

DISCHARGE DATE :

06-04-1993
…
…
…
DISCHARGE MEDICATIONS :
Patient is being sent home on Valium and hydrocodone 0.500 mg every three days…
…
…
…

So yes, it’s unstructured text. For example, not all patients have the header “DISCHARGE MEDICATIONS”, and as the example above shows, dates are not necessarily in the same format.

I think it could be interesting to use ULMFiT on this relatively small and very unstructured dataset. I think and hope that ULMFiT will perform pretty well despite that.

Tchotchke · January 6, 2019, 6:37pm

Thanks for the additional detail - that’s an interesting question. I’ve actually been doing a fair amount of work with ULMFiT on different datasets since Jeremy and Sebastian released the paper.

My initial thought is that with the way it is structured (at least for the admission and discharge date) you could remove that, as I don’t think it would be providing any information (you could of course test that hypothesis by running it once with the dates and once without).
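
If it helps, here is a minimal sketch of how the dates could be masked or removed with a regular expression before tokenization. It only covers the numeric formats in the made-up example above (5/28/1993 12:00:00 AM, 5-28-97, 06-04-1993), so the real notes will probably need more patterns. The function name mask_dates and the placeholder token _date are arbitrary choices for illustration, not part of ULMFiT or fastai.

```python
import re

# Numeric dates such as 5/28/1993, 5-28-97 or 06-04-1993, optionally
# followed by a clock time like "12:00:00 AM". Written-out dates
# ("January 22") and ISO dates (1993-06-04) are NOT covered by this sketch.
DATE_RE = re.compile(
    r"\b\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})\b"
    r"(?:\s+\d{1,2}:\d{2}(?::\d{2})?(?:\s*[AP]M)?)?",
    re.IGNORECASE,
)


def mask_dates(text, token="_date"):
    """Replace every matched date with `token`; pass token="" to delete dates."""
    return DATE_RE.sub(token, text)


note = "ADMISSION DATE : 5-28-97 DISCHARGE DATE : 06-04-1993"
with_token = mask_dates(note)         # dates replaced by the placeholder
without_dates = mask_dates(note, "")  # dates removed entirely
```

Running the classifier once on the masked text, once on the text with dates deleted, and once on the raw text would be a simple way to test whether the dates carry any signal.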

xraycat · January 7, 2019, 11:32pm

I tried both replacing the dates with _date and deleting them. I also tried cleaning the notes in different ways. I think it may help a little, but it was difficult to get good accuracy. With some basic regex and ULMFiT I could get a maximum F-score of 0.8. The top algorithms in the competition back in 2007 had a score of >0.9 as far as I remember, so a little bit of feature engineering is still needed (or maybe I don’t know how to tune it correctly). I also tried to isolate the sentences that contained “tobacco”, “smoke”, etc., and it seems like negation is difficult, as in “she denies the use of tobacco,” “she does not drink, use iv drugs or smoke” or “she is a nonsmoker, nondrinker…”.
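
To illustrate the sentence isolation and why negation is tricky, a rough sketch along these lines shows the idea. The keyword and negation lists are assumptions for illustration, not a vetted clinical lexicon, the sentence splitter is deliberately naive, and the function name smoking_sentences is made up.

```python
import re

# Illustrative keyword list (an assumption, not a clinical lexicon).
SMOKING_RE = re.compile(
    r"\b(?:non-?)?(?:smok\w*|tobacco|cigar\w*|nicotine)\b", re.IGNORECASE
)

# Very crude negation cues; real notes will need something more careful,
# because a cue anywhere in the sentence counts, even if it negates
# something other than smoking.
NEGATION_RE = re.compile(
    r"\b(?:denies|denied|no|not|never|without|non-?smok\w*)\b", re.IGNORECASE
)


def smoking_sentences(note):
    """Return (sentence, looks_negated) pairs for sentences mentioning smoking."""
    # Naive split: after ., ! or ? followed by whitespace, or on newlines.
    sentences = re.split(r"(?<=[.!?])\s+|\n+", note)
    hits = []
    for sentence in sentences:
        sentence = sentence.strip()
        if sentence and SMOKING_RE.search(sentence):
            hits.append((sentence, bool(NEGATION_RE.search(sentence))))
    return hits


note = (
    "She denies the use of tobacco. Patient was discharged home.\n"
    "He smokes one pack per day."
)
hits = smoking_sentences(note)
```

A sentence like “she does not drink but smokes daily” would be flagged as negated by this sketch, which is exactly the kind of case that makes negation hard.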

I’m thinking about writing my master’s thesis in NLP (I’m a biomedical engineering student), and my supervisor proposed that my project could be based on the MIMIC-III database. Have you tried using ULMFiT on that database?

Tchotchke · January 8, 2019, 1:59am

I have not used that database before - it seems like it could be an interesting source of data.

Proper tuning of ULMFiT, particularly of the learning rate, can have a big impact. Are you using the learning rate finder? Per Jeremy’s suggestion, you can keep training as long as your target metric keeps improving (i.e., don’t just rely on the loss).
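
The “keep training while the target metric improves” idea can be sketched independently of any library. The function name best_epoch, the patience parameter, and the per-epoch scores below are all hypothetical; this is not fastai’s API, just the selection logic under the assumption that you record your validation F-score after every epoch.

```python
def best_epoch(metric_per_epoch, patience=2):
    """Scan per-epoch validation scores (higher is better) and stop once the
    score has failed to improve for `patience` consecutive epochs.

    Returns (index_of_best_epoch, best_score).
    """
    best_idx, best_val, stale = -1, float("-inf"), 0
    for idx, val in enumerate(metric_per_epoch):
        if val > best_val:
            # New best score: remember it and reset the stale counter.
            best_idx, best_val, stale = idx, val, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_idx, best_val


# Made-up validation F-scores, purely for illustration.
scores = [0.61, 0.70, 0.74, 0.73, 0.72, 0.80]
stop_early = best_epoch(scores, patience=2)    # stops after two stale epochs
more_patient = best_epoch(scores, patience=3)  # survives the dip
```

With these made-up numbers a patience of 2 settles on epoch index 2, while a patience of 3 reaches the later, better epoch, so the choice of patience matters on a small validation set where the metric is noisy.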

It would be pretty surprising if ULMFiT couldn’t surpass results from 2007. On the different datasets I’ve applied it to, I’ve found really good performance.