Hi all -
Splitting off from the Language Model Zoo thread as Piotr suggested.
I’m training an LM on a publicly available medical dataset named MIMIC-III, and I’d love to collaborate if anyone is interested. While MIMIC-III is publicly available, you have to go through a certification and training process before you can gain access, since it contains patient information.
My goal is to create an LM that performs and generalizes well enough to serve as a base for other medical classification/prediction tasks without too much additional work.
There is a recent paper from Peter Liu of Google Brain (https://arxiv.org/abs/1808.02622) that explores creating an LM on this data, though we can’t directly compare any results we achieve here against theirs, since it’s not clear whether we can match their tokenization strategy.
I’ve committed my work so far to GitHub and will continue to do so as I try different experiments. I’ll try and write up some blog posts as well in the upcoming weeks.
Initial results have been promising: a perplexity of 8.94 and an accuracy of 57%. The model also seems able to generate text that’s reasonably structured (even if clinically nonsensical). That’s after 10 epochs with use_clr_beta, which took ~13 hours to train from scratch.
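For anyone wanting to relate the two numbers above: perplexity is just the exponential of the mean per-token cross-entropy loss (in nats), so you can convert between whatever loss your training loop prints and the perplexity I’m quoting. A minimal sketch:

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity = exp(mean per-token cross-entropy, in nats)."""
    return math.exp(cross_entropy_loss)

# A validation loss of ~2.19 nats corresponds to the perplexity above.
print(perplexity(2.19))
```

So a perplexity of 8.94 is the same claim as a validation loss of about 2.19.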
s = 'blood loss '
…of bilateral leg veins that was ruled out by inline ) it . had at thistime obtained w / o sign of carotid stenosis which represents anterior ventricular serosanguinous values , no obvious source of infection seen . t_up cxr suggestive of normal superior months.disp:*60 toxins . 2057 and active vs , pulses normal , no focal deficits . per team thought to be due to bloodculture , that is not consistent with an ischemic event and considering slow outpatient monitoring was felt to be secondary to her recent 4:42 ( right sided thigh hypotonic ) . iv access was
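For context, the sample above was generated by repeatedly scoring the context and sampling the next token. A minimal sketch of that loop, with temperature-scaled softmax sampling; `next_logits_fn` here is a hypothetical stand-in for a forward pass of the trained LM, not the actual model code:

```python
import math
import random

def sample_next(logits, temperature=1.0, rng=random):
    """Sample one token index from logits after temperature scaling + softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()                             # inverse-CDF sampling
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate(seed_tokens, next_logits_fn, n_tokens, temperature=1.0, rng=random):
    """Extend the seed by sampling one token at a time from the model."""
    tokens = list(seed_tokens)
    for _ in range(n_tokens):
        tokens.append(sample_next(next_logits_fn(tokens), temperature, rng))
    return tokens
```

Lower temperatures make the sampling nearly greedy; higher ones produce more varied (and more nonsensical) text.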
For now I’m taking random samples of the data to generate the training/validation sets so I can train in a reasonable amount of time. Once I find something that works really well, I’ll retrain using all the data.
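The subsampling I’m doing is nothing fancy; roughly this, assuming the notes are already loaded as a list of strings (the function name and fractions are just illustrative):

```python
import random

def subsample_split(notes, sample_frac=0.1, valid_frac=0.1, seed=42):
    """Take a random fraction of the corpus, then carve off a validation set."""
    rng = random.Random(seed)                    # fixed seed for reproducibility
    sample = rng.sample(notes, int(len(notes) * sample_frac))
    n_valid = int(len(sample) * valid_frac)
    return sample[n_valid:], sample[:n_valid]    # (train, valid)
```

With the full dataset the `sample_frac` would simply go to 1.0.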