Hello all, this is my first post here, but I’ve been casually following fast.ai on and off for the past year or so and have drawn a lot of inspiration from the lecture videos. A quick about me: I’m currently in a research group at Berkeley Lab, where we are working on using deep learning techniques to predict (and hopefully prevent) suicides among veterans seeking medical care. My own focus has been on using NLP approaches to glean predictions from clinical notes. In fact, I’m currently looking at applying @jeremy’s “ULMFiT” technique as one approach to see if we can get any benefit out of transfer learning (the results are looking promising too).
With that out of the way, here’s a problem I’m hoping someone can help me with. My dataset is very unbalanced (most hospital patients don’t actually attempt suicide, of course): only about 1% of the samples have positive labels. The two most obvious ways to train on such an unbalanced dataset are downsampling the training set (randomly subsampling the negative samples until the classes are balanced) and upsampling the training set (randomly sampling the positive samples with replacement until the classes are balanced). There are also smarter forms of data augmentation that I’d like to try but don’t know much about (any pointers here would be appreciated!).
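To make the two options concrete, here is a minimal NumPy sketch of what I mean by downsampling and upsampling. The function name `resample_indices`, the toy label array, and the seed are all made up for illustration (this is not my actual pipeline), and it assumes binary labels where the positives (1) are the minority class.

```python
import numpy as np

def resample_indices(labels, mode, seed=0):
    """Return shuffled indices of a class-balanced training set.

    mode="down": keep every positive, subsample negatives without replacement.
    mode="up":   keep every negative, sample positives with replacement.
    Assumes binary labels with positives (1) as the minority class.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    if mode == "down":
        neg = rng.choice(neg, size=len(pos), replace=False)
    elif mode == "up":
        pos = rng.choice(pos, size=len(neg), replace=True)
    else:
        raise ValueError("mode must be 'down' or 'up'")
    idx = np.concatenate([pos, neg])
    rng.shuffle(idx)
    return idx

# Toy labels with 1% positives (hypothetical data, same ratio as described above).
labels = np.zeros(10_000, dtype=int)
labels[:100] = 1

down = resample_indices(labels, "down")
up = resample_indices(labels, "up")
print(len(down), labels[down].mean())  # 200 samples, half positive
print(len(up), labels[up].mean())      # 19800 samples, half positive
```

In both cases the resampling is applied to the training split only; the validation and test splits would presumably stay at the natural 1% positive rate so the metrics reflect the real distribution.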
So I trained a series of models via downsampling and did really well (even when averaging across multiple subsamples). Unfortunately, when I do this the models don’t generalize very well to the whole dataset. I then tried training the same models via upsampling, but then my metrics (F1, AUC) are terrible, likely due to severe overfitting on the upsampled positive samples. Trying to address the overfitting doesn’t seem to help much, if at all.
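As a rough illustration of how big the gap between a balanced subsample and the whole dataset can be, here is a small sketch with made-up numbers. The 90% sensitivity and specificity are hypothetical, not my results, and `precision_at_prevalence` is just a helper name I picked. It applies Bayes' rule to show how precision changes when the same classifier is moved from a 50/50 set to a 1% positive rate.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Precision of a classifier with fixed sensitivity/specificity
    when the positive class makes up `prevalence` of the data."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical classifier: 90% sensitivity and 90% specificity.
balanced = precision_at_prevalence(0.9, 0.9, 0.5)   # balanced subsample
natural = precision_at_prevalence(0.9, 0.9, 0.01)   # ~1% positives
print(f"{balanced:.3f} {natural:.3f}")  # 0.900 0.083
```

Since F1 depends on precision, it moves with the class ratio even if the model itself is unchanged, so part of the drop on the whole dataset may just be this effect. AUC should in principle be less sensitive to the class ratio, so this can't be the whole story.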
My question, then: which of the two approaches (or others) do you think will result in a better, more generalizable model? Moreover, do you think this is something that transfer learning COULD help with, or is it likely to suffer from the same problems? Thanks in advance!