Unbalanced Data: Upsampling vs Downsampling

Hello all, this is my first post here, but I’ve been casually following fast.ai on and off for the past year or so and have drawn a lot of inspiration from some of the lecture videos. A quick bit about me: I’m currently in a research group at Berkeley Lab, where we are working on using deep learning techniques to predict (and hopefully prevent) suicides among veterans seeking medical care. My own focus has been on using NLP approaches to glean predictions from clinical notes. In fact, I’m currently looking at applying @jeremy’s “ULMFiT” technique as one approach, to see if we can get any benefit out of transfer learning (the results are looking promising too).

With that out of the way, here’s a problem I’m hoping someone can help me with. My dataset is very unbalanced (most hospital patients don’t actually attempt suicide, of course): only about 1% of the samples have positive labels. The two most obvious ways to train on such an unbalanced dataset are downsampling the training set (randomly subsampling the negative samples until the dataset is balanced) or upsampling it (randomly sampling the positive samples with replacement until the dataset is balanced). There are also smarter forms of data augmentation that I’d like to try but don’t know much about (any pointers here would be appreciated!).
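
A minimal sketch of the two naive strategies, assuming the training data sits in a pandas DataFrame `df` with a binary `label` column (1 = positive); the column name and seed are placeholders:

```python
import pandas as pd

def downsample(df, label_col="label", seed=42):
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    # Randomly subsample the negatives (without replacement) down to the positive count
    neg_down = neg.sample(n=len(pos), replace=False, random_state=seed)
    return pd.concat([pos, neg_down]).sample(frac=1, random_state=seed)

def upsample(df, label_col="label", seed=42):
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    # Randomly resample the positives (with replacement) up to the negative count
    pos_up = pos.sample(n=len(neg), replace=True, random_state=seed)
    return pd.concat([pos_up, neg]).sample(frac=1, random_state=seed)
```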

So I trained a series of models via downsampling and did really well (even averaging across multiple subsamples). Unfortunately, when I do this, the models don’t generalize very well to the whole dataset. I then tried to train the same models via upsampling, but then my metrics (F1, AUC) are crap, likely due to severe overfitting on the upsampled positive samples. Addressing the overfitting doesn’t seem to help much, if at all.

My question, then: which of the two approaches (or others) do you think will result in a better, more generalizable model? Moreover, do you think this is something that transfer learning COULD help deal with, or is it likely to suffer from the same problems? Thanks in advance!

For a similar case (ours was with images, not NLP), we found downsampling to work quite well; we discarded 99% of the negative samples. You could also use some kind of pre-processing to remove many of the negative samples and reduce the load on the deep learning networks.
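
For what it’s worth, a rough sketch of what such a pre-filter could look like: score every note with a cheap bag-of-words model and keep only the negatives it isn’t confident about. The DataFrame columns (`text`, `label`) and the 0.05 threshold are assumptions for illustration, not anything from our actual setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Cheap bag-of-words scorer over the raw notes
vec = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
X = vec.fit_transform(df["text"])
y = df["label"].values

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X, y)

# Keep all positives, plus only the negatives the cheap model still finds "hard"
probs = clf.predict_proba(X)[:, 1]
keep = (y == 1) | (probs > 0.05)
df_filtered = df[keep]
```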

You can try PyTorch’s WeightedRandomSampler to balance your dataset. The technique worked reasonably well for me on a moderately skewed distribution, but the resulting learner was not very stable.
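
A minimal sketch of that setup, where `labels` is the array of 0/1 targets for the training set and `train_ds` is the corresponding Dataset (both placeholders):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = np.asarray(labels)
class_counts = np.bincount(labels)        # e.g. [99000, 1000]
class_weights = 1.0 / class_counts        # rarer class gets a larger weight
sample_weights = class_weights[labels]    # one weight per training example

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(sample_weights),
    replacement=True,                     # lets minority samples repeat within an epoch
)
train_dl = DataLoader(train_ds, batch_size=64, sampler=sampler)
```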

On the other hand, addressing overfitting by varying dropmult (dropout multiplier) or wd (weight decay) has not worked at all.

Currently, the default classification approach with ULMFiT is what has been named a PoolingLinearClassifier, which is effectively a three-layer feed-forward network, but there’s a chance that deeper approaches, or even cascading/boosting techniques, may have something interesting to offer.
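
For anyone who hasn’t looked at it, a rough, self-contained sketch of the concat-pooling idea behind that head (not the fastai source; the layer sizes and dropout are placeholders):

```python
import torch
import torch.nn as nn

class ConcatPoolHead(nn.Module):
    """Concatenate the last hidden state with max- and mean-pools over time,
    then push the result through a small feed-forward stack."""
    def __init__(self, emb_dim, hidden, n_classes, p=0.2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.BatchNorm1d(3 * emb_dim),
            nn.Dropout(p),
            nn.Linear(3 * emb_dim, hidden),
            nn.ReLU(),
            nn.BatchNorm1d(hidden),
            nn.Dropout(p),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, enc_out):              # enc_out: (batch, seq_len, emb_dim)
        last = enc_out[:, -1]                # last time step
        mx = enc_out.max(dim=1).values       # max over time
        mean = enc_out.mean(dim=1)           # mean over time
        return self.layers(torch.cat([last, mx, mean], dim=1))
```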

I’m trying to beat a high baseline score so I’d really like to know what works for you.

Yeah, I thought about weighted sampling, but while that would help with the memory load and performance compared to the upsampled model, I think it would still run into overfitting on the positive samples. Apart from finding ways to remove noise from the dataset (like deciding which negative samples to drop), I’m thinking an ideal approach would be some form of data augmentation.

Have any of you tried the approaches used in the imbalanced-learn library? In particular, I wonder how effective techniques like SMOTE or ADASYN are at reducing overfitting when upsampling. That would involve perturbing the inputs slightly for the positive samples, though, which wouldn’t make much sense for NLP unless you do it after the embedding.
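
For reference, the imbalanced-learn API itself is tiny; a sketch assuming `X` is a dense feature matrix (e.g. document vectors rather than raw token ids) and `y` the 0/1 labels:

```python
from imblearn.over_sampling import ADASYN, SMOTE

# Both return a resampled feature matrix and label vector in which the
# minority class has been synthetically oversampled to match the majority.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)
X_adasyn, y_adasyn = ADASYN(random_state=42).fit_resample(X, y)
```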

Exactly the same thoughts. Synthetic sampling doesn’t quite make sense in NLP unless it’s applied to the embeddings (which is definitely worth experimenting with).

Thinking out loud here: it should be possible to pull out the sentence embeddings from the saved encoder, and then the synthetic samples could be batched with the rest of the data sent to the model. I think it would require re-writing get_rnn_classifer because, in the current implementation, MultiBatchRNN does everything from mapping word embeddings through to language modeling.

The only real problem I have with the approach is that it’s not obvious to me how you’d also be able to fine-tune the embeddings if you generate synthetic data from those embeddings. It would cause some kind of semantic drift, I think. Unless, thinking out loud now, you generate new samples from the updated embeddings after each iteration and use those for that iteration only. But that would be painfully slow, because (in my case) it would mean regenerating roughly 99% of my data every iteration just so the batches balance. So it seems like the easier path is leaving the embeddings fixed. But with weight tying on the embeddings, would I be messing anything up by doing that? Something to think about, I guess.

Also, an aside: since SMOTE uses KNN to generate the synthetic points, one would have to make sure to use cosine distance when generating the data (assuming the API supports that… surely it does). And if it works well, i.e. the positive samples tend to cluster reasonably well, one would hope such a technique would be roughly equivalent to replacing words with synonyms or something similarly reasonable, but I have absolutely no proof to back that up.
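
If I remember right, imbalanced-learn lets you pass a pre-configured NearestNeighbors estimator for `k_neighbors`, which would be one way to force cosine distance; treat this as a sketch, not gospel:

```python
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import NearestNeighbors

# n_neighbors=6 so that five neighbours remain after the query point itself,
# which is typically returned first.
cosine_knn = NearestNeighbors(n_neighbors=6, metric="cosine")
X_res, y_res = SMOTE(k_neighbors=cosine_knn, random_state=42).fit_resample(X, y)
```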

Another aside: I did see an example of SMOTE applied reasonably well to NLP in this Medium article. The big difference is that there it was used on TF-IDF features instead of learned embeddings.

Are the embeddings also being tuned during the classifier training stage?

What I’m saying is, while fine-tuning the classifier: 1) get the embeddings from the saved encoder; 2) apply SMOTE; 3) provide the synthetic + original samples to the data sampler. Everything else is the same as before, except that during training you take out the embedding layer, because you already have the embeddings.
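
To make those three steps concrete, a sketch with the encoder kept frozen. `encode_docs` is a hypothetical helper that runs the saved encoder over each document and returns one fixed-size (e.g. concat-pooled) vector per document, and the head sizes are placeholders:

```python
import numpy as np
import torch
from imblearn.over_sampling import SMOTE
from torch.utils.data import DataLoader, TensorDataset

# 1. Get the embeddings from the saved encoder (no gradients needed here).
with torch.no_grad():
    X = encode_docs(encoder, train_docs)      # hypothetical helper -> (n_docs, emb_dim)
y = np.asarray(train_labels)

# 2. Apply SMOTE in embedding space to balance the classes.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X.cpu().numpy(), y)

# 3. Feed synthetic + original vectors to a classifier head; the embedding
#    layer / encoder is left out because the inputs are already encoded.
ds = TensorDataset(torch.as_tensor(X_bal, dtype=torch.float32),
                   torch.as_tensor(y_bal, dtype=torch.long))
dl = DataLoader(ds, batch_size=64, shuffle=True)

head = torch.nn.Sequential(
    torch.nn.Linear(X_bal.shape[1], 50),
    torch.nn.ReLU(),
    torch.nn.Linear(50, 2),
)
# ...then train `head` with an ordinary cross-entropy loop.
```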

One potential point of failure I see is if the embeddings cannot be mapped correctly to their respective labels. Also, this wouldn’t work if the embeddings are being fine-tuned while the classifier is training.

My understanding is that once all the layers are unfrozen, the embeddings update as well; at least, that’s what I’ve been assuming. Of course, you may not be doing that at all if you don’t have enough data, but I’ve personally seen some improvement from unfreezing all layers for a few epochs after fine-tuning the last two layers. And yeah, just extracting the embeddings should be pretty easy to do (though I’ve had more than a few headaches digging into the source code, so maybe I’m wrong).
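
For context, the schedule I mean looks roughly like this in fastai-style code (the exact `fit` call depends on the library version, so treat it as a sketch):

```python
# Fine-tune only the last couple of layer groups first...
learn.freeze_to(-2)
learn.fit_one_cycle(2, 1e-3)

# ...then unfreeze everything, embeddings included, and train the whole
# stack for a few more epochs at a lower learning rate.
learn.unfreeze()
learn.fit_one_cycle(3, 1e-4)
```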

In that case my recommendation might not quite be correct. Please do write an update if you find something that works.