I have a general question about oversampling and I’m hoping the community can point me in the right direction.

My current understanding of oversampling tabular data is that if I have 100 records of one class and only 1 or 2 records of another, I copy the rare records until the classes are balanced. So consider a dataset with 100 healthy individuals and 1 sick one. If I oversample, the model’s notion of “sick” becomes whatever that single sick person looks like.

Now suppose there are 10 sick people instead of 1. I see the same issue arising, since each sick person is still copied 10 times. Instead of one record of a sick 25-year-old male with brown hair, the model sees that person 10 times. If the dataset also contained a healthy 25-year-old male with brown hair, wouldn’t the model immediately predict “sick” for anyone matching those identifiers, because the duplicated sick copies now outweigh the single healthy match?
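For what it’s worth, the duplication scheme described above (often called random oversampling) can be sketched in a few lines of plain Python; the fields and values here are made up for illustration:

```python
import random

# Toy imbalanced dataset: 6 healthy records (sick=0), 2 sick records (sick=1).
records = (
    [{"age": a, "sick": 0} for a in [30, 41, 25, 52, 37, 29]]
    + [{"age": 25, "sick": 1}, {"age": 60, "sick": 1}]
)

majority = [r for r in records if r["sick"] == 0]
minority = [r for r in records if r["sick"] == 1]

# Random oversampling: draw from the minority class with replacement
# until it matches the majority class size. Note these are EXACT
# duplicates, which is precisely the concern raised above.
rng = random.Random(0)
resampled = majority + [rng.choice(minority) for _ in range(len(majority))]

counts = {0: 0, 1: 0}
for r in resampled:
    counts[r["sick"]] += 1
print(counts)  # both classes now have 6 records
```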
Yes, after oversampling you need to adjust the decision threshold at inference time - it won’t be 0.5 any more!
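One way to see why: duplicating minority rows shifts the class prior the model is trained on, so its probability estimates come out inflated for the minority class. Under the usual assumption that resampling changes only the prior (not the class-conditional distributions), you can map a score from the resampled world back to the original prevalence with an odds correction; this is a sketch of that idea, not something from the thread:

```python
def correct_probability(p_s, pi_true, pi_sampled):
    """Map a model probability p_s, learned on resampled data with
    minority prevalence pi_sampled, back to the original prevalence
    pi_true via an odds (prior-shift) adjustment."""
    odds_s = p_s / (1.0 - p_s)
    # Rescale the odds by the ratio of true to resampled prior odds.
    odds = odds_s * (pi_true / (1.0 - pi_true)) / (pi_sampled / (1.0 - pi_sampled))
    return odds / (1.0 + odds)

# A score of 0.5 on data oversampled to 50/50 corresponds to the
# base rate (1%) in the original distribution:
print(round(correct_probability(0.5, 0.01, 0.5), 4))  # 0.01
```

Equivalently, you can leave the model’s scores alone and move the decision threshold by the same odds ratio instead.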
Here is an interesting article on SMOTE which is an oversampling technique that seems pretty cool: https://beckernick.github.io/oversampling-modeling/
and here is the paper introducing SMOTE (from 2002): https://arxiv.org/pdf/1106.1813.pdf
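The core idea in the paper is simple enough to sketch: instead of duplicating minority rows, SMOTE interpolates between a minority point and one of its minority-class nearest neighbors, so the synthetic rows are new points rather than exact copies. A minimal, unoptimized version (brute-force neighbor search, pure Python, illustrative only):

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic minority points. For each one, pick a
    random minority point, find its k nearest minority neighbors
    (brute force), and interpolate a new point somewhere on the
    segment between the point and a randomly chosen neighbor."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: dist2(x, p))[:k]
        nb = rng.choice(neighbors)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + u * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (3.0, 3.0)]
new_points = smote_sample(minority)
print(new_points)  # 4 synthetic points, each between two original minority points
```

In practice you would use a library implementation (e.g. imbalanced-learn) rather than rolling your own, but the sketch shows why SMOTE avoids the exact-duplicate problem from the original question.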
This seems to be the most popular way to oversample, and it’s easy to do. Just make sure to oversample only the training set, never the validation or test set.
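To make the “training set only” point concrete, here is a sketch of the right ordering with a hand-rolled stratified split (no particular library assumed): split first, then oversample only the training portion, and evaluate on the untouched validation set.

```python
import random

rng = random.Random(0)

def make_rows(n, label):
    # Each row is (feature, label); the feature is just random filler.
    return [(rng.random(), label) for _ in range(n)]

# Toy data, split in a stratified way so both sides see each class:
# 15 majority / 3 minority for training, 5 majority / 1 minority held out.
train = make_rows(15, 0) + make_rows(3, 1)
valid = make_rows(5, 0) + make_rows(1, 1)

# Oversample the minority class in the TRAINING set only; the
# validation set keeps its original, imbalanced distribution, so
# metrics computed on it reflect the real-world class balance.
train_min = [r for r in train if r[1] == 1]
train_maj = [r for r in train if r[1] == 0]
train = train_maj + [rng.choice(train_min) for _ in range(len(train_maj))]

def counts(rows):
    return {0: sum(r[1] == 0 for r in rows), 1: sum(r[1] == 1 for r in rows)}

print(counts(train))  # balanced: {0: 15, 1: 15}
print(counts(valid))  # untouched: {0: 5, 1: 1}
```

Doing it in the other order (oversample, then split) lets duplicated minority rows land on both sides of the split, which leaks training data into validation and inflates your metrics.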