For the last couple of weeks I’ve been trying to apply my tabular fastai experiments to real-world problems at the company I work for.
In one of his lectures Jeremy mentioned that it’s worth trying to balance your classes before you start training. Now I know at least one reason why it can help.
Ok, let’s imagine we’re trying to solve some real-world problem. For example, we have tabular data about a number of users: their behavior, some profile data, etc. Now we want to classify them in a way that benefits your business. It can be “Will this user be interested in / buy a product?” (is s/he a potential customer) or “Is this person a bot / potential intruder?”. The common thing here is that most of the users who visit your site won’t belong to the class you are interested in (most of the visitors won’t buy anything, and fortunately most of them are not intruders either). But detecting these classes (‘buyer’ and ‘intruder’), even though they are tens or hundreds of times less common, is the goal you want to achieve. From the loss function’s point of view, however, with such class imbalance it’s more effective to just predict class 0 (non-intruder/non-buyer), as it is hundreds of times more probable that the next sample will be 0. The rule “predict 0 unless you are very, very sure of the opposite” gives a lower loss value. This shifts the whole network towards class 0.
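To make this concrete, here is a tiny numeric sketch (the 1% positive rate and the model’s 0.01 probability are made-up numbers, not from my real data). A degenerate model that is always ~99% sure the answer is class 0 gets 99% accuracy and a small average cross-entropy loss, while catching zero ‘buyers’:

```python
import numpy as np

# hypothetical dataset: 1% of 10,000 users are class 1 ('buyer'/'intruder')
y = np.zeros(10_000, dtype=int)
y[:100] = 1

# a degenerate model that always predicts P(class 1) = 0.01
p1 = np.full_like(y, 0.01, dtype=float)

# accuracy of "always class 0": 99%, despite missing every positive
accuracy = np.mean((p1 > 0.5) == y)

# average binary cross-entropy is also small (~0.056)
eps = 1e-12
loss = -np.mean(y * np.log(p1 + eps) + (1 - y) * np.log(1 - p1 + eps))

print(accuracy)         # 0.99
print(round(loss, 3))   # 0.056
```

So by both metrics the network looks good while being practically useless, which is exactly why the optimizer happily drifts towards class 0.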
But good prediction of common users is not what we are trying to get. We are trying to find the pearls: the class 1 (‘buyer’ or ‘intruder’) persons. So, even though the original model is more accurate, the balanced model (trained on balanced classes) is more useful in practice, as it helps to find many more samples of the valuable class. (Naturally, this only works for cases where a false prediction of class 1 is much more tolerable than vice versa.)
We can achieve this in two ways: balance the number of samples in each class, or alter the loss function (make the loss value smaller when class 1 is correctly found). Balancing the dataset looks like the easiest thing to do here.
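Both options can be sketched in a few lines of plain pandas/PyTorch (not fastai-specific; the `df` frame, the `'target'` column name, and the 8:2 class ratio are all assumptions for illustration):

```python
import pandas as pd
import torch
import torch.nn as nn

# toy frame standing in for real tabular data: 8 common vs 2 rare samples
df = pd.DataFrame({'feature': range(10), 'target': [0] * 8 + [1] * 2})

# Option 1: oversample the rare class until the counts match
counts = df['target'].value_counts()
minority = df[df['target'] == 1]
extra = minority.sample(counts[0] - counts[1], replace=True, random_state=0)
balanced = pd.concat([df, extra], ignore_index=True)
print(balanced['target'].value_counts())  # both classes now have 8 rows

# Option 2: keep the data as-is but weight the loss, so a mistake
# on class 1 costs 4x more (the inverse of the class ratio)
weights = torch.tensor([1.0, counts[0] / counts[1]])  # [1.0, 4.0]
loss_fn = nn.CrossEntropyLoss(weight=weights)
```

Option 1 changes nothing about the model or training loop, which is why it’s the easy path; Option 2 avoids duplicating rows but requires touching the loss.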
As a result of this processing, I think the most important problem turns out to be overfitting on highly duplicated samples. And I don’t yet see how it can be fully avoided.
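One partial mitigation (a sketch under the assumption that balancing is done by duplicating minority rows; `df` and `'target'` are hypothetical names): split off the validation set first and oversample only the training part. This doesn’t stop the model from overfitting on the duplicates, but it does stop the duplicates from leaking into validation and inflating the metrics:

```python
import pandas as pd

# toy frame: 16 common vs 4 rare samples
df = pd.DataFrame({'feature': range(20), 'target': [0] * 16 + [1] * 4})

# 1. split first, so validation keeps the original, imbalanced distribution
#    (a crude stratified split: the last 25% of each class)
valid = pd.concat([g.tail(len(g) // 4) for _, g in df.groupby('target')])
train = df.drop(valid.index)

# 2. oversample the minority class in the training part only
counts = train['target'].value_counts()
minority = train[train['target'] == 1]
extra = minority.sample(counts[0] - counts[1], replace=True, random_state=0)
train_balanced = pd.concat([train, extra], ignore_index=True)

# no duplicated training row can appear in validation
assert set(valid.index).isdisjoint(train.index)
```

If you oversample before splitting, copies of the same rare row end up on both sides of the split, and the validation score partly measures memorization.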
Applying this approach to my company’s data, I’ve noticed that on new data (where I don’t know the classes), the distribution of classes predicted by the balanced model fits the real data better than that of the original model.