Why it can be worth balancing tabular data in real-world cases

For the last couple of weeks I’ve been trying to apply my tabular fastai experiments to real-world problems at the company I work for.
Jeremy mentioned in one of his lectures that it’s worth trying to balance your classes before you start training. Now I know at least one reason why it can help.
Ok, let’s imagine we’re trying to solve some real-world problem. For example, we have tabular data about a number of users: their behaviour, some profile data, etc. Now we want to classify them in a way that benefits the business, e.g. “Will this user be interested in / buy a product?” (is s/he a potential customer) or “Is this person a bot / potential intruder?”. The common thing here is that most of the users who, for example, visit your site won’t belong to the class you’re interested in (most visitors won’t buy anything, and fortunately most of them aren’t intruders either). But detecting those classes (‘buyer’ and ‘intruder’), even though they are tens or hundreds of times less common, is exactly the goal you want to achieve. From the loss function’s point of view, though, with such a class imbalance it is more effective to just predict class 0 (non-intruder / non-buyer), since it’s hundreds of times more probable that the next sample will be 0. The strategy “predict 0 unless you are very, very sure of the opposite” gives a lower loss value, and this shifts the whole network towards class 0.
But good prediction of common users is not what we’re after. We’re trying to find the pearls: the class 1 (‘buyer’ or ‘intruder’) cases. So even though the original model is more accurate, the model trained on balanced classes is more useful in practice, because it helps to find many more of the valuable class. (Of course this only works for cases where a false prediction of class 1 is much more tolerable than the other way round.)
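To make that concrete, here is a minimal sketch with made-up numbers (assuming a roughly 1:100 imbalance): a “model” that always predicts class 0 scores about 99% accuracy while finding none of the valuable class.

import numpy as np

# Hypothetical data set with ~1% positives (class 1 = buyer / intruder)
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)

# The lazy strategy the loss pushes the network towards: always predict class 0
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()   # ~0.99 -- looks great
recall = y_pred[y_true == 1].mean()    # 0.0 -- not a single buyer/intruder found
print(f"accuracy: {accuracy:.3f}, class-1 recall: {recall:.3f}")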
We can achieve this in two ways: balance the number of samples in each class, or alter the loss function (e.g. make the loss penalise mistakes on class 1 more heavily). Balancing the dataset looks like the easiest thing to do here.
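If you’d rather go the loss-function route, a minimal sketch would be a class-weighted cross-entropy; the weight values here are an assumption to be tuned, and with fastai v1 you can pass it to the learner via loss_func:

import torch
import torch.nn as nn

# Penalise mistakes on the rare class ~100x more than on the common class.
# The exact weights are an assumption -- tune them against your validation set,
# and move them to the GPU (class_weights.cuda()) if that's where you train.
class_weights = torch.tensor([1.0, 100.0])
weighted_loss = nn.CrossEntropyLoss(weight=class_weights)

# e.g. learn = tabular_learner(data, layers=[1000, 500], metrics=accuracy,
#                              loss_func=weighted_loss)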

The most important problem that appears as a result of this processing is, I think, overfitting on the highly duplicated samples. But I don’t currently see how that can be completely avoided in any case.

Applying this approach to my company’s data, I’ve noticed that the distribution of predicted classes on new data (for which I don’t know the true class) fits the real-world distribution better with the balanced model than with the model trained on the original data.

Over-weighting the rare target was critical in my real-world fraud model. I used a heavy dose of dropout to keep overfitting in check.


I’ve also used more dropout, as well as embedding dropout.
But maybe in my case it was not as critical as it could be, because my valuable class ratio is 1 to 100. I assume that for fraud this ratio is even worse.

And how did you work with the imbalanced classes? Did you change the loss function, or is there some other method you found useful for that case?

My target was <1 in 1000, so I filtered out the targets, duped them a bunch of times alongside a single copy of the non-targets, and then shuffled. Nothing too fancy.

I did have one useful method that may not translate to other scenarios. A lot of transactions are blocked before they get to the fraud / no-fraud decision point. A portion of the blocks are due to existing fraud rules, but 90% are for other reasons. I did a lot of training on this wider data to build my embeddings, and then fine-tuned the model with just the transactions that made it to the decision point. This helped capture granularity that would have been missing from the final data set.
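If anyone wants to try the same two-stage pattern, here is a rough fastai v1 sketch; the dataframe names, the ‘is_fraud’ column, and the layer sizes are placeholders, and it assumes both stages share the same cat_vars / cont_vars and compatible categorical encodings (otherwise the pre-trained embeddings won’t line up):

from fastai.tabular import *

# Stage 1: pre-train on the wider data (all transactions, blocked or not)
data_wide = (TabularList.from_df(df_wide, cat_names=cat_vars, cont_names=cont_vars,
                                 procs=[FillMissing, Categorify, Normalize])
             .split_by_rand_pct(0.2)
             .label_from_df(cols='is_fraud')
             .databunch())
learn = tabular_learner(data_wide, layers=[1000, 500], emb_szs=embedding_dict,
                        ps=[0.1, 0.5], emb_drop=0.4, metrics=accuracy)
learn.fit_one_cycle(3)

# Stage 2: fine-tune on only the transactions that reach the decision point.
# Caveat: the categorical codes must match stage 1 for the embeddings to stay
# meaningful -- easiest if df_decision is a filtered subset processed the same way.
data_final = (TabularList.from_df(df_decision, cat_names=cat_vars, cont_names=cont_vars,
                                  procs=[FillMissing, Categorify, Normalize])
              .split_by_rand_pct(0.2)
              .label_from_df(cols='is_fraud')
              .databunch())
learn.data = data_final                 # keep the pre-trained weights, swap the data
learn.fit_one_cycle(3, max_lr=1e-4)     # gentler learning rate for fine-tuning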

Oh, by the way, I forgot one thing. I’ve reduced the maximum embedding size to 10 (from the default cap of up to 600), like this:
# Same formula as fastai's default emb_sz_rule, but capped at 10 instead of 600
def emb_sz_rule(n_cat: int) -> int:
    return min(10, round(1.6 * n_cat**0.56))

# Build an explicit {column: embedding size} dict for every categorical variable
embedding_dict = {}
for column in cat_vars:
    embedding_dict[column] = emb_sz_rule(len(df[column].unique()))

learn = tabular_learner(data, layers=[1000, 500], ps=[0.1, 0.5], emb_drop=0.4,
                        emb_szs=embedding_dict, metrics=accuracy)

The point is, today I turned it back to 600 and it appeared to overfit significantly more.
So I think it’s worth experimenting with that (reducing the maximum embedding size, and maybe also reducing some dropout).


Hi Ralph,

Can you lend any advice on the best way to dupe the targets/non-targets? Is it as simple as just copying rows in the dataframe?

Thanks so much.

Yeah, I just filtered the targets into a new dataframe, then concatenated the original with multiple copies of the target df and shuffled.
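In code that’s roughly the following pandas sketch (the ‘is_fraud’ column name and the number of copies are placeholders):

import pandas as pd

# Pull the rare targets out of the original dataframe
targets = df[df['is_fraud'] == 1]

# One copy of the original (non-targets included) plus N extra copies of the targets
n_copies = 20   # pick so the classes end up roughly balanced
balanced_df = pd.concat([df] + [targets] * n_copies, ignore_index=True)

# Shuffle so the duplicated rows are spread through the training set
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)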

With plenty of dropout to avoid overfitting…

Thanks. I will try that; I’ve been working on a tabular dataset that’s giving me a training error in the millions.