I ran a tabular model on a large dataset of around 10 million rows (6 million train, the rest validation), and used a batch size of 10k to reduce training time (which seemed reasonable).
- highly imbalanced binary-class data (~10% positive)
- just 6 categorical features with around 10 categories each; no continuous features
Because of the imbalance, I was careful to make the ratio 50:50 while training. I got an accuracy of 84%, but precision, recall, and F1 all came out as zero, with this warning:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
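For context, here is a minimal sketch (with made-up labels, not my actual data) of when sklearn raises that warning: if the model predicts the negative class for every sample, precision is 0/0 and sklearn returns 0.0 by default unless you pass `zero_division` explicitly.

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy setup: 10% positives, and a model that predicts
# the negative class for every single sample.
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.zeros(100, dtype=int)  # no predicted positives at all

# Precision = TP / (TP + FP) = 0/0 here, so sklearn emits
# UndefinedMetricWarning and substitutes 0.0 by default.
p_default = precision_score(y_true, y_pred)
# Passing zero_division=0 gives the same value but silences the warning.
p_explicit = precision_score(y_true, y_pred, zero_division=0)
print(p_default, p_explicit)  # → 0.0 0.0
```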
I then built a test_dl, scored the model's predictions using sklearn's functions directly, and got these numbers:
- precision: 0.23889707467515692
- recall: 0.19972771038394624
- f1: 0.2175634727794161
- accuracy: 0.7811655568491515
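This is roughly how I scored the predictions. The arrays below are random stand-ins (in my case the probabilities came from the model's predictions on the test_dl), and the 0.5 threshold is the assumption I used to turn probabilities into class labels:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.10).astype(int)  # ~10% positives, like my data
probs = rng.random(1000)                        # stand-in for predicted P(positive)
y_pred = (probs >= 0.5).astype(int)             # hard labels at a 0.5 threshold

print("precision", precision_score(y_true, y_pred, zero_division=0))
print("recall   ", recall_score(y_true, y_pred, zero_division=0))
print("f1       ", f1_score(y_true, y_pred, zero_division=0))
print("accuracy ", accuracy_score(y_true, y_pred))
```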
These numbers are much lower than I got with LGBM.
So my questions are:
- Why did I get the precision warning? And why did the accuracy change when I reran the evaluation with sklearn?
- What am I doing wrong, and how do I make my model perform better?