I ran a tabular model on a large dataset of around 10 million rows (6 million train, the rest validation), and used a batch size of 10k to reduce training time (which seemed reasonable).
- highly imbalanced binary-class data (~10% positive)
- just 6 categorical features with around 10 categories each; no continuous features
Because of the imbalance, I was careful to make the ratio 50:50 while training. I got an accuracy of 84%, but precision, recall, and F1 all came out as zero, with this warning:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
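For context, here is a minimal sketch (with made-up labels, not my actual data) of when sklearn raises that warning: if the model predicts the negative class for every sample, precision is 0/0 and sklearn returns 0.0 by default unless you pass `zero_division` explicitly.

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy setup: 10% positives, and a model that predicts
# the negative class for every single sample.
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.zeros(100, dtype=int)  # no predicted positives at all

# Precision = TP / (TP + FP) = 0/0 here, so sklearn emits
# UndefinedMetricWarning and substitutes 0.0 by default.
p_default = precision_score(y_true, y_pred)
# Passing zero_division=0 gives the same value but silences the warning.
p_explicit = precision_score(y_true, y_pred, zero_division=0)
print(p_default, p_explicit)  # → 0.0 0.0
```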
I then built a test_dl, scored the model's predictions using sklearn's functions directly, and got these numbers:
- precision: 0.23889707467515692
- recall: 0.19972771038394624
- f1: 0.2175634727794161
- accuracy: 0.7811655568491515
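This is roughly how I scored the predictions. The arrays below are random stand-ins (in my case the probabilities came from the model's predictions on the test_dl), and the 0.5 threshold is the assumption I used to turn probabilities into class labels:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.10).astype(int)  # ~10% positives, like my data
probs = rng.random(1000)                        # stand-in for predicted P(positive)
y_pred = (probs >= 0.5).astype(int)             # hard labels at a 0.5 threshold

print("precision", precision_score(y_true, y_pred, zero_division=0))
print("recall   ", recall_score(y_true, y_pred, zero_division=0))
print("f1       ", f1_score(y_true, y_pred, zero_division=0))
print("accuracy ", accuracy_score(y_true, y_pred))
```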
These numbers are much lower than I got with LGBM.
So my questions are:
- Why did I get the precision warning? And why did the accuracy change when I reran the evaluation with sklearn?
- What am I doing wrong, and how do I make my model perform better?