Tabular learner with one category different distribution

burgalon · July 21, 2021, 3:51pm

Hi all,

I’m trying to use the Tabular learner to fit a simple CSV with columns along the lines of date, user_id, and clicks per month.

The CSV also contains aggregated rows holding the total clicks per month, where user_id=null. When I don’t filter these rows out of the dataset, the MSE is 10 times higher, as if the model isn’t able to pick up the different distribution of this one category.

Any suggestions?

JackByte · July 21, 2021, 7:56pm

Hi @burgalon,

Just a quick thought that came to mind: the other users might be clicking on, say, 100 of 365 days on average, while the one user null only has 12 of 365 days. An embedding shouldn’t be messed up by this, as far as I understand it. But some other features could be affected by this strange user who visits the page seldom but then goes crazy each time.

Oh, and another thought… wouldn’t that be a good feature instead of a record? You could add the total clicks per month to the columns, based on the month the date belongs to.
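
A rough sketch of that idea in pandas, in case it helps. The column names (date, user_id, clicks) and the toy numbers are assumptions based on the description above, not the real CSV:

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions.
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-01",
             "2021-02-01", "2021-02-01", "2021-02-01"],
    "user_id": ["a", "b", None, "a", "b", None],
    "clicks": [10, 30, 40, 5, 15, 20],
})

# Split the aggregated rows (user_id is null) from the per-user rows.
is_total = df["user_id"].isna()
totals = df.loc[is_total, ["date", "clicks"]].rename(columns={"clicks": "total_clicks"})
per_user = df.loc[~is_total]

# Attach the monthly total to every per-user row as an extra column.
features = per_user.merge(totals, on="date", how="left")
```

One thing to watch: if clicks is the target, the total for the same month already contains it, so a lagged total (for example the previous month) is probably the safer feature.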

Cheers

burgalon · July 23, 2021, 10:14am

Hey @JackByte, thank you for your insights. Those are good points, and for some users the data may well be sparse and noisy. Since this is a well-known problem, I wonder what the best practice is in this case. I might need to engineer another proxy aggregated feature, say “country” instead of the specific user… but then again, that would make for a less interesting model, and I’d also expect the embedding to catch common patterns such as country.

I’m wondering if I should normalize the aggregated data and the per-user samples separately. Maybe this will help the model generalize, as these samples do come from different distributions…
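
Something like the following is what I have in mind, as a minimal pandas sketch rather than anything tested on the real data. The column names, the toy values, and the is_total flag are all made up for illustration:

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions.
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-01",
             "2021-02-01", "2021-02-01", "2021-02-01"],
    "user_id": ["a", "b", None, "a", "b", None],
    "clicks": [10, 30, 40, 5, 15, 20],
})

# Flag the aggregated rows so each group gets its own statistics.
df["is_total"] = df["user_id"].isna()

# Standardize clicks within each group (aggregated vs. per-user).
grouped = df.groupby("is_total")["clicks"]
df["clicks_norm"] = (df["clicks"] - grouped.transform("mean")) / grouped.transform("std")
```

The is_total flag could also be passed to the model as a categorical column, and if clicks_norm ends up being the target, predictions would have to be mapped back using the mean and std of the matching group.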