TabularPandas: Inconsistent row order after Categorify with random split

Pablo · May 11, 2023, 7:25am

Hi tombucky, I’m not sure that’s entirely the case. Categorify does turn your values into integers, but there’s more to it. Crucially, columns treated as categorical get embeddings, while continuous columns do not. The point of using

cat_names=['nacol','col2'],

is precisely this: some columns look numerical but are actually categorical (e.g., day of the week from 1 to 7).

Also, looking closely, things seem to be working as intended. Note your splits:

(splits)
([3, 4, 0, 1], [2])

Example 2 (2.0 2 50 30) is in the validation set, which means the nacol value 2.0 does not appear in any training example, so it should be treated as unknown (same as NaN, so index 0 for the embedding). Try adding more rows with repeated values to test this.
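
To make that concrete, here is a rough sketch in plain pandas of the idea (this is not fastai's actual implementation, just the same logic as I understand it). The DataFrame values are made up for illustration; only row 2 with nacol 2.0 and the splits come from your post. The vocabulary is built from the training rows only, and index 0 is reserved for anything missing or unseen:

```python
import pandas as pd

# Hypothetical data: only row 2 (nacol = 2.0) and the splits mirror the post.
df = pd.DataFrame({"nacol": [1.0, 1.0, 2.0, 3.0, None]})
train_idx, valid_idx = [3, 4, 0, 1], [2]

# Build the vocabulary from the training rows only.
# Index 0 is reserved for missing or unseen values.
train_values = df.loc[train_idx, "nacol"].dropna().unique()
vocab = {value: i + 1 for i, value in enumerate(sorted(train_values))}

# Encode every row with the train-only vocabulary; anything not in it
# (NaN, or a value seen only in validation) falls back to 0.
codes = df["nacol"].map(vocab).fillna(0).astype(int)
print(codes.tolist())
```

With this toy data, row 2 ends up with the same code as the NaN row, because 2.0 never shows up in the training indices. If you add another row with nacol 2.0 that lands in train, it should get its own index instead.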

However, I would expect you to get a new column “nacol_isna” or something like that, but I can’t say off the top of my head why you didn’t.