So this essentially means that each customer ID should have multiple entries in order to calculate the loss for the customer column. Am I correct? If so, then (in general) the columns we want the system to learn should always have multiple entries for each ID.
I have another stupid question: what if all the columns in my table are categorical variables, with not a single continuous variable? I ask because the following question came up during the class:
Question: Are embeddings suitable for certain types of variables? [01:02:45]
Jeremy’s Answer: Embedding is suitable for any categorical variables. The only thing it cannot work well for would be something with too high cardinality. If you had 600,000 rows and a variable had 600,000 levels, that is just not a useful categorical variable. But in general, the third winner in this competition really decided that everything that was not too high cardinality, they put them all as categorical. The good rule of thumb is if you can make a categorical variable, you may as well because that way it can learn this rich distributed representation; whereas if you leave it as continuous, the most it can do is to try and find a single functional form that fits it well.
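To make the cardinality point concrete, here is a minimal sketch in plain Python of how one might check which columns are reasonable candidates for embeddings. The helper name `cardinality_report`, the toy rows, and the `max_ratio=0.5` cutoff are all my own assumptions for illustration; the lecture only gives the extreme case (as many levels as rows), not a specific threshold.

```python
def cardinality_report(rows, max_ratio=0.5):
    """Count distinct levels per column and flag embedding candidates.

    rows: list of dicts, one dict per table row (hypothetical toy format).
    max_ratio: assumed cutoff for levels / rows; not a value from the lecture.
    Returns {column: (number_of_levels, is_candidate)}.
    """
    n_rows = len(rows)
    report = {}
    for col in rows[0]:
        # Number of distinct values (levels) seen in this column.
        n_levels = len({row[col] for row in rows})
        # A column with roughly one level per row carries nothing to learn from.
        report[col] = (n_levels, n_levels / n_rows <= max_ratio)
    return report


# Toy table: every column is categorical, none is continuous.
rows = [
    {"customer_id": 1, "store": "A", "day_of_week": "Mon"},
    {"customer_id": 2, "store": "A", "day_of_week": "Tue"},
    {"customer_id": 3, "store": "B", "day_of_week": "Mon"},
    {"customer_id": 4, "store": "B", "day_of_week": "Tue"},
]

report = cardinality_report(rows)
for col, (n_levels, ok) in report.items():
    print(col, n_levels, "candidate" if ok else "too high cardinality")
```

In this toy table `customer_id` has one level per row, which is the situation described in the answer above, while `store` and `day_of_week` repeat across rows and so have multiple entries per level.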
So does Jeremy’s answer mean that continuous variables are essential to fit the data, or can a model work with categorical variables only?
Apologies for asking so many questions!