Should we be concerned that FillMissing() fills with the median computed over all time periods in the train set, which could cause data leakage issues? As opposed to filling with the median by month, for example?
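To make the distinction concrete, here is a small pandas sketch (toy column names, not the fastai internals): the fill statistic is computed on the training rows only and then applied to both splits, either globally or per month.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [1, 1, 2, 2, 1, 2],
    "sales": [10.0, None, 30.0, None, 50.0, 70.0],
})
train, valid = df.iloc[:4].copy(), df.iloc[4:].copy()

# Global median computed on the training rows only.
global_median = train["sales"].median()

# Per-month medians from the training rows; months missing from the
# training data fall back to the global value.
month_medians = train.groupby("month")["sales"].median()

def fill_by_month(frame):
    filled = frame["sales"].fillna(frame["month"].map(month_medians))
    return filled.fillna(global_median)

train["sales"] = fill_by_month(train)
valid["sales"] = fill_by_month(valid)
```

Either way, the key point for leakage is that the medians come from the training split and are merely *applied* to the validation/test split.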
Is this also the way to do a regression with computer vision ?
@crostino and @bholmer @sgugger In this case you can use the hashing trick to categorify values into a set number of bins. A test data point with a value that was previously unseen in the training data will be assigned to one of the existing hash bins. I'm not sure whether the fastai library has implemented this process?
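A minimal sketch of the hashing trick (plain Python, not a fastai API): hash the raw value into a fixed number of bins, so unseen test values still land in a valid bin.

```python
from zlib import crc32

N_BINS = 100  # fixed number of hash bins, chosen up front

def hash_bin(value, n_bins=N_BINS):
    # crc32 is a stable hash; Python's built-in hash() is salted per process.
    return crc32(str(value).encode()) % n_bins

# A value never seen during training still maps into one of the N_BINS bins.
unseen = hash_bin("store_never_seen_before")
```

The trade-off is collisions: distinct values can share a bin, which is usually acceptable for high-cardinality categoricals.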
Is there a way to encode ordinal variables, e.g. categories that have an order (e.g. “bad”, “good”, “best”)?
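One way to do this with pandas (a sketch, independent of fastai's own transforms) is an ordered Categorical, whose integer codes then respect the order you declare:

```python
import pandas as pd

df = pd.DataFrame({"quality": ["good", "bad", "best", "good"]})

# Declare the order explicitly; the codes follow it: bad=0, good=1, best=2.
order = ["bad", "good", "best"]
df["quality"] = pd.Categorical(df["quality"], categories=order, ordered=True)
codes = df["quality"].cat.codes
```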
procs is your list of tabular preprocessors you want to use (internally they'll all be wrapped in one single PreProcessor).
So it is possible to pass another data type, instead of FloatList, right?
Yes, the data block API is unified that way.
You can customize the value used to fill missing.
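If I recall correctly, FillMissing takes a fill strategy and a constant value; the pandas equivalent of a constant-fill strategy (with the usual missing-indicator column) looks like this sketch:

```python
import pandas as pd

df = pd.DataFrame({"dist": [100.0, None, 300.0]})

# Keep an indicator column for where the value was missing, then fill with a
# chosen constant instead of the median.
df["dist_na"] = df["dist"].isna()
df["dist"] = df["dist"].fillna(0.0)
```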
Is the “-ify” suffix a fastai thing or is that a Python thing? I know I’ve seen listify in the fastai library as well.
Even though it says “float” list, the values look like integers…?
There are no subcategories. It takes the full categories, like [Jan,Feb] or [Feb,Mar]…
If you want classification or multi-classification or something more crazy, yes.
You can use get_dummies
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
This way you can create columns tracking whether each month appears in the promo interval or not.
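For a comma-separated column like Rossmann's PromoInterval, `str.get_dummies` does the splitting and the one-hot encoding in one step (hypothetical mini-dataframe):

```python
import pandas as pd

df = pd.DataFrame({"PromoInterval": ["Jan,Apr,Jul,Oct", "Feb,May,Aug,Nov", None]})

# Splits on the delimiter and makes one 0/1 column per month, so each row
# records whether that month is part of the promo interval. Missing values
# get all zeros.
dummies = df["PromoInterval"].str.get_dummies(sep=",")
```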
What happens if your test data contains values that your training dataset has never seen? Will the model throw an error because it has never seen those values before? I guess a more general question is how to make your model robust enough to deal with data that it has not seen.
FloatList is there because you want to do regression, even if your targets are ints.
It’s a Jeremy thing, so we can say it’s a fastai thing.
Can you please elaborate when to use RMSE versus RMSPE (root mean sq. percent error) as loss functions?
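The difference is easiest to see in numpy (a sketch, not a fastai loss definition): RMSE penalizes absolute residuals, so large-valued targets dominate; RMSPE scales each residual by its true value, so errors are relative, which is why the Rossmann competition used it.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def rmspe(y_true, y_pred):
    # Each residual is divided by the true value, so missing by 10 on a
    # target of 100 counts the same as missing by 1 on a target of 10.
    # Assumes y_true has no zeros.
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))

y_true = np.array([100.0, 10.0])
y_pred = np.array([110.0, 11.0])
```

Note that RMSPE blows up near zero-valued targets, which is one practical reason to predict log(sales) and use plain RMSE in log space instead.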
Anyone else getting the error that the nb_008 module is not found in the rossman_data_clean notebook?
What do you mean? If you have new fields in your dataframe, yes, the model will throw an error. If you have new values, they’ll be treated as unknown (because the model didn’t know them while training) and there won’t be any bug.
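A sketch of how a Categorify-style transform can do this (not the actual fastai internals): build the vocabulary from the training data and reserve index 0 for "unknown", so anything unseen at inference time maps there instead of erroring.

```python
# Vocabulary built from training values only; index 0 is reserved for unknowns.
train_vals = ["a", "b", "c"]
vocab = {v: i + 1 for i, v in enumerate(sorted(set(train_vals)))}

def encode(value):
    # Unseen values fall through to the reserved 'unknown' slot.
    return vocab.get(value, 0)
```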
Why was Jeremy sure just now that we would overfit?