Lesson 6 In-Class Discussion ✅

Should we be concerned that FillMissing() fills with the median over all time periods in the train set, which could cause data leakage issues? As opposed to filling with the median by month, for example?

1 Like
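
To make the per-month alternative in the question above concrete, here is a minimal sketch in pandas. It is not what FillMissing() does internally; the column names `month` and `distance` and the helper `fill_by_month` are hypothetical. The medians are computed from the training rows only and then reused for the validation rows, with the overall training median as a fallback for months never seen in training.

```python
import pandas as pd

nan = float("nan")

# Hypothetical data: "month" and "distance" are illustrative column names.
train = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2],
    "distance": [10.0, nan, 30.0, 100.0, 300.0, nan],
})
valid = pd.DataFrame({"month": [1, 2, 3], "distance": [nan, 50.0, nan]})

# Statistics come from the training rows only.
global_median = train["distance"].median()
monthly_median = train.groupby("month")["distance"].median()

def fill_by_month(df):
    """Fill missing distances with the training median of the same month."""
    out = df.copy()
    # Months not present in training fall back to the overall training median.
    fallback = out["month"].map(monthly_median).fillna(global_median)
    out["distance"] = out["distance"].fillna(fallback)
    return out

train_filled = fill_by_month(train)
valid_filled = fill_by_month(valid)
print(valid_filled)
```

Whether this helps compared with a single median is an empirical question for the dataset at hand.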

Is this also the way to do a regression with computer vision?

2 Likes

@crostino and @bholmer @sgugger In this case you can use the hashing trick to categorify values into a fixed number of bins. A test data point with a value that was not seen in the training data will be assigned to one of the existing hash bins. Not sure if the fastai library has implemented this process?

3 Likes
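
A minimal sketch of the hashing trick described above, using only the standard library. The function name `hash_bin` and the choice of 8 bins are assumptions for illustration, and this is not a fastai API.

```python
import hashlib

def hash_bin(value, n_bins=8):
    """Map any value to a stable bin index in the range [0, n_bins)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_bins

train_values = ["a", "b", "c"]
train_bins = [hash_bin(v) for v in train_values]

# A value never seen in training still lands in one of the existing bins.
unseen_bin = hash_bin("z")
print(train_bins, unseen_bin)
```

hashlib is used instead of the built-in `hash()` because string hashes from `hash()` change between Python processes, which would make the bins inconsistent between training and inference. The trade-off is that unrelated values can collide in the same bin.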

Is there a way to encode ordinal variables, e.g. categories that have an order (e.g. “bad”, “good”, “best”)?

4 Likes
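
For the ordinal question above, one option in plain pandas (outside of fastai) is an ordered categorical, sketched below. The category names are taken from the question; the integer codes follow the order given in `categories`.

```python
import pandas as pd

ratings = pd.Series(["good", "bad", "best", "good"])

# The order of `categories` defines the ranking: bad < good < best.
ordered = pd.Categorical(ratings, categories=["bad", "good", "best"], ordered=True)
codes = list(ordered.codes)

# Values outside the declared categories get the code -1.
unknown_codes = list(pd.Categorical(["meh"], categories=["bad", "good", "best"], ordered=True).codes)
print(codes, unknown_codes)
```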

procs is the list of tabular preprocessors you want to use (internally they’ll all be included in one single PreProcessor).

So it is possible to pass another data type, instead of FloatList, right?

1 Like

Yes, the data block API is unified that way.

1 Like

You can customize the value used to fill missing.

1 Like

Is the “-ify” suffix a fastai thing or a Python thing? I know I’ve seen listify in the fastai library as well.

Even though it says “float” list, the values look like integers…?

1 Like

There are no subcategories. It takes the full categories like [Jan,Feb] or [Feb,Mar]…

If you want classification or multi-classification or something more crazy, yes.

1 Like

You can use get_dummies

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

This way you can create columns that track whether or not the promo contains each month.

4 Likes
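
A minimal sketch of the month-indicator idea above. It assumes the promo months are stored as a comma-separated string in a column named `PromoInterval` (an assumption about the data layout), and it uses `Series.str.get_dummies`, the string-splitting relative of the `pandas.get_dummies` function linked above.

```python
import pandas as pd

# Hypothetical rows: each promo lists its months as a comma-separated string.
df = pd.DataFrame({"PromoInterval": ["Jan,Apr,Jul,Oct", "Feb,May,Aug,Nov", None]})

# One 0/1 column per month; rows without a promo become all zeros.
month_flags = df["PromoInterval"].fillna("").str.get_dummies(sep=",")
print(month_flags)
```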

What happens if your training dataset has no unknown variables? Will the model throw an error because it has never seen that value before? I guess a more general question is how to make your model robust enough to deal with data that it has not seen.

1 Like

FloatList is because you want to do regression. Even if your targets are ints.

3 Likes

It’s a Jeremy thing, so we can say it’s a fastai thing :wink:

6 Likes

Can you please elaborate on when to use RMSE versus RMSPE (root mean squared percentage error) as loss functions?

4 Likes
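
As background for the RMSE versus RMSPE question above, here is a small sketch of both metrics in plain Python. The sample numbers are made up; they only show that RMSE is dominated by the large-valued target, while RMSPE treats a 10% miss the same at any scale.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmspe(y_true, y_pred):
    """Root mean squared percentage error; undefined when a target is zero."""
    return math.sqrt(sum(((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Both predictions are off by 10% of the target.
y_true = [100.0, 1000.0]
y_pred = [110.0, 1100.0]

print(rmse(y_true, y_pred))   # driven mostly by the error of 100 on the second item
print(rmspe(y_true, y_pred))  # both items contribute equally
```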

Is anyone else getting the error that the nb_008 module is not found in the rossman_data_clean notebook?

5 Likes

What do you mean? If you have new fields in your dataframe, yes, the model will throw an error. If you have new values, they’ll be treated as unknown (because the model didn’t see them during training) and there won’t be any bug.
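
The "treated as unknown" behaviour described above can be sketched in plain Python as follows. This is an illustration of the idea, not fastai's actual code; the names `build_vocab`, `encode`, and `UNKNOWN` are hypothetical.

```python
UNKNOWN = 0  # index reserved for any value not seen during training

def build_vocab(values):
    """Assign indices 1..n to the distinct training values."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}

def encode(values, vocab):
    """Look up each value, falling back to UNKNOWN instead of raising."""
    return [vocab.get(v, UNKNOWN) for v in values]

vocab = build_vocab(["Mon", "Tue", "Wed", "Tue"])
codes = encode(["Tue", "Sun"], vocab)  # "Sun" was never seen in training
print(vocab, codes)
```

Because the unknown index always exists, its embedding row is available at inference time even if no training row used it, although in that case it has not been trained on any data.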

Why was Jeremy sure just now that we would overfit?

1 Like