Lesson 6 In-Class Discussion ✅

whamp · November 28, 2018, 2:59am

Should we be concerned that FillMissing() fills with the median computed over all time periods in the train set, which could cause data leakage issues? As opposed to filling with the median by month, for example?

PierreO · November 28, 2018, 2:59am

Is this also the way to do regression with computer vision?

jcatanza · November 28, 2018, 3:00am

@crostino and @bholmer @sgugger In this case you can use the hashing trick to categorify values into a set number of bins. A test data point with a value that was not seen in the training data will be assigned to one of the existing hash bins. I’m not sure whether the fastai library has implemented this process.
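
A minimal sketch of the hashing trick described above, using only the standard library. The helper name `hash_bucket`, the bin count, and the sample values are made up for illustration; this is not a fastai API.

```python
import hashlib

def hash_bucket(value, n_bins=16):
    """Map any value to a stable bin index in [0, n_bins)."""
    # md5 is used here only because it is deterministic across runs,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_bins

# Hypothetical category values seen during training.
train_values = ["a", "b", "c"]
train_codes = [hash_bucket(v) for v in train_values]

# A value never seen in training still lands in one of the existing bins.
unseen_code = hash_bucket("d")
```

The trade-off is that unrelated values can collide in the same bin, so the number of bins is something to tune.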

sandmann · November 28, 2018, 3:00am

Is there a way to encode ordinal variables, e.g. categories that have an order (e.g. “bad”, “good”, “best”)?
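
One possible approach in pandas, as a sketch: declare the column as an ordered categorical so the integer codes follow the order you give. The sample values are made up, and whether a given fastai preprocessor keeps this ordering is not something this thread confirms.

```python
import pandas as pd

# Hypothetical ratings column with a natural order.
ratings = pd.Series(["good", "bad", "best", "good"])

# The order of `categories` defines the order of the integer codes.
ordered = pd.Categorical(ratings, categories=["bad", "good", "best"], ordered=True)

# bad -> 0, good -> 1, best -> 2
codes = ordered.codes.tolist()
```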

sgugger · November 28, 2018, 3:00am

procs is the list of tabular preprocessors you want to use (internally they’ll all be included in one single PreProcessor).

devforfu · November 28, 2018, 3:00am

So it is possible to pass another data type, instead of FloatList, right?

sgugger · November 28, 2018, 3:00am

Yes, the data block API is unified that way.

sgugger · November 28, 2018, 3:00am

You can customize the value used to fill missing.

KevinB · November 28, 2018, 3:01am

Is the “-ify” suffix a fastai thing or is that a Python thing? I know I’ve seen listify in the fastai library as well.

hiromi · November 28, 2018, 3:01am

Even though it says “float” list, the values look like integers…?

sgugger · November 28, 2018, 3:01am

There are no subcategories. It takes the full categories like [Jan,Feb] or [Feb,Mar]…

sgugger · November 28, 2018, 3:01am

If you want classification or multi-classification or something more crazy, yes.

agoldina · November 28, 2018, 3:01am

You can use get_dummies

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

This way you can create columns to track whether they contain the month or not in the promo.
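
A small sketch of that idea, assuming the promo months are stored as a comma-separated string. The column name `PromoInterval`, the `Promo_` prefix, and the sample rows are assumptions for illustration; `Series.str.get_dummies` is the string variant that splits on a separator.

```python
import pandas as pd

# Hypothetical data: each row lists the months in which the promo runs.
df = pd.DataFrame({"PromoInterval": ["Jan,Feb", "Feb,Mar", None]})

# Split on commas and build one indicator column per month.
# Missing values are filled with "" so they produce all-zero rows.
month_flags = df["PromoInterval"].fillna("").str.get_dummies(sep=",")

# Attach the indicators to the original frame with a readable prefix.
df = df.join(month_flags.add_prefix("Promo_"))
```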

crostino · November 28, 2018, 3:01am

What happens if your training dataset has no unknown variables? Will the model throw an error because it has never seen that value before? I guess a more general question is how to make your model robust enough to deal with data that it has not seen.

sgugger · November 28, 2018, 3:02am

FloatList is used because you want to do regression, even if your targets are ints.

sgugger · November 28, 2018, 3:02am

It’s a Jeremy thing, so we can say it’s a fastai thing

nithanaroy · November 28, 2018, 3:03am

Can you please elaborate on when to use RMSE versus RMSPE (root mean squared percentage error) as loss functions?
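
For reference, a plain-Python sketch of the two metrics being compared. It does not settle when to use which; it only shows how they differ. The function names and the sample numbers are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: errors measured in the target's own units."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def rmspe(y_true, y_pred):
    """Root mean squared percentage error: errors relative to each target.

    Undefined when a target is zero, so such rows must be handled separately.
    """
    n = len(y_true)
    return math.sqrt(sum(((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Both predictions are off by 10% of their target.
y_true = [100.0, 1000.0]
y_pred = [110.0, 1100.0]

abs_error = rmse(y_true, y_pred)   # dominated by the larger target
rel_error = rmspe(y_true, y_pred)  # both rows contribute equally
```

Here the RMSPE comes out at about 0.1 because every prediction is 10% off, while the RMSE is driven almost entirely by the row with the larger target.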

wonderz44 · November 28, 2018, 3:03am

Anyone else getting the error that the nb_008 module is not found in the rossman_data_clean notebook?

sgugger · November 28, 2018, 3:04am

What do you mean? If you have new fields in your dataframe, yes, the model will throw an error. If you have new values, they’ll be treated as unknown (because the model didn’t know them while training) and there won’t be any bug.
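
The behaviour for new values can be pictured with a small sketch that reserves one index for "unknown". This is an illustration of the idea only, not the fastai implementation; `fit_vocab`, `encode`, and the sample values are hypothetical.

```python
UNKNOWN = 0  # index reserved for values not seen during training

def fit_vocab(train_values):
    """Assign indices 1..n to the distinct training values."""
    return {v: i for i, v in enumerate(sorted(set(train_values)), start=1)}

def encode(values, vocab):
    """Look up each value; anything unseen falls back to UNKNOWN."""
    return [vocab.get(v, UNKNOWN) for v in values]

vocab = fit_vocab(["a", "b", "c", "a"])

# "d" was never seen in training, so it maps to the unknown index
# instead of raising an error.
codes = encode(["a", "d", "c"], vocab)
```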

Mauro · November 28, 2018, 3:04am

Why was Jeremy sure just now that we would overfit?