Structured.py - proc_df function

I have a few questions related to the proc_df function in structured.py

  1. The docstring does not contain an explanation of the mapper argument. How is the mapper argument used in this function?

  2. The docstring indicates that three variables are returned, but the Rossman notebook indicates that mapper also needs to be unpacked. If I unpack the return value into three variables it raises `ValueError: too many values to unpack (expected 3)`. Does the docstring need to be updated, or am I misinterpreting the function?

I think you need to git pull the latest and then …

> conda env update
> conda update --all

The codebase is under constant development and the introduction of the mapper is rather new.

The mapper variable tells you the values (mean and standard deviation) used to scale each variable. It comes back because the Rossman example uses do_scale=True; if you set it to False, proc_df returns only three variables. You can see this in the result (the res variable) inside the proc_df function: mapper is appended only when do_scale=True.

```python
res = [pd.get_dummies(df, dummy_na=True), y, na_dict]
if do_scale: res = res + [mapper]
return res
```
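For example (a quick sketch based on the Rossman notebook; `train_df` and the `'Sales'` field are just stand-ins for whatever DataFrame and dependent variable you're using):

```python
from fastai.structured import proc_df

# with scaling: four values come back, the last being the mapper
df_trn, y_trn, nas, mapper = proc_df(train_df, 'Sales', do_scale=True)

# without scaling: only three values, so unpacking into four would fail
df_trn, y_trn, nas = proc_df(train_df, 'Sales')
```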

The fastai library is also used for classical machine learning, which normally doesn't need scaling of variables, and I think the docstring was written from that perspective. Applying neural nets, though, does need normalization. Yes, you are right, it needs to be updated.


Thanks for the nice explanation. A PR for the docstring would be most welcome!

Just did that… (my first PR)


Thank you @soorajviraat. Is it correct that the mapper variable is required to make predictions on an unscaled set of data? If so, how is this done? Let me know if there is a set of lectures in the ML course that addresses these questions.

It’s required to scale the test data according to what was done at training time…
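Roughly like this (a sketch following the Rossman notebook; the DataFrame and field names are placeholders): you pass the na_dict and mapper captured from the training run back into proc_df when processing the test set, so the same NA fills and mean/std are reused:

```python
# training set: capture the NA fills and the scaling mapper
df_trn, y_trn, nas, mapper = proc_df(train_df, 'Sales', do_scale=True)

# test set: reuse them so it is scaled exactly as the training data was
df_test, _, nas, mapper = proc_df(test_df, 'Sales', do_scale=True,
                                  na_dict=nas, mapper=mapper)
```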

Thanks for the PR @ecdrid!

My pleasure…

Changed it a bit further to make things more transparent.

To use other scalers, would we basically have to make another version of proc_df?
If I’ve understood correctly, you wouldn’t want to use the proc_df scaler with a non-normal feature or one with outliers. Right/wrong?
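Not necessarily a whole new proc_df. One option (just a sketch, not something in the library; proc_df's own scaling is done with sklearn_pandas's DataFrameMapper and StandardScaler, so you can do the same step yourself with a different scaler) is to call proc_df without scaling and then apply your own mapper to the continuous columns you choose (the column names below are made-up examples):

```python
from fastai.structured import proc_df
from sklearn.preprocessing import RobustScaler
from sklearn_pandas import DataFrameMapper

# let proc_df handle categoricals and NAs, but skip its StandardScaler
df_trn, y_trn, nas = proc_df(train_df, 'Sales')

# build and fit our own mapper over the continuous columns we care about
cont_cols = ['CompetitionDistance', 'Max_TemperatureC']  # placeholders
mapper = DataFrameMapper([([c], RobustScaler()) for c in cont_cols])
df_trn[cont_cols] = mapper.fit_transform(df_trn)

# reuse the fitted mapper on the test set, as above
df_test, _, _ = proc_df(test_df, 'Sales', na_dict=nas)
df_test[cont_cols] = mapper.transform(df_test)
```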

Can anyone explain how the techniques in the ML course lesson 2 notebook (Random Forest interpretation) would be affected by the use of a scaler? (I'm assuming they wouldn't be, since the trees split on everything, but if we drop features based on unscaled data, might we miss things other models can see?)

I came across sklearn's RobustScaler, which seems to me like it would be good to use with data of varying distributions and quality. I'm still jumping around (and repeating) the DL and ML lectures, but as far as I can tell, this concept hasn't come up.
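For what it's worth, here's a tiny standalone comparison (not from the lectures, just a sketch) of how the two scalers react to a single outlier:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# one feature with a single extreme outlier
x = np.array([[1.], [2.], [3.], [4.], [1000.]])

print(StandardScaler().fit_transform(x).ravel())
# the outlier inflates the mean and std, so the "normal" values all land near -0.5

print(RobustScaler().fit_transform(x).ravel())
# median/IQR scaling keeps the normal values spread out (-1 to 0.5) and the outlier obvious
```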