Predicting on the test set for submissions

@jeremy, I am trying to transform the test data based on the transformations train_cats() and proc_df() applied to the training data.

  1. Similar to apply_cats() for train_cats(), what is the corresponding transformation for proc_df()?

  2. I was just wondering: looking at apply_cats(), it seems that you need to pass both the train and test data. In a case where the test data is comparable in size to the train data, do you think this function is efficient (my AWS instance stalled because of this)?

  3. What is the best way to recover when an AWS instance stalls due to a memory issue while running a large data set?

Related to #2: this is definitely required, so inefficient or not, gotta do it.

Heh, I wonder what happens with apply_cats() when the test set has a cat value that does not exist in the training set? Time for an experiment.

Whoops! It gets NA for new categories:
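Here is a minimal plain-pandas sketch of the same experiment (the column and values are made up, and it mirrors what apply_cats() does rather than calling it):

    import pandas as pd

    # Plain-pandas sketch of what happens when the test set contains a
    # category value the training set never saw.
    train = pd.DataFrame({'color': ['red', 'blue', 'red']})
    test  = pd.DataFrame({'color': ['blue', 'green']})   # 'green' never appears in train

    train['color'] = train['color'].astype('category')

    # Force the test column onto the training categories; unseen values become NaN.
    test['color'] = pd.Categorical(test['color'],
                                   categories=train['color'].cat.categories)

    print(test['color'])            # 'green' shows up as NaN
    print(test['color'].cat.codes)  # NaN is coded as -1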

Expensive maybe, but this works:

Btw, I figured out that the AWS instance stalled not because of apply_cats(). Apparently converting numerical (int, float) columns to string columns in a large pandas data frame is expensive.

I was using (cat_col refers to the columns I want to convert to strings):
df_test[cat_col] = df_test[cat_col].astype(str)

The above results in memory issues.

This works better:
for col in cat_col:
    df_test[col] = df_test[col].astype(str)

I suggest having read_csv() do the conversion for you with the dtype arg.
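For example (a sketch; the file name and column list are placeholders):

    import pandas as pd

    # Let read_csv parse these columns as strings up front, instead of doing
    # a costly astype() conversion afterwards.
    cat_col = ['UsageBand', 'ProductSize']   # placeholder column names
    df_test = pd.read_csv('Test.csv',
                          dtype={c: str for c in cat_col},
                          low_memory=False)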

Yeah! Thanks for that.

It sets it as ‘N/A’, which will be a pandas code of -1, and then becomes 0 in proc_df.

This is probably what you want. If you include categories that aren’t in the training set, they won’t appear in your RF, which can lead to odd results.
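For concreteness, a tiny sketch of that coding (assuming proc_df numericalises categoricals as cat.codes + 1, as the course's fastai structured code does):

    import pandas as pd

    # NaN (the unseen category) gets pandas code -1; the +1 shift maps it to 0.
    codes = pd.Categorical(['blue', None, 'red'], categories=['blue', 'red']).codes
    print(codes)        # [ 0 -1  1]
    print(codes + 1)    # [ 1  0  2]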

You still use proc_df, since now that you’ve got the categories set up, you don’t need different logic for the test set. You can just add a column of zeros to be your ‘dependent variable’ since for now proc_df assumes there is one. Better still, edit proc_df to make the dependent variable optional, such that if it’s not provided as a parameter, it’s not removed/returned - and then submit a pull request to the fastai repo so everyone will benefit!
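A rough sketch of that workflow, assuming the fastai.structured helpers used in the course; df_raw, df_test, SalePrice and m are placeholders from the usual notebook setup:

    from fastai.structured import apply_cats, proc_df   # assumes the course's fastai.structured module

    # Reuse the category mappings learned on the training frame.
    apply_cats(df_test, df_raw)

    # proc_df currently expects a dependent variable, so add a dummy column of zeros.
    df_test['SalePrice'] = 0

    # Take only the processed frame; indexing [0] avoids depending on how many
    # values this version of proc_df returns.
    X_test = proc_df(df_test, 'SalePrice')[0]

    preds = m.predict(X_test)   # m: the random forest trained on the processed training data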

Hint: this is really handy for creating a PR: https://github.com/github/hub

Probably reboot using the AWS console, if there’s so much memory pressure you can’t even use ssh. If you can use ssh, just check top to see the proc id of the bad process, and kill -9 it.

BTW, it’s a good idea to set your full name in your forum preferences, so I know who you are in class! :slight_smile:

Thanks Jeremy, I actually ran proc_df on the test data the same way as on the train data, with a zero column, after posting my question. One concern is that now we will be using the test data median for imputation rather than the train data median. I guess this is just a minor consideration?

That’s a great point - I hadn’t thought of that. May well not be minor at all! We better come up with a better way to handle this…
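One possible direction, sketched in plain pandas rather than as a change to proc_df (df_raw, df_test and the column name are placeholders): compute the medians on the training frame and reuse them when filling the test frame.

    # Hypothetical numeric columns with missing values.
    num_cols = ['MachineHoursCurrentMeter']

    # Medians learned from the *training* data only.
    train_medians = df_raw[num_cols].median()

    for c in num_cols:
        df_test[c + '_na'] = df_test[c].isnull()          # keep a missing-value indicator, like proc_df does
        df_test[c] = df_test[c].fillna(train_medians[c])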