Predicting on the test set for submissions

@jeremy, I am trying to transform test data based on the transformations train_cats( ) and proc_df( ) done on training data.

  1. Similar to apply_cats( ) for train_cats( ) what is the applicable transformation with proc_df( )?

  2. I was just wondering: looking at apply_cats( ), it seems that you need to pass both the train and test data. When the test data is about as large as the train data, do you think this function is efficient (my AWS instance stalled because of this)?

  3. What is the best way to recover when an AWS instance stalls due to a memory issue while running on a large data set?

Related to #2 this is definitely required so inefficient or not, gotta do it.

Heh, I wonder what happens with apply_cats() when the test set has a cat value that does not exist in the training set? Time for an experiment.

Whoops. It gets NA for new categories:

Expensive maybe, but this works:

Btw, I figured out that the AWS instance did not stall because of apply_cats( ). Apparently converting numerical (int, float) columns to string columns in a large pandas DataFrame is expensive.

I was using the following (cat_col is the list of columns I want to convert to strings):
df_test[cat_col] = df_test[cat_col].astype(str)

The above results in memory issues.

This works better:
for col in cat_col:
    df_test[col] = df_test[col].astype(str)

I suggest having read_csv() do the conversion for you with dtype arg.
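
For anyone wanting to try this, here is a minimal sketch of letting read_csv() do the conversion. The column names and the inline CSV are made up for illustration; with a real file you would pass its path instead of the StringIO object.

```python
import io
import pandas as pd

# Made-up CSV standing in for the real test file.
csv_text = "store_id,sales\n007,10.5\n012,3.0\n007,8.25\n"

# Hypothetical list of columns that should be treated as strings.
cat_col = ["store_id"]

# Read the listed columns as strings directly, so no astype(str) pass is needed later.
df_test = pd.read_csv(io.StringIO(csv_text), dtype={col: str for col in cat_col})

print(df_test["store_id"].tolist())  # ['007', '012', '007']
```

A side benefit is that values such as 007 keep their leading zeros, which would be lost if the column were first parsed as an integer.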

Yeah! thanks for that

It sets it as ‘N/A’, which pandas gives a category code of -1, and that then becomes 0 in proc_df.
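
Here is a small pandas-only sketch of that behaviour. It does not call apply_cats or proc_df; it assumes they amount to re-encoding the test column with the training categories and then adding 1 to the codes, as described above. The data is made up.

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "purple"]})  # "purple" never appears in train

# Fix the category list from the training column (sorted: blue, green, red).
train["color"] = train["color"].astype("category")
train_categories = train["color"].cat.categories

# Re-encode the test column using only the training categories.
test["color"] = pd.Categorical(test["color"], categories=train_categories)

codes = test["color"].cat.codes   # unseen values get code -1
shifted = codes + 1               # after adding 1, the unseen value maps to 0

print(test["color"].isna().tolist())  # [False, True]
print(codes.tolist())                 # [1, -1]
print(shifted.tolist())               # [2, 0]
```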

This is probably what you want. If you include categories that aren’t in the training set, they won’t appear in your RF, which can lead to odd results.

You still use proc_df, since now that you’ve got the categories set up, you don’t need different logic for the test set. You can just add a column of zeros to be your ‘dependent variable’ since for now proc_df assumes there is one. Better still, edit proc_df to make the dependent variable optional, such that if it’s not provided as a parameter, it’s not removed/returned - and then submit a pull request to the fastai repo so everyone will benefit!
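
To make the "optional dependent variable" idea concrete, here is a rough sketch. split_xy is a hypothetical helper written for this post, not fastai's proc_df, and it only shows the optional-parameter pattern; the column names are made up.

```python
import pandas as pd

def split_xy(df, y_fld=None):
    """Hypothetical helper (not fastai's proc_df): return (X, y), with y=None if y_fld is omitted."""
    df = df.copy()
    y = None
    if y_fld is not None:
        # Only pull out and drop the dependent variable when one is named.
        y = df[y_fld].values
        df = df.drop(columns=[y_fld])
    return df, y

train = pd.DataFrame({"area": [50, 80], "price": [100, 160]})
test = pd.DataFrame({"area": [65]})

X_train, y_train = split_xy(train, "price")  # training set: y is split off
X_test, y_test = split_xy(test)              # test set: no dummy column of zeros needed
```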

Hint: this is really handy for creating a PR:

Probably reboot using the AWS console, if there’s so much memory pressure you can’t even use ssh. If you can use ssh, just check top to see the proc id of the bad process, and kill -9 it.

BTW, it’s a good idea to set your full name in your forum preferences, so I know who you are in class!

Thanks Jeremy, I actually used proc_df on the test data the same way as on the train data, with a zero column, after posting my question. One concern is that we will now be using the test data median for imputation rather than the train data median. I guess this is just a minor consideration?

That’s a great point - I hadn’t thought of that. May well not be minor at all! We’d better come up with a better way to handle this…
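
One possible way to handle it, sketched with plain pandas rather than proc_df: compute the medians on the training set once and reuse them to fill missing values in the test set. The data and column names below are made up, and this is only an illustration of the idea, not how the library does it.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0],
                      "income": [10.0, np.nan, 30.0, 50.0]})
test = pd.DataFrame({"age": [np.nan, 100.0],
                     "income": [200.0, np.nan]})

# Medians come from the training data only.
train_medians = train.median()

# Fill both sets with the same training medians.
train_filled = train.fillna(train_medians)
test_filled = test.fillna(train_medians)

print(test_filled["age"].tolist())     # [30.0, 100.0]
print(test_filled["income"].tolist())  # [200.0, 30.0]
```

With the test set's own medians, the missing age would have been filled with 100.0 and the missing income with 200.0, so the two approaches can differ a lot when the test distribution is skewed.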