Help recovering original data after proc_df() transformations

whamp · July 10, 2018, 9:50pm

I’ve used proc_df to help standardize and label encode my dataframe but now that I have a working model I need to transform the predictions back into their original units so I can use them downstream in my process. How do I run an inverse_transform() on the mapper object that gets created in the scale_vars() function called by proc_df? Does anyone have experience with this?

I’ve checked the sklearn-pandas docs for any info on this, but they don’t cover inverting the transformations: https://github.com/scikit-learn-contrib/sklearn-pandas

After working hard to clean the data and build a successful model, I’m definitely surprised to find this is the part I’m stuck on!

Patrick · July 10, 2018, 11:37pm

Read back in your dataset?

It looks like proc_df does a couple of operations (subsampling and dropping ignore_flds) in place before copying the data to avoid overwriting. I’m not sure why it doesn’t make a copy first, so that your dataset is only overwritten if you explicitly do so yourself by giving the output of proc_df the same name as its df argument. I would be surprised to find that proc_df altered your original dataset, other than with a mapper, unless you overwrote your original dataset.

whamp · July 11, 2018, 1:26am

Hey, thanks for the comment. I realized I should have been clearer. I’m trying to reverse, on my predictions, the same transformations that were applied to my training and test sets. For my purposes I made sure there weren’t any NAs, so I don’t need to worry about new NA columns or blanks filled with the median. So the only things I really need to reverse are the LabelEncoder() that gets called on all categorical features (I didn’t one-hot anything because I used an embedding matrix in my neural net) and the StandardScaler() that got called on my continuous features. In other words, I need the mean and standard deviation of each continuous column from the training set, which were then applied to the test set, so I can multiply by the standard deviation and add back the mean. The problem is that I can’t find those values in the mapper that is created in proc_df, and the inverse_transform() function doesn’t work on the mapper object either. That leaves me with the option of manually standardizing all the columns and retraining my model from scratch, which I would rather not do.
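
For reference, the arithmetic described above can be sketched without any fitted mapper at all, as long as the training statistics are available. This is a minimal sketch, not proc_df's own code: the column values, the class names, and the helper functions `standardize` and `unstandardize` are made up for illustration. It assumes the scaling was a plain (x - mean) / std with the population standard deviation, which is what StandardScaler computes, and that label codes index into a sorted list of classes, which is how LabelEncoder orders them.

```python
import numpy as np

# Hypothetical training column; the statistics must come from the same
# training data that the scaler was fitted on.
train_col = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
mean = train_col.mean()
std = train_col.std()  # population std (ddof=0), matching StandardScaler


def standardize(x, mean, std):
    """Forward transform: centre on the mean, divide by the std."""
    return (x - mean) / std


def unstandardize(z, mean, std):
    """Inverse transform: multiply by the std, add back the mean."""
    return z * std + mean


scaled = standardize(train_col, mean, std)
restored = unstandardize(scaled, mean, std)

# Label encoding: integer code i stands for classes[i], classes sorted.
classes = np.array(sorted({"red", "green", "blue"}))
codes = np.searchsorted(classes, ["green", "red", "blue"])  # encode
decoded = classes[codes]                                    # decode

print(restored)
print(codes, decoded)
```

If the encoder that produced the codes applies an offset (for example, to reserve a code for missing values), that offset has to be subtracted before indexing into the class list.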

ecdrid · July 11, 2018, 4:01am

It’s going to be a lot of hectic work, with no guarantee that it will get you the same cols…

Here’s how I do it:

First, create a temporary target column in your test set as well.
After that, run proc_df on df_test, making sure to pass the NAs and the mapper retrieved from the train set.
And you will get what you want…

But be careful: it can also create a lot of unnecessary cols (it generally does, and you then have to drop them manually).
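
The idea behind passing the train-set NAs and mapper to the test set can be illustrated with plain pandas. This is only a sketch of the general pattern (fit the statistics on train, reuse them on test); `fit_preproc` and `apply_preproc` are hypothetical helpers, not fastai functions, and the data is made up.

```python
import pandas as pd

# Hypothetical data: one continuous column with a missing value in each set.
train = pd.DataFrame({"x": [1.0, 2.0, None, 5.0]})
test = pd.DataFrame({"x": [3.0, None]})


def fit_preproc(df, cols):
    """Learn the fill value and scaling statistics from the training set."""
    stats = {}
    for c in cols:
        median = df[c].median()           # NaNs are skipped by default
        filled = df[c].fillna(median)
        stats[c] = {
            "median": median,
            "mean": filled.mean(),
            "std": filled.std(ddof=0),    # population std, as StandardScaler uses
        }
    return stats


def apply_preproc(df, stats):
    """Apply statistics learned on train to any dataframe (returns a copy)."""
    out = df.copy()
    for c, s in stats.items():
        out[c] = (out[c].fillna(s["median"]) - s["mean"]) / s["std"]
    return out


stats = fit_preproc(train, ["x"])
train_scaled = apply_preproc(train, stats)
test_scaled = apply_preproc(test, stats)  # test never contributes statistics
print(stats)
print(test_scaled)
```

Because the statistics are kept in an ordinary dict, they stay available later for undoing the scaling.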

whamp · July 11, 2018, 11:37pm

OK, I’ll give that a try, thanks.

ecdrid · July 13, 2018, 7:27pm

So did it work?

whamp · July 13, 2018, 8:44pm

It didn’t, although I may have done it incorrectly.

I did get it working though; here’s what I did instead.

I processed the test set the same way as the train set, right up to the point where it gets fed into proc_df. Then I made a copy of the dataframe under a different name. This wasn’t ideal because the files are huge, but it still worked. After training and generating predictions, I just appended the predictions to the copied dataframe that hadn’t been scaled by proc_df. It turns out from the code that the y_flds don’t actually get scaled at all, so this way I didn’t have to recover the unscaled feature values, and the y values were already in the right units and didn’t need rescaling back from StandardScaler().
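
A minimal sketch of that workaround in plain pandas follows. The dataframe contents, the scaling line, and `fake_predict` are stand-ins made up for illustration (they take the place of proc_df and the real model), and the approach assumes the row order is unchanged between the copy and the predictions.

```python
import pandas as pd

# Hypothetical test set, already cleaned the same way as the train set.
test = pd.DataFrame({"store": ["a", "b", "c"],
                     "price": [100.0, 250.0, 175.0]})

# Keep an untouched copy BEFORE any scaling (DataFrame.copy is deep by default).
test_raw = test.copy()

# Stand-in for the scaling step that proc_df would perform.
test_scaled = test.copy()
test_scaled["price"] = (test["price"] - test["price"].mean()) / test["price"].std(ddof=0)


def fake_predict(df):
    """Hypothetical model: returns one prediction per row, in row order."""
    return [1.5, 2.5, 3.5][:len(df)]


# Append predictions to the unscaled copy; the assignment is positional,
# so it relies on the rows not having been reordered or dropped.
test_raw["prediction"] = fake_predict(test_scaled)
print(test_raw)
```

The cost is holding a second copy of the data in memory, as noted above for large files.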

ecdrid · July 14, 2018, 5:09pm

That’s a brilliant way!!
Actually, proc_df also breaks when the NAs are in the test set rather than the train set…

sravyaysk · October 4, 2019, 8:47am

Hi, I did it the same way you did. But what if y_flds is also transformed by proc_df?
I.e., how do I recover the labels to be predicted after the proc_df() transformation?
Since I used proc_df(), my predictions come out as numeric values. How do I recover the corresponding text labels?