Help recovering original data after proc_df() transformations


(Will) #1

I’ve used proc_df to help standardize and label encode my dataframe but now that I have a working model I need to transform the predictions back into their original units so I can use them downstream in my process. How do I run an inverse_transform() on the mapper object that gets created in the scale_vars() function called by proc_df? Does anyone have experience with this?

I’ve checked the sklearn-pandas docs to see if there is any info on this, but they don’t cover inverting the transformations: https://github.com/scikit-learn-contrib/sklearn-pandas

After working hard to clean the data and build a successful model, I’m definitely surprised to find this to be the part I’m stuck on!


#2

Read back in your dataset?

It looks like proc_df does a couple of operations (subsampling and dropping ignore_flds) in place before copying the data to avoid overwriting. I’m not sure why it doesn’t make a copy first, so that your dataset is only overwritten if you explicitly assign the output of proc_df to the same name as its df argument. I would be surprised to find that proc_df altered your original dataset, other than with a mapper, unless you overwrote your original dataset yourself.


(Will) #3

Hey, thanks for the comment. I realized I should have been clearer. I’m trying to reverse, on my predictions, the transformations that were applied to my training and test sets. For my purposes I made sure there weren’t any NAs, so I don’t need to worry about any new NA columns or any blanks filled with the median. So the only things I really need to reverse are the LabelEncoder() that gets called on all categorical features (I didn’t one-hot anything because I used an embedding matrix in my neural net) and the StandardScaler() that got called on my continuous features. In other words, I need the mean and standard deviation of each continuous column from the training set (the same values that were then applied to the test set) so I can multiply by the standard deviation and add back the mean. The problem is that I can’t find those values in the mapper that is created in proc_df, and the inverse_transform() function doesn’t work on the mapper object either. That leaves me with the option of manually standardizing all the columns and retraining my model from scratch, which I would rather not do.
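For reference, here is a minimal sketch of the arithmetic described above, using scikit-learn directly rather than proc_df. The toy arrays and variable names are made up for illustration. A fitted StandardScaler exposes its statistics as mean_ and scale_, and both StandardScaler and LabelEncoder have their own inverse_transform(). Whether the individual fitted transformers can be pulled back out of the mapper returned by proc_df (for example through an attribute such as built_features on a sklearn-pandas DataFrameMapper) is an assumption that would need checking against the installed version.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy training data standing in for one continuous and one categorical column.
train_cont = np.array([[10.0], [20.0], [30.0], [40.0]])
train_cat = np.array(["red", "green", "blue", "green"])

# Fit on the training data only, as would be done before modelling.
scaler = StandardScaler().fit(train_cont)
encoder = LabelEncoder().fit(train_cat)

# The fitted statistics live on the scaler itself.
mean, std = scaler.mean_[0], scaler.scale_[0]

# Forward transform, then undo it two ways.
scaled = scaler.transform(train_cont)
manual = scaled * std + mean                 # multiply by std, add back the mean
builtin = scaler.inverse_transform(scaled)   # same result via the scaler

# Label codes back to the original category strings.
codes = encoder.transform(train_cat)
labels = encoder.inverse_transform(codes)
```

If the scalers can be recovered per column, the same two lines (multiply by scale_, add mean_) would apply to each continuous column in turn.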


(Aditya) #4

It’s going to be a lot of hectic work, with no guarantee that it will get you the same columns…

Here’s how I do it,

First, create a temporary target column in your test set as well.
After that, run proc_df on df_test, making sure to pass the NAs and the mapper retrieved from the train set.
And you will get what you want…

But be careful: it can create a lot of unnecessary columns as well (it generally does, and you then have to drop them manually).


(Will) #5

OK, I’ll give that a try, thanks.


(Aditya) #6

So did it work?


(Will) #7

It didn’t, although I may have done it incorrectly.

I did get it working though; here’s what I did instead.

I processed the test set the same way as the train set, all the way up to the point where it gets fed into proc_df. Then I made a copy of the dataframe under a different name. This wasn’t ideal because the files are huge, but it still worked. Then, after training and generating predictions, I just appended the predictions to the copied dataframe that wasn’t scaled with proc_df. It turns out in the code that the y_flds actually don’t get scaled at all, so this way I didn’t have to recover the unscaled feature values, and the y values were already in the right units and didn’t need rescaling back from StandardScaler().
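A rough sketch of that workaround in pandas is below. The column names, the scaling line, and the prediction values are hypothetical stand-ins for the proc_df call and the model, which are not reproduced here. It assumes, as described above, that the target is not scaled.

```python
import pandas as pd

# Test set after all shared preprocessing, just before the scaling step.
df_test = pd.DataFrame({"area": [50.0, 80.0, 120.0], "rooms": [2, 3, 4]})

# Keep an untouched copy in original units (this costs memory on large frames).
df_test_raw = df_test.copy()

# Hypothetical stand-in for the feature scaling proc_df would apply.
# (A real pipeline would reuse the training-set statistics instead.)
df_scaled = (df_test - df_test.mean()) / df_test.std(ddof=0)

# Hypothetical stand-in for model predictions on df_scaled;
# the target is assumed to be unscaled, so no inverse transform is needed.
preds = pd.Series([101.0, 152.0, 230.0], index=df_scaled.index)

# Attach the predictions to the unscaled copy, aligned on the index.
df_test_raw["prediction"] = preds
```

The copy keeps the features in their original units, so only the predictions need to be joined back on.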


(Aditya) #8

That’s a brilliant way!!
Actually, proc_df also breaks when you have NAs in the test set that aren’t in the train set…