Is it Necessary to Pre-process all Features the same way?


(Will) #1

So I’ve been thinking through a structured data model I’ve built using the fast.ai library that appears to be working. One thing I noticed, and pointed out in a separate thread, is that changing the loss criterion mid-training seems to help the model a lot. Discussion here:

One hypothesis for why this works so well is that it helps get around distributional issues in my data. This got me thinking about my pre-processing steps and what I could do to improve them so the model can learn without my having to change the learning criterion mid-training.

So the “core” of my continuous data is 16 features that have an intrinsic relationship to one another. These 16 features are also used as the targets (when forward-lagged in time). 14 of them range over [0, large], and the other 2 range over [-very large, +very large], though 90% of observations lie within [-medium, +medium]. For the 14 features that are always positive, taking the log works really well to make them look much more normally distributed. What is stumping me is what to do with the other 2 features that can and do take on negative values. Is it OK to take the log of some features and not others? Won’t that destroy any information contained in the correlation between the series that have been logged and those that have not? Maybe I’m overthinking it and should just try a bunch of stuff?
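For concreteness, selectively logging only the positive features might look like the sketch below (the column names and data are hypothetical; `log1p` is used rather than `log` to stay finite at zero):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 14 strictly positive features plus 2 signed ones.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    **{f"pos_{i}": rng.lognormal(size=100) for i in range(14)},
    "signed_0": rng.normal(scale=50.0, size=100),
    "signed_1": rng.normal(scale=50.0, size=100),
})

# Log-transform only the columns that are guaranteed non-negative,
# leaving the two signed columns untouched.
pos_cols = [c for c in df.columns if c.startswith("pos_")]
df[pos_cols] = np.log1p(df[pos_cols])
```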

FYI, the pre-processing I’m doing in the working model is to scale all features by a single relevant scalar value and then apply sklearn’s StandardScaler to all of the features.
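If I understand the setup, that’s roughly the following (the scalar value here is a made-up stand-in for whatever domain-relevant constant is actually used):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16)) * 1000.0  # stand-in for the 16 raw features

scale_factor = 1000.0          # hypothetical "single relevant scalar"
X_scaled = X / scale_factor    # step 1: divide everything by one constant

scaler = StandardScaler()      # step 2: per-feature zero mean, unit variance
X_std = scaler.fit_transform(X_scaled)
```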

Any ideas would be appreciated!


Changing Criterion During Training Provides Good Results
#2

It is okay to preprocess the features in different ways. The important bit is to make sure you apply the same preprocessing to your test data as you do to your training data. The more complex the preprocessing, the higher the chance of applying it incorrectly or introducing information leakage.

Sklearn has this very neat notion of pipelines: you declare a pipeline of transformations, fit it on the training set, and then apply the same fitted pipeline to the test set.