PCA feature extraction

Jehalladay · March 29, 2022, 5:01am

Hi, I am using the fastai library in some network security research since it has a convenient tabular data interface, and I really love it. I like to use various feature extraction techniques, but unfortunately I haven't seen anything on the website that applies to tabular data.

Normally, I want feature extraction techniques like PCA to run inside a pipeline so that we don't leak data: we fit a PCA model on the training data, use it to produce features that augment the training data, and then use the same fitted model to augment the validation data. I searched the forum and found that no one had addressed tabular feature extraction inside the TabularPandas pipeline, so I spent a few hours trying to implement something that works.
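For reference, here is a minimal sketch of that leak-free pattern using scikit-learn on its own, outside fastai. The synthetic data, the column names, and the `add_pca` helper are assumptions made up for illustration; they are not part of the proc below.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Synthetic continuous data standing in for a real tabular dataset (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
train_df, valid_df = train_test_split(df, test_size=0.25, random_state=0)

# Fit PCA on the training rows only, so no validation statistics leak in.
n_comps = 3
pca = PCA(n_components=n_comps).fit(train_df)
pca_cols = [f"pca_{i+1}" for i in range(n_comps)]

def add_pca(frame):
    "Hypothetical helper: return a copy of `frame` with the components appended as columns."
    comps = pd.DataFrame(pca.transform(frame[df.columns]), columns=pca_cols, index=frame.index)
    return pd.concat([frame, comps], axis=1)

# The same train-fitted PCA augments both splits.
train_aug, valid_aug = add_pca(train_df), add_pca(valid_df)
print(train_aug.shape, valid_aug.shape)
```

The key point is that `fit` only ever sees the training rows, while `transform` is applied to both splits.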

The method I came up with is to pass a proc to TabularPandas that extracts the PCA features and then concatenates n of them with the existing features. However, I cannot figure out a way to apply the proc to the categorical features because they aren't encoded yet when the proc runs (I think). The continuous features have already been normalized by then, but Categorify seems to be applied lazily or something.

The proc is functional, however, so if any of you would like to implement PCA feature extraction or a similar method (autoencoders, t-SNE, etc.) on continuous data, it would be fairly easy to modify this code. I have only tested this on tabular data since I don't handle other forms that often, so it probably will not work for vision learners.

Can anyone help me figure out why the categorical features haven't been encoded, or tell me a way to access the encoded data?

import pandas as pd
from sklearn.decomposition import PCA
from fastcore.all import store_attr, Transform


class PCA_tabular(Transform):
    '''
        Implements a PCA feature extraction step for tabular data.
        On setup, we fit a PCA on the training data, then extract n_comps components from the entire dataset;
            the components are then added to the dataframe as new columns
    '''

    def __init__(self, n_comps=3, add_col=True):
        store_attr()

    def setups(self, to, **kwargs):
        # Fit on the training split only so validation rows do not influence the components
        self.pca = PCA(n_components=self.n_comps)
        self.pca.fit(to.train.conts)
        # Project every row (train and valid) with the train-fitted PCA
        pca = pd.DataFrame(self.pca.transform(to.conts))
        pca.columns = [f'pca_{i+1}' for i in range(self.n_comps)]

        # Write the components into the underlying dataframe
        for col in pca.columns:
            to.items[col] = pca[col].values.astype('float32')

        # Register the new columns as continuous features
        if self.add_col:
            for i in range(self.n_comps):
                if f'pca_{i+1}' not in to.cont_names: to.cont_names.append(f'pca_{i+1}')

        return self(to)

matdmiller · March 29, 2022, 7:53am

I’ll preface this by stating that I have not worked on any projects with tabular data or PCA, so I may not be interpreting your question correctly, but I believe what you’re after are the embeddings for the categorical variables. I believe these are stored within the model and looked up during the forward pass of the model, and are not part of the dataloader. You should be able to access them from the model in your learner in the embeds property.

github.com

fastai/fastai/blob/f91e058f500fdcebb9af74654bf14a2edc430cc0/fastai/tabular/model.py#L28

    n_cat = len(classes[n])
    sz = sz_dict.get(n, int(emb_sz_rule(n_cat)))  # rule of thumb
    return n_cat,sz

# Cell
def get_emb_sz(to, sz_dict=None):
    "Get default embedding size from `TabularPreprocessor` `proc` or the ones in `sz_dict`"
    return [_one_emb_sz(to.classes, n, sz_dict) for n in to.cat_names]

# Cell
class TabularModel(Module):
    "Basic model for tabular data."
    def __init__(self, emb_szs, n_cont, out_sz, layers, ps=None, embed_p=0.,
                 y_range=None, use_bn=True, bn_final=False, bn_cont=True, act_cls=nn.ReLU(inplace=True),
                 lin_first=True):
        ps = ifnone(ps, [0]*len(layers))
        if not is_listy(ps): ps = [ps]*len(layers)
        self.embeds = nn.ModuleList([Embedding(ni, nf) for ni,nf in emb_szs])
        self.emb_drop = nn.Dropout(embed_p)
        self.bn_cont = nn.BatchNorm1d(n_cont) if bn_cont else None
        n_emb = sum(e.embedding_dim for e in self.embeds)

Jehalladay · March 30, 2022, 1:46am

Ah, I see, that’s exactly what I needed to know. Unfortunately, we need the encodings before they reach the model to perform feature extraction, but since I now know it isn’t possible, I think we could just create our own solely for the purpose of feature extraction.

By feature extraction, I just mean deriving values from the dataset to augment each sample. This can be as simple as averaging two values from a sample, or it can mean building a model from the samples and having it produce an output for each one. We run into data leakage, though, if we train this on the whole dataset. To combat this, we train only on the training set and then use that model to produce new columns for each sample in the entire dataset. This is a really common practice for tabular data, although it may be more correct to call it feature engineering; I am not sure.

In my experiments, though, I typically get better results when I extract the first 5 or 6 principal components to augment my dataset. This gives the DNN more features to chew on, which seems to reduce overfitting. I’ll try to prepare some side-by-side results sometime to show the difference.

matdmiller · March 31, 2022, 5:33pm

Why do you need the embeddings to do PCA? The dataloader will give you the value index of the categorical variables. The embeddings are random until the model is trained, so I’m not seeing why you want them before the model is trained. Once the model is trained, you can pull them out of the model and look them up using the value indexes for the categorical variables from the dataloader. It is important to have a validation set that is separate from the training dataset, but I’m not clear on why you are adding model outputs as new column(s) to your entire dataset (and it sounds like you’re then using it as the validation/test set)? This workflow does not sound familiar from the fastai lessons on tabular data.

Have you checked out the fast.ai tabular lesson (Fast AI Video Viewer)?

Jehalladay · April 3, 2022, 7:05am

Well, before you answered, I was under the assumption that Categorify one-hot encoded the categorical variables, but it seems we train an embedding instead. The values available for the categorical variables are numeric, and I think they were mapped to integer codes, but a more accurate set of components can be generated from their one-hot encoded counterparts. We have a workaround though: we one-hot encode the integer-mapped values inside the proc and discard the encoding afterwards.
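A rough sketch of that workaround, written against plain pandas and scikit-learn instead of the actual proc. The column names, the synthetic data, the row-based split, and the `pca_input` helper are all assumptions for illustration; inside TabularPandas the same steps would run on the integer-coded categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "proto": rng.integers(0, 3, size=n),    # integer-coded categorical (hypothetical)
    "service": rng.integers(0, 5, size=n),  # integer-coded categorical (hypothetical)
    "duration": rng.normal(size=n),
    "bytes": rng.normal(size=n),
})
cat_names, cont_names = ["proto", "service"], ["duration", "bytes"]
train_idx = np.arange(0, 90)  # first 90 rows play the role of the training split

# Fit the encoder on training rows only; unseen categories become all-zero rows.
ohe = OneHotEncoder(handle_unknown="ignore").fit(df.loc[train_idx, cat_names])

def pca_input(frame):
    "Hypothetical helper: one-hot categoricals stacked next to the continuous columns."
    onehot = ohe.transform(frame[cat_names]).toarray()
    return np.hstack([onehot, frame[cont_names].to_numpy()])

# Fit PCA on the training rows, then project every row.
n_comps = 3
pca = PCA(n_components=n_comps).fit(pca_input(df.loc[train_idx]))
comps = pca.transform(pca_input(df))

# Only the components are kept; the temporary one-hot matrix is discarded.
for i in range(n_comps):
    df[f"pca_{i+1}"] = comps[:, i].astype("float32")
```

The dataframe ends up with the original integer-coded columns plus the new `pca_*` columns, so the embedding-based model still sees the categoricals in the form it expects.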

Extracting the first n principal components is pretty standard when training shallow machine learning models on tabular data, as it’s a convenient method of dimensionality reduction as well as a good way to create new columns of data from the existing ones. These new columns can give the classification model a better view of each observation and increase accuracy and other key metrics.

I would do this before I ever pass the data frame to the dataloader, but I want to fit the PCA model on the training data only and then use it to generate features for the entire dataset. This prevents data leakage: if we fit the PCA on the entire dataset and then generated the new components, the training set would contain features from a model that had seen the validation set, possibly giving an impression of better performance that won’t transfer to real-world deployment.

These features are just added as new columns, so now the model will see x+n features for each observation instead of x features. Sometimes this improves the capability of the model being trained and sometimes it doesn’t, but it’s a very valuable tool for improving a model’s performance.

As for your question, no, I’m new to this community so I haven’t watched the lessons, but I am very familiar with roughly 60-70% of the API. We have been using fastai for the last 3 years or so in our research lab because of the deployment speed and easy scheduling. I thought it was odd that there weren’t any feature engineering tools built specifically for the dataloader pipeline, since that is pretty standard in scikit-learn and other frameworks, so we have been working on building our own. Those videos look interesting though; I will make sure to watch them.

We also use the dataloader pipeline for passing train and test splits to other models like XGBoost and random forests, as well as for ensembling, so being able to put these features into the pipeline simplifies our experiment flow.
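For comparison, this is roughly what the scikit-learn version of the same idea looks like, with the original features and the PCA components combined in one pipeline ahead of a random forest. It is only a sketch on synthetic data; the step names and hyperparameters are arbitrary assumptions, and it does not involve the fastai dataloader.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Synthetic binary classification data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# x original features + n principal components, side by side.
augment = FeatureUnion([
    ("raw", FunctionTransformer()),  # identity: passes the original features through
    ("pca", PCA(n_components=3)),    # fitted only on whatever data `fit` receives
])
model = Pipeline([
    ("augment", augment),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Fitting the pipeline fits the PCA on the training split only.
model.fit(X_train, y_train)
score = model.score(X_valid, y_valid)
augmented = model.named_steps["augment"].transform(X_valid)
print(score, augmented.shape)
```

Because the PCA lives inside the pipeline, calling `fit` on the training split is enough to keep the validation rows out of the component estimation.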