Optimizing Tabular Data vs. LightGBM

For the Santander Customer Transaction Prediction competition, congrats to @fl2o, who took 1st place!

Looking back at it, it appears that a NN with some LightGBM blending won the day! However, they used LightGBM all the way to the end to determine feature importance.

I’m still poring through the code here and will post a Jupyter notebook on it shortly.
https://www.kaggle.com/fl2ooo/nn-wo-pseudo-1-fold-seed

6 Likes

The creator of LightGBM introduced DeepGBM: categorical features are embedded and, along with the numerical features, fed into a NN. Very similar to fastai tabular, and very interesting. @muellerzr
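The core idea, as I understand it, in a minimal PyTorch sketch (all layer sizes here are made up for illustration): each categorical gets an embedding, the embeddings are concatenated with the continuous features, and the result goes through an MLP, much like fastai's TabularModel.

import torch
import torch.nn as nn

class EmbedConcatNet(nn.Module):
    "Toy model: embed the categoricals, concat with continuous, feed an MLP."
    def __init__(self, emb_szs, n_cont, out_sz):
        super().__init__()
        # one embedding table per categorical variable: (cardinality, emb_dim)
        self.embeds = nn.ModuleList([nn.Embedding(c, d) for c, d in emb_szs])
        n_emb = sum(d for _, d in emb_szs)
        self.mlp = nn.Sequential(nn.Linear(n_emb + n_cont, 64), nn.ReLU(),
                                 nn.Linear(64, out_sz))

    def forward(self, x_cat, x_cont):
        x = torch.cat([e(x_cat[:, i]) for i, e in enumerate(self.embeds)], dim=1)
        return self.mlp(torch.cat([x, x_cont], dim=1))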

7 Likes

James, I am back at it again learning the fastai library V2, and my work is finally giving us a GPU, so it’s time for me to at least use DL as an ensemble with other algorithms for tabular data. I was wondering if you had an example of something like this that allows fastai V2 to use a custom model.

I am going to re-read the tabular section of the fastai book because I believe it’s covered there, which would mean that just some of the code in this custom model might need to change.

i.e. bn_drop_lin in favor of LinBnDrop:

class LinBnDrop(nn.Sequential):
    "Module grouping `BatchNorm1d`, `Dropout` and `Linear` layers"
    def __init__(self, n_in, n_out, bn=True, p=0., act=None, lin_first=False):
        layers = [BatchNorm(n_out if lin_first else n_in, ndim=1)] if bn else []
        if p != 0: layers.append(nn.Dropout(p))
        lin = [nn.Linear(n_in, n_out, bias=not bn)]
        if act is not None: lin.append(act)
        layers = lin+layers if lin_first else layers+lin
        super().__init__(*layers)


def bn_drop_lin(n_in:int, n_out:int, bn:bool=True, p:float=0., actn:Optional[nn.Module]=None):
    "Sequence of batchnorm (if `bn`), dropout (with `p`) and linear (`n_in`,`n_out`) layers followed by `actn`."
    layers = [nn.BatchNorm1d(n_in)] if bn else []
    if p != 0: layers.append(nn.Dropout(p))
    layers.append(nn.Linear(n_in, n_out))
    if actn is not None: layers.append(actn)
    return layers
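
For anyone else making the same swap, a minimal sketch of the difference (sizes made up): bn_drop_lin returns a plain list of layers that you unpack yourself, while LinBnDrop is already an nn.Sequential you can use directly.

act = nn.ReLU(inplace=True)

# v1 style: bn_drop_lin returns a plain list, so unpack it into a Sequential yourself
head_v1 = nn.Sequential(*bn_drop_lin(100, 50, bn=True, p=0.1, actn=act))

# v2 style: LinBnDrop is itself an nn.Sequential, usable directly
head_v2 = LinBnDrop(100, 50, bn=True, p=0.1, act=act)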

Anyways happy learning everyone.

You are following all the right steps, so it’s likely something with the model itself. To check, you should grab a batch of data and feed it to the model before training (I do this all the time in custom model scenarios; it avoids headaches). See below for an example with a tabular model:

batch = next(iter(learn.dls[0])) # batch of data from train
with torch.no_grad():
  learn.model.eval()
  learn.model.cuda()
  out = learn.model(*batch[:2])
1 Like

Funnily enough, I just came across your Kaggle post, and I was thinking about going through it, adjusting it to my data set, and then allowing myself to comment as I type out the code. I am a huge fan of this library and now have the ability to fully use it again. When skimming https://www.kaggle.com/muellerzr/fastai-v2-starter-code, did you write out the source code to give insight into what is happening inside of fastai?

I will most def share what I learn here so those that are in my shoes can avoid the headaches.

You can totally ignore this, as it will be the next thing I do, but I know you are extremely active in the fastai community and have a passion for tabular data (I follow you on Twitter :slight_smile:).
One question I do have: do you have a good example of the productionalization workflow? One thing I do with my data comes from my past learnings from Jeremy Howard (not tagging him because he doesn’t need to be here), but in fastai it isn’t clear how to go about this, as it’s not feasible (actually, not ideal) to have the training data inside of the production environment when it’s behind an API call.

import pandas.api.types as types

class Normalize:
    """
    Normalizes all numeric data columns in a pandas DataFrame
    """
    @staticmethod
    def apply_train(df, cont_vars):
        """Compute the means and stds of `cont_vars` columns to normalize them"""
        means, stds = {}, {}
        for n in cont_vars:
            assert types.is_numeric_dtype(df[n]), f"Can't normalize '{n}' column as it isn't numerical."
            means[n], stds[n] = df[n].mean(), df[n].std()
            df[n] = (df[n] - means[n]) / (1e-7 + stds[n])
        return df, means, stds

    @staticmethod
    def apply_test(df, means, stds, cont_vars):
        """Normalize `cont_vars` with the same statistics as in `apply_train`"""
        for n in cont_vars:
            df[n] = round((df[n] - means[n]) /
                          (1e-7 + stds[n]), 7).astype('float32')

How I maintain the right level of data integrity is by applying the same statistics I computed on the training set to the validation and test data:


df, means, stds = normalize.apply_train(df, cont_vars)

# send the means, stds, etc. to Azure blob storage, then pull them back at inference time
normalize.apply_test(df, means, stds, cont_vars)
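
A minimal sketch of that persistence step, with local JSON files standing in for the Azure blob calls (norm_stats.json is a made-up name):

import json

# training environment: save the fitted statistics (no training data needed);
# cast to plain floats, since pandas returns numpy scalars that json rejects
stats = {'means': {k: float(v) for k, v in means.items()},
         'stds':  {k: float(v) for k, v in stds.items()}}
with open('norm_stats.json', 'w') as f:
    json.dump(stats, f)

# production environment (e.g. inside the API): load and apply
with open('norm_stats.json') as f:
    stats = json.load(f)
Normalize.apply_test(df, stats['means'], stats['stds'], cont_vars)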

With TabularPandas it isn’t clear how to save these statistics. That might get resolved by the export of the model, but I have to be thinking about this as I go through it.

Thanks a million for all your hard work; I know the community is thankful.

Fun fact: I wrote some Walk with fastai articles on just this :wink:

Exporting your tabular pandas object: https://walkwithfastai.com/tab.export

Using custom statistics: https://walkwithfastai.com/tab.stats
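
The short version of the workflow those articles cover, as a minimal sketch (assuming a trained TabularLearner named learn and a new DataFrame new_df): export serializes the model plus the fitted procs with no training data attached, so the production side only needs the pickle.

from fastai.tabular.all import *

# training environment: serializes the model plus the fitted procs, no data
learn.export('model.pkl')

# production environment (e.g. inside the API)
learn = load_learner('model.pkl')
dl = learn.dls.test_dl(new_df)    # re-applies the training Categorify/Normalize stats
preds, _ = learn.get_preds(dl=dl)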

More is to come; I just need some time after midterms to actually make the rest of what I have planned happen.

Oh goodness no, I never do that for general debugging; too much time wasted trying to reinvent the wheel. I look at the last three blocks of the stack trace, and that can paint a pretty good picture. In this case pandas was using too much memory, which gave me OOM errors. Checking the init for Tabular showed that fastai would override some datatypes, so there was an interim PR (at the time) which fixed that somewhat.

1 Like

Perfect, I will take a look at that, and I will definitely send you anything I end up using so that you can give back to the community. I am all about giving back to this community in any way that I can.

I will definitely be using this moving forward, so thank you. I can see there is something wrong here; I wanted to compare the standard tabular approach in fastai to this approach and then really dig into the differences between them.

It seems some adjustments need to be made to the structure of the model, as it was written two years ago.

For anyone wondering what model I am talking about: the Kaggle post above. Slight changes were made already, as some of the functions are deprecated, e.g. embedding --> Embedding. A simple change, but it’s a change for fastai2.

# This is the NN structure, starting from the fast.ai TabularModel.
# Assumes `from fastai.tabular.all import *` (for ifnone, listify, Embedding, Tensor)
# plus the bn_drop_lin helper shown earlier.
class my_TabularModel(nn.Module):
    "Basic model for tabular data."
    def __init__(self, emb_szs, n_cont, out_sz, layers, ps=None,
                 emb_drop:float=0., y_range=None, use_bn:bool=True, bn_final:bool=False,
                 # (hidden, output) sizes; plain ints would break the indexing below
                 cont_emb=(2, 2), cont_emb_notu=(2, 2)):
        
        super().__init__()
        # "Continuous embedding NN for raw features"
        self.cont_emb = cont_emb[1]
        self.cont_emb_l = torch.nn.Linear(1 + 2, cont_emb[0])
        self.cont_emb_l2 = torch.nn.Linear(cont_emb[0], cont_emb[1])
        
        # "Continuous embedding NN for "not unique" features". cf #1 solution post
        self.cont_emb_notu_l = torch.nn.Linear(1 + 2, cont_emb_notu[0])
        self.cont_emb_notu_l2 = torch.nn.Linear(cont_emb_notu[0], cont_emb_notu[1])
        self.cont_emb_notu = cont_emb_notu[1]
            
        ps = ifnone(ps, [0]*len(layers))
        ps = listify(ps)*len(layers)
        
        # Embedding for "has one" categorical features, cf #1 solution post
        self.embeds = Embedding(emb_szs[0][0], emb_szs[0][1])
        
        # At first we included information about the variable being processed (to extract feature importance). 
        # It works better using a constant feat (kind of intercept)
        self.embeds_feat = Embedding(201, 2)
        self.embeds_feat_w = Embedding(201, 2)
        
        self.emb_drop = nn.Dropout(emb_drop)
        
        n_emb = self.embeds.embedding_dim
        n_emb_feat = self.embeds_feat.embedding_dim
        n_emb_feat_w = self.embeds_feat_w.embedding_dim
        
        self.n_emb, self.n_emb_feat, self.n_emb_feat_w, self.n_cont,self.y_range = n_emb, n_emb_feat, n_emb_feat_w, n_cont, y_range
        
        sizes = self.get_sizes(layers, out_sz)
        actns = [nn.ReLU(inplace=True)] * (len(sizes)-2) + [None]
        layers = []
        # `actns` gives us the ability to add a ReLU(inplace=True) after each linear layer
        for i,(n_in,n_out,dp,act) in enumerate(zip(sizes[:-1],sizes[1:],[0.]+ps,actns)):
            layers += bn_drop_lin(n_in, n_out, bn=use_bn and i!=0, p=dp, actn=act)
            
        self.layers = nn.Sequential(*layers)
        self.seq = nn.Sequential()
        
        # Input size for the NN that predicts weights
        inp_w = self.n_emb + self.n_emb_feat_w + self.cont_emb + self.cont_emb_notu
        # Input size for the final NN that predicts output
        inp_x = self.n_emb + self.cont_emb + self.cont_emb_notu
        
        # NN that predicts the weights
        self.weight = nn.Linear(inp_w, 5)
        self.weight2 = nn.Linear(5,1)
        
        mom = 0.1
        self.bn_cat = nn.BatchNorm1d(200, momentum=mom)
        self.bn_feat_emb = nn.BatchNorm1d(200, momentum=mom)
        self.bn_feat_w = nn.BatchNorm1d(200, momentum=mom)
        self.bn_raw = nn.BatchNorm1d(200, momentum=mom)
        self.bn_notu = nn.BatchNorm1d(200, momentum=mom)
        self.bn_w = nn.BatchNorm1d(inp_w, momentum=mom)
        self.bn = nn.BatchNorm1d(inp_x, momentum=mom)
        
    def get_sizes(self, layers, out_sz):
        return [self.n_emb + self.cont_emb_notu + self.cont_emb] + layers + [out_sz]

    def forward(self, x_cat:Tensor, x_cont:Tensor) -> Tensor:
        b_size = x_cont.size(0)
        
        # embedding of has one feat
        x = [self.embeds(x_cat[:,i]) for i in range(200)]
        x = torch.stack(x, dim=1)
        
        # embedding of intercept. It was embedding of feature id before
        x_feat_emb = self.embeds_feat(x_cat[:,200])
        x_feat_emb = torch.stack([x_feat_emb]*200, 1)
        x_feat_emb = self.bn_feat_emb(x_feat_emb)
        x_feat_w = self.embeds_feat_w(x_cat[:,200])
        x_feat_w = torch.stack([x_feat_w]*200, 1)
        
        # "continuous embedding" of raw features
        x_cont_raw = x_cont[:,:200].contiguous().view(-1, 1)
        x_cont_raw = torch.cat([x_cont_raw, x_feat_emb.view(-1, self.n_emb_feat)], 1)
        x_cont_raw = F.relu(self.cont_emb_l(x_cont_raw))
        x_cont_raw = self.cont_emb_l2(x_cont_raw)
        x_cont_raw = x_cont_raw.view(b_size, 200, self.cont_emb)
        
        # "continuous embedding" of not unique features
        x_cont_notu = x_cont[:,200:].contiguous().view(-1, 1)
        x_cont_notu = torch.cat([x_cont_notu, x_feat_emb.view(-1,self.n_emb_feat)], 1)
        x_cont_notu = F.relu(self.cont_emb_notu_l(x_cont_notu))
        x_cont_notu = self.cont_emb_notu_l2(x_cont_notu)
        x_cont_notu = x_cont_notu.view(b_size, 200, self.cont_emb_notu)

        x_cont_notu = self.bn_notu(x_cont_notu)
        x = self.bn_cat(x)
        x_cont_raw = self.bn_raw(x_cont_raw)

        x = self.emb_drop(x)
        x_cont_raw = self.emb_drop(x_cont_raw)
        x_cont_notu = self.emb_drop(x_cont_notu)
        x_feat_w = self.bn_feat_w(x_feat_w)
        
        # Predict a weight for each of the previous embeddings
        x_w = torch.cat([x.view(-1,self.n_emb),
                         x_feat_w.view(-1,self.n_emb_feat_w),
                         x_cont_raw.view(-1, self.cont_emb), 
                         x_cont_notu.view(-1, self.cont_emb_notu)], 1)

        x_w = self.bn_w(x_w)

        w = F.relu(self.weight(x_w))
        w = self.weight2(w).view(b_size, -1)
        w = torch.nn.functional.softmax(w, dim=-1).unsqueeze(-1)

        # weighted average of the differents embeddings using weights given by NN
        x = (w * x).sum(dim=1)
        x_cont_raw = (w * x_cont_raw).sum(dim=1)
        x_cont_notu = (w * x_cont_notu).sum(dim=1)
        
        # Use NN on the weighted average to predict final output
        x = torch.cat([x, x_cont_raw, x_cont_notu], 1) if self.n_emb != 0 else x_cont
        x = self.bn(x)
            
        x = self.seq(x)
        x = self.layers(x)
        # squash into y_range if given, then return (the post truncated these lines;
        # this is the standard fastai TabularModel ending)
        if self.y_range is not None:
            x = (self.y_range[1] - self.y_range[0]) * torch.sigmoid(x) + self.y_range[0]
        return x
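
For reference, a minimal sketch of instantiating it; every size here is hypothetical, but note that as written the model expects x_cont with 400 columns (200 raw + 200 "not unique") and x_cat with 201 columns (200 "has one" flags plus the intercept id):

model = my_TabularModel(
    emb_szs=[(3, 16)],        # "has one" categorical: 3 levels -> 16-dim embedding
    n_cont=400,               # 200 raw + 200 "not unique" continuous features
    out_sz=2,
    layers=[64, 32],
    ps=[0.1, 0.1],
    cont_emb=(50, 10),        # (hidden, output) of the raw-feature embedding NN
    cont_emb_notu=(50, 10))   # same for the "not unique" embedding NN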
1 Like

This isn’t me not trusting fastai to give me the results, but I would like to be able to check the get_preds function, and the indexes come out of order. I looked into your examples, and I guess they are with an older version, as order=True isn’t an option to get the validation set back out in order.

# BaseTabularModel is the fastai default TabularModel, just with my notes
model = BaseTabularModel(emb_szs, len(to.cont_names), out_sz=to.c, layers=[1000, 550], ps=[0.001,0.01], embed_p=0.3,
                         y_range=None, use_bn=True)
                         y_range=None, use_bn=True)
gc.collect()

opt_func = partial(Adam, wd=0.01, eps=1e-5)
learn = TabularLearner(dls, model, opt_func=opt_func, metrics=[accuracy, RocAucBinary(), BalancedAccuracy()])

# use your trick
batch = next(iter(learn.dls[0])) # batch of data from train
with torch.no_grad():
    learn.model.eval()
    learn.model.cuda()
    out = learn.model(*batch[:2])

learn.fit_one_cycle(1, 1e-3, wd=0.2)

learn.validate(dl=dls.valid)

inputs, probs, preds = learn.get_preds(with_input=True)

cm = confusion_matrix(to.valid.y, to_np(preds[:, 0]))
logger.info("Accuracy For Each Class")
logger.info(f'{cm.diagonal()/cm.sum(axis=1)}')
logger.info(f'{cm}')
logger.info(f'{classification_report(to.valid.y, to_np(preds[:, 0]))}')
fpr, tpr, thresholds = roc_curve(to.valid.y, to_np(probs[:, 1]))
logger.info(f'AUC {auc(fpr, tpr)}')

Results:

INFO:noshow.imports:Accuracy For Each Class
INFO:noshow.imports:[1. 1.]
INFO:noshow.imports:[[56642     0]
                     [    0  4860]]
INFO:noshow.imports:              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56642
           1       1.00      1.00      1.00      4860

    accuracy                           1.00     61502
   macro avg       1.00      1.00      1.00     61502
weighted avg       1.00      1.00      1.00     61502

INFO:noshow.imports:AUC 0.7524289058723166

It’s odd that this is giving the results above. I might have a variable inside of the dataframe that is causing data leakage, but the model is giving a perfect separation of zeros and ones for the classification. I am using a Kaggle data set, but it’s baby steps.

As much as I would love the model to be perfect, it isn’t, haha. Any ideas here?

I have tried using this as well; to be clear, test is a pandas DataFrame.

dl = learn.dls.test_dl(test, with_labels=False)
learn.get_preds(dl=dl)

All of which gives me the same result, that Normalize isn’t callable, and I am not sure why. I have tried TabularPandas and TabularDataLoaders, each giving the same issue. Any thoughts here?

Not sure what’s going on with the Normalize bits. What version are you running of both fastai and fastcore? And are you using any external modules/sublibraries?

Yes, I have a custom Normalize class that is based off the fastai Normalize from version 1, and I have been using it for most of my projects at work. This one is one of those "dang it!!" moments. Thank the lord for other people making you think.

Solution: rename your custom Normalize to Normalizer, so it no longer shadows fastai’s own Normalize proc.
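
For anyone hitting the same thing, the clash in a minimal sketch:

from fastai.tabular.all import *   # the star import brings in fastai's own Normalize proc

class Normalize:       # defining this afterwards shadows the fastai proc, so
    ...                # procs=[Categorify, FillMissing, Normalize] hands
                       # TabularPandas the wrong object

class Normalizer:      # renamed: fastai's Normalize stays visible
    ...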

This was an all-morning issue; I had to go to the gym to regather myself, and now this works. S/O to Zach for saving another fastai user!!!

1 Like