Questions about tabular data - overfitting? embeddings? (Kaggle Titanic dataset)

I just read notebook 9 from the fastai course and went to try out what I’ve learned, and I’ve run into a few problems.

The full notebook is here by the way.

I have some questions that I can’t answer by myself. I would be very grateful if anyone could help…

  1. When I trained my tabular neural network model with the default layer sizes (200, 100), the network seemed to overfit. I say this because when I scored its predictions on the test set, the accuracy was very low (< 65%), while a simple random forest model gets about 77%.
    So I tried a smaller model (20, 10), since the Titanic dataset is rather small (about 800 rows) and I wanted to keep the network from simply memorizing it. The accuracy went up to about 73% (I’ve sketched the training call below).

But when I look at the training and validation losses, there’s little sign of overfitting.
I’ve learned that the validation loss increasing while the training loss decreases is a sign of overfitting. But even though the model’s validation loss was decreasing the whole time, the predictions on the test set were very disappointing.

I’m not sure how to interpret this. Is this overfitting? Or could there be another problem?
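
For reference, this is roughly how I trained the smaller model and compared the loss curves (a sketch using the standard fastai tabular API; `dls` is the `TabularDataLoaders` built earlier in my notebook):

```
from fastai.tabular.all import *

# smaller network: hidden layers of 20 and 10 units instead of the default (200, 100)
learn = tabular_learner(dls, layers=[20, 10], metrics=accuracy)
learn.fit_one_cycle(10)
learn.recorder.plot_loss()  # training vs. validation loss curves
```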

  2. I tried feeding the embeddings learned by the neural network into the random forest: I added extra columns for each embedding dimension and dropped the original column (see the sketch below). But the random forest trained on the embeddings performs very poorly. When I looked at the feature importances of that random forest, the embeddings’ importances were very low. Sex was an important factor in the original dataset, but in the embedding dataset it was a very unimportant one.

I’m not sure if it’s because the neural network overfit, or the dataset was too small, or my implementation was just poor…
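
This is roughly what I did to build the embedding features (a sketch; `learn` is the trained fastai tabular learner, `df` is the processed training DataFrame with the categorical columns already encoded as integer codes, `cat_names` lists those columns, and `embed_features` is just my own helper name):

```
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def embed_features(learn, df, cat_names):
    df = df.copy()
    for i, col in enumerate(cat_names):
        # fastai's TabularModel keeps its embedding layers in .embeds
        emb_w = learn.model.embeds[i].weight.detach().cpu().numpy()  # (n_categories, emb_dim)
        emb_cols = [f"{col}_{j}" for j in range(emb_w.shape[1])]
        # look up each row's embedding vector and add it as new columns, dropping the original
        emb_df = pd.DataFrame(emb_w[df[col].values], columns=emb_cols, index=df.index)
        df = pd.concat([df.drop(columns=col), emb_df], axis=1)
    return df

df_emb = embed_features(learn, df, cat_names)
rf = RandomForestClassifier(n_estimators=100)
# 'Survived' is the Titanic target column (assuming it is still present in df)
rf.fit(df_emb.drop(columns='Survived'), df['Survived'])
```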

  3. The third one isn’t that important, but the model that I tried to implement by reading the source code in the fastai docs is a lot slower than the model fastai gives us.
    The only big difference I could find is that I use for loops instead of list comprehensions. Is a plain for loop a very inefficient way of looping here? (I’ve put the list-comprehension equivalent after my code for comparison.)
```
import torch
from torch import nn
from fastai.torch_core import Module  # fastai's thin nn.Module wrapper

class TabModel(Module):
    def __init__(self, emb_sz, cat_n, cont_n, layers):
        super(TabModel, self).__init__()
        self.cat_n = cat_n
        self.cont_n = cont_n
        # one embedding per categorical variable: emb_sz is a list of (cardinality, embedding size) pairs
        self.embeddings = nn.ModuleList(nn.Embedding(ni, nf) for ni, nf in emb_sz)
        cat_len = sum(nf for _, nf in emb_sz)  # total width of the concatenated embeddings
        self.emb_drop = nn.Dropout(0.1)
        self.linear_drop = nn.Dropout(0.5)
        model = []
        # input block: BatchNorm over the concatenated features, dropout, first linear layer
        model.append(nn.BatchNorm1d(cat_len+cont_n))
        model.append(self.linear_drop)
        model.append(nn.Linear(cat_len+cont_n, layers[0]))
        model.append(nn.Mish())
        # one hidden block per intermediate layer size
        for i in range(len(layers)-2):
            model.append(nn.BatchNorm1d(layers[i]))
            model.append(self.linear_drop)
            model.append(nn.Linear(layers[i], layers[i+1]))
            model.append(nn.Mish())
        self.linear = nn.Sequential(*model)
        # final layer maps the last hidden size (layers[-2]) to the output size (layers[-1])
        self.final_layer = nn.Linear(layers[-2], layers[-1])
        
    def forward(self, cat, cont):
        # look up the embedding for each categorical column with a plain for loop
        emb = []
        for i in range(self.cat_n):
            emb.append(self.embeddings[i](cat[:, i].long()))
        emb.append(cont)  # append the continuous variables
        x = torch.cat(emb, dim=1).float()  # concatenate embeddings and continuous features
        res = self.linear(x)
        res = self.final_layer(res)
        return res
```
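
And for comparison, this is the list-comprehension version of the embedding lookup I was referring to (roughly what I understand fastai’s `TabularModel.forward` to do, written as a hypothetical alternative `forward` for my class above):

```
    # hypothetical alternative forward() using a list comprehension instead of the for loop
    def forward(self, cat, cont):
        emb = [self.embeddings[i](cat[:, i].long()) for i in range(self.cat_n)]
        x = torch.cat(emb + [cont], dim=1).float()
        return self.final_layer(self.linear(x))
```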