Success with categorical entity embeddings?

Has anyone had success using embeddings for categorical variables (as in the Rossmann notebook, lesson 14)?

I have been testing the approach on a predictive problem at work with 56 continuous and 8 categorical variables (68 categories in total), but I haven't seen any improvement over just using XGBoost with one-hot encoded variables. I've tested a ton of different architectures (number of layers, dropout %, batch norm) and the results all tend to be the same.

My theory is that the continuous variables in my problem are providing most of the predictive power, and that the slight reduction in dimensions for my categorical variables doesn't do much. I was a little discouraged that this approach didn't work here, so I was hoping other people might have success stories to share.

Hey Phil,

I have nothing concrete to offer here other than that Lesson 14 got me excited to try this approach on high-dimensional categorical variables for use in downstream (linear) models. Have you looked at the variable importance rankings from xgboost to test your theory? One thing to note is that the xgboost algorithm tends to favor variables with lots of split points, so I think its importance rankings are somewhat susceptible to that weakness.

Another thing to note is that tree-based algorithms don't actually need categorical variables to be one-hot encoded, so long as a sufficient number of sufficiently deep trees are grown. Of course, the categories still need to be integer-encoded. I'd have to think about it a bit more, but perhaps that gives us a clue as to why entity embeddings aren't all that useful for downstream tree-based algorithms in your case. Although, I seem to recall the Rossmann 3rd-place winners published some metrics in their paper to the contrary. It might be worth building a few quick and dirty models of another type (lasso, random forest?) to see if the embeddings offer improvement.
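For example, testing both ideas might look roughly like this (a rough sketch; the file and column names are made up, and I'm using a regressor as a stand-in for whatever your target is):

    # Rough sketch (made-up names): integer-encode the categoricals --
    # no one-hot needed for tree models -- then check xgboost's rankings.
    import pandas as pd
    import xgboost as xgb

    df = pd.read_csv('train.csv')
    cat_cols = ['cat_a', 'cat_b']  # stand-ins for the 8 categorical columns

    for c in cat_cols:
        df[c] = df[c].astype('category').cat.codes  # integerize only

    X, y = df.drop('target', axis=1), df['target']
    model = xgb.XGBRegressor(n_estimators=300)
    model.fit(X, y)

    # If the continuous variables dominate the rankings, that supports
    # your theory that the categoricals add little.
    for name, imp in sorted(zip(X.columns, model.feature_importances_),
                            key=lambda t: -t[1])[:10]:
        print(name, imp)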

Thanks,
Patrick

I actually tested the xgboost model with only the continuous variables and it performed the same, so I'm thinking it's just a case where the categorical variables don't add much value. I should have checked that from the get-go (like you mentioned with the variable importance), but I got caught up in wanting to try the approach.

Maybe you should try mixing Deep and Wide together.

How will Deep and Wide help here?

Same issue with me. I'm doing a regression problem at work. A NN à la Rossmann got RMSE down to about 0.40; xgboost got down to 0.35. Ever have any more luck on your end?

I realized that on the problem I was working on, I had already encoded a lot of the categorical information into other variables.

For example, I made a variable that held last year's value of the continuous variable I was forecasting (a positive real number) on holidays, and zero otherwise. Xgboost was easily able to segment out holiday-or-not effects from it, as well as bigger/smaller holidays based on the magnitude. So when I went to use embeddings for holidays in my model, most of that information was already captured, and they didn't help.

I had similar variables for special events and day of week, so at the end of the day, I think I had already done enough feature engineering to capture what the embeddings were meant to capture.
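Roughly, the holiday feature looked like this (a sketch with made-up column names):

    # Sketch of the hand-engineered holiday feature (made-up column names):
    # last year's target value on holidays, zero otherwise.
    import pandas as pd

    df = pd.DataFrame({
        'is_holiday':      [1, 0, 1],
        'sales_last_year': [110.0, 85.0, 40.0],
    })

    # Zero vs. non-zero lets a tree split on holiday/non-holiday; the
    # magnitude lets it split on bigger vs. smaller holidays.
    df['holiday_value'] = df['sales_last_year'].where(df['is_holiday'] == 1, 0.0)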

Got it. Yeah, sounds like the same situation as me. Thanks for the info!

Hi, I have been working on some classification datasets from Kaggle, but my structured learner model doesn't seem to be performing as expected. Since I was more focused on the performance of the model itself, I borrowed the feature engineering from an existing kernel in each Kaggle competition and worked from that.

But random forest and XGBoost always come out slightly ahead of the columnar NN module. For example: 1) Titanic: XGBoost = 0.82, random forest = 0.81, structured learner = 0.80; 2) West Nile Virus Prediction: XGBoost = 0.73, random forest = 0.72, structured learner = 0.69.

Here is the code I used:

    # Imports assumed from the fastai 0.7 environment:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.nn.init import kaiming_normal
    # emb_init, ColumnarModelData, BasicModel, Learner come from fastai 0.7

    class MixedInputModel(nn.Module):
        def __init__(self, emb_szs, n_cont, emb_drop, out_sz, szs, drops,
                     y_range=None, use_bn=False):
            super().__init__()
            # one embedding table per categorical variable
            self.embs = nn.ModuleList([nn.Embedding(c, s) for c,s in emb_szs])
            for emb in self.embs: emb_init(emb)
            n_emb = sum(e.embedding_dim for e in self.embs)
            self.n_emb, self.n_cont = n_emb, n_cont

            szs = [n_emb+n_cont] + szs
            self.lins = nn.ModuleList([
                nn.Linear(szs[i], szs[i+1]) for i in range(len(szs)-1)])
            self.bns = nn.ModuleList([
                nn.BatchNorm1d(sz) for sz in szs[1:]])
            for o in self.lins: kaiming_normal(o.weight.data)
            self.outp = nn.Linear(szs[-1], out_sz)
            kaiming_normal(self.outp.weight.data)

            self.emb_drop = nn.Dropout(emb_drop)
            self.drops = nn.ModuleList([nn.Dropout(drop) for drop in drops])
            self.bn = nn.BatchNorm1d(n_cont)
            self.use_bn, self.y_range = use_bn, y_range

        def forward(self, x_cat, x_cont):
            if self.n_emb != 0:
                # look up each categorical column in its embedding table
                x = [e(x_cat[:,i]) for i,e in enumerate(self.embs)]
                x = torch.cat(x, 1)
                x = self.emb_drop(x)
            if self.n_cont != 0:
                x2 = self.bn(x_cont)
                x = torch.cat([x, x2], 1) if self.n_emb != 0 else x2
            for l,d,b in zip(self.lins, self.drops, self.bns):
                x = F.relu(l(x))
                if self.use_bn: x = b(x)
                x = d(x)
            x = self.outp(x)
            if self.y_range:
                # squash the output into the given range
                x = F.sigmoid(x)
                x = x*(self.y_range[1] - self.y_range[0])
                x = x + self.y_range[0]
            return x

    md = ColumnarModelData.from_data_frames('/tmp', trn_df, val_df,
        trn_y[0].astype(np.int64), val_y[0].astype(np.int64), cats, 5,
        test_df=df_test)
    model = MixedInputModel(emb_szs, n_cont=len(df.columns)-len(cats),
        emb_drop=0, out_sz=3, szs=[500], drops=[0.5], use_bn=True).cuda()
    bm = BasicModel(model, 'binary_classifier')
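(For reference, emb_szs is the usual list of (cardinality, embedding-width) pairs; I built it along the lines of the course's rule of thumb, roughly like this:)

    # Assumed construction of emb_szs: one (cardinality, width) pair per
    # categorical, using the rule of thumb width = min(50, (c + 1) // 2).
    cat_szs = [len(df[c].cat.categories) + 1 for c in cats]  # +1 for unknowns
    emb_szs = [(c, min(50, (c + 1) // 2)) for c in cat_szs]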

    class StructuredLearner(Learner):
        def __init__(self, data, models, **kwargs):
            super().__init__(data, models, **kwargs)
            self.crit = F.mse_loss

    learn = StructuredLearner(md, bm)
    learn.crit = F.binary_cross_entropy
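In case it's relevant: I was also thinking of pulling the learned embeddings back out, to feed them to xgboost or a random forest directly, as suggested earlier in this thread. Roughly like this (a sketch, assuming the model defined above and integer-coded categoricals):

    # Sketch: read the learned embedding matrices back out of the trained
    # model, e.g. to use them as features in xgboost / random forest.
    import numpy as np

    emb_mats = [e.weight.data.cpu().numpy() for e in model.embs]

    def embed_cats(x_cat):
        # x_cat: (n_rows, n_cat_cols) array of integer category codes
        return np.hstack([emb_mats[i][x_cat[:, i]]
                          for i in range(x_cat.shape[1])])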

Is there anything I can change in my model to make it generate better results? I am not able to figure out whether this is happening because of a glitch in the input parameters I'm giving it, or whether it's just down to the relative predictive power of the continuous versus categorical features.
