Build a mixed databunch and train an end-to-end model for tabular (categorical + continuous) and text data

(Quan Tran) #1

Hi everyone,

So I have been searching for a way to combine and train tabular + text data with all the good stuff from fastai (databunch API, 1-cycle training, callbacks …) to do some Kaggle competitions, and came across this awesome blog post by @wgpubs in which he created a tabular + text databunch using the databunch API. I changed a few lines of code to get it to work on fastai 1.0.51+ and then built a custom model that combines an RNN/LSTM and an MLP (neural net layers) for end-to-end training.
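For anyone curious what "combine an RNN/LSTM and an MLP" looks like in practice, here is a minimal PyTorch sketch of the general idea, not the actual notebook code: an LSTM encodes the text, a small MLP encodes the continuous tabular features, and their outputs are concatenated before a shared head. All layer sizes and names here are made up for illustration.

```python
import torch
import torch.nn as nn

class TabularTextModel(nn.Module):
    """Illustrative sketch (not the notebook's code): concatenate an LSTM
    text encoding with an MLP encoding of tabular features."""
    def __init__(self, vocab_sz, emb_sz, n_cont, n_out):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, emb_sz)
        self.lstm = nn.LSTM(emb_sz, 64, batch_first=True)
        self.tab = nn.Sequential(nn.Linear(n_cont, 32), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(64 + 32, 50), nn.ReLU(),
                                  nn.Linear(50, n_out))

    def forward(self, text_ids, cont):
        # h holds the last hidden state: (num_layers, bs, 64)
        _, (h, _) = self.lstm(self.emb(text_ids))
        x = torch.cat([h[-1], self.tab(cont)], dim=1)  # (bs, 96)
        return self.head(x)

model = TabularTextModel(vocab_sz=100, emb_sz=16, n_cont=5, n_out=1)
out = model(torch.randint(0, 100, (8, 20)), torch.randn(8, 5))
print(out.shape)  # torch.Size([8, 1])
```

Because both branches live in one module, a single optimizer step backprops through the text and tabular parts together, which is what makes it end-to-end.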

Tested on the Kaggle Mercari competition, it trained successfully and the loss did go down, but very slowly, and the overall result is somewhat underwhelming (middle of the LB). I am planning to debug this next week.

Anyway, I think it works and since there are lots of folks asking for a mixed databunch + model, I hope this can be a starting point. Here is the code and notebook:

I’d love it if someone could test it on other datasets to see whether combining the models the way I did is effective. Thanks!


Combining text and numerical data in ULMFiT
Mixed Multi-Text + Numerical DataBunch
Regression using Fine-tuned Language Model
(Andreas Daiminger) #2

Hi @quan.tran
Thanks a lot for sharing this. I am currently working on a model that extends ULMFiT to take numerical input features into account. I am a machine learning engineer, and I am working on a product to classify customer support tickets based on text + metadata (things like the warehouse status of the order). I am absolutely sure that only the combination of these two feature types can produce a good model, so it is a very good test case.
I cannot share the data with you :-1: . But I will let you know how it goes. :slight_smile:


(Matthew Teschke) #3

Thanks for sharing your work!

Did you compare whether combining text + tabular improves on either one by itself? i.e., does the combined model improve on just the text model?


(Andreas Daiminger) #4

I think it’s key to have a good test dataset, which is why I am excited about trying it out.
The way my dataset is structured, it is absolutely clear that only the combination of text and numerical features can lead to a good model.


(Quan Tran) #5

So I had a few hours to spend today and ran a bunch of different models on this Mercari dataset. Here’s what I found:
(all models were run on 2% of the dataset for the experiment)

  • Only tabular model: .562 on val set

  • Only text model: .599 on val set (with gradual-unfreezing training)
    The model is massively underfitting, and I already tried increasing model complexity with one hidden layer of size 1000. With >1 hidden layers for ULMFiT text regression there is a bug in the fastai code, and my custom fix does not make the model learn anything, so I will come back to this later when I have time.

  • Tabular + text model: .557 on val set (with gradual unfreezing)

So, combining the tabular + text models is the best option, but it’s still very average on the competition LB. The tabular model on 100% of the dataset only yields .548 on the LB (kaggle link for kernel).

I am thinking of analyzing the outputs of the tabular model and the text model so I can add some weights before combining them.
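One simple way to do that weighting, just a sketch of my own and not something from the notebook, is a single learnable gate that blends the two branches' predictions; the gradient then tells you how much each branch is trusted:

```python
import torch
import torch.nn as nn

class WeightedCombine(nn.Module):
    """Hypothetical sketch: blend tabular and text predictions with one
    learned scalar weight (names and design are mine, for illustration)."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(0.0))  # sigmoid(0) = 0.5 at init

    def forward(self, tab_out, text_out):
        a = torch.sigmoid(self.w)  # weight constrained to (0, 1)
        return a * tab_out + (1 - a) * text_out

combine = WeightedCombine()
tab_out, text_out = torch.ones(4, 1), torch.zeros(4, 1)
blended = combine(tab_out, text_out)
print(blended)  # all 0.5 at initialization: an even blend
```

After training, inspecting `sigmoid(w)` also gives a rough read on which branch the model leaned on.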

I also tried out @joshfp’s model from this notebook, and it achieves .543 on the val set with gradual unfreezing (with only 2% of the dataset!). I will probably use/modify his code instead of mine since it runs way faster and achieves a better score.

1 Like

(Andreas Daiminger) #6

Hey @quan.tran!

Just to let you know: I tried out your implementation and it improved my validation accuracy substantially. I trained on text + 1 categorical feature and went from 40% acc with the text-only model to 45% acc with the text + tabular model.
I will have more categorical features available for training soon and I am confident that the model will further improve with more categorical input data. I will let you know how it goes.

Update: I added more metadata to the model (categorical and continuous) and got an even better result: 62% acc.

I noticed that learn.export() doesn’t work with your implementation, and I also had problems making learn.predict() work. If you like, we can team up and I can work on making your implementation production-ready. I could also take a look at how we can compile the trained model with torch.jit.


(Quan Tran) #7

That sounds great! I’d love to collaborate on this. There are still a few things to improve, though: the export and predict functions; my code still runs really slowly compared to the other model; and I left out a small part of the ULMFiT pipeline (the SortishSampler). I will come back to this soon.

This is the other (faster) model implementation I was talking about:
It does not require writing a new ItemList and seems to be better in general. Maybe I will rewrite mine using this one.

1 Like

(Andreas Daiminger) #8

Wow. Version 2 runs much faster. Before, it took around 28 min to train one epoch of the fully unfrozen model, and now it takes only around 10 min!! I used a slightly different model with more fully connected layers on top. My problem is classification, so I used cross-entropy and softmax.
Here is my model head before the softmax layer:

(layers): Sequential(
    (0): BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): Dropout(p=0.5)
    (2): Linear(in_features=800, out_features=400, bias=True)
    (3): ReLU(inplace)
    (4): BatchNorm1d(400, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): Dropout(p=0.4)
    (6): Linear(in_features=400, out_features=200, bias=True)
    (7): ReLU(inplace)
    (8): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): Dropout(p=0.1)
    (10): Linear(in_features=200, out_features=23, bias=True)
)
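For anyone who wants to reproduce that printed head, here is roughly how it would be constructed (a reconstruction from the printout, not the original code; the 800-dim input would be the concatenated text + tabular features, and the 23 outputs are the classes):

```python
import torch
import torch.nn as nn

# Reconstruction of the printed head above: 800-d concatenated features
# down to 23 class logits (CrossEntropyLoss applies log-softmax itself,
# so no explicit softmax layer is needed here).
head = nn.Sequential(
    nn.BatchNorm1d(800), nn.Dropout(0.5), nn.Linear(800, 400), nn.ReLU(inplace=True),
    nn.BatchNorm1d(400), nn.Dropout(0.4), nn.Linear(400, 200), nn.ReLU(inplace=True),
    nn.BatchNorm1d(200), nn.Dropout(0.1), nn.Linear(200, 23),
)
logits = head(torch.randn(16, 800))
print(logits.shape)  # torch.Size([16, 23])
```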

I noticed that some of the default fastai databunch functionality gets lost. But it shouldn’t be too hard to add things like printing data summaries and a property or two.

I hope I have time this week to look into extending your code and making learn.predict and learn.export work.


(Andreas Daiminger) #9

Hey @quan.tran! It’s me again!
I did some minor tweaks to your code to make TabularTextProcessor.process_one work.
Do you want me to submit a pull request so you can review the code?

1 Like

(Quan Tran) #10

Thanks for the help! I have added it to the repo.

1 Like

(Andreas Daiminger) #11

I can confirm [this approach]( ) works better.
It backprops through an entire tabular model before concatenating the outputs of the text and tabular models.

Where does this implementation come from? Did you do this as well?


(Quan Tran) #12

The implementation came from this: which is from the ‘share your work’ thread in v1, I believe. Somehow, combining two different learners (a text learner and a tabular learner) works better than writing one new learner. I will probably look into it a bit more when I have more free time next week.
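The "two learners" idea can be sketched like this in plain PyTorch (an illustration under my own assumptions, not the linked code): take each separately trained model, drop its final output layer, concatenate the remaining feature outputs, and train a fresh head so gradients flow through both bodies.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two separately trained learners' models:
# in practice these would be a fine-tuned text model and a tabular model.
text_model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 1))
tab_model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

class ConcatModel(nn.Module):
    """Sketch: keep each pretrained body (drop the last layer), concatenate
    their features, and train a fresh head end-to-end so the backward pass
    goes through the entire tabular model as well as the text model."""
    def __init__(self, text_model, tab_model):
        super().__init__()
        self.text_body = text_model[:-1]  # outputs 64-d features
        self.tab_body = tab_model[:-1]    # outputs 32-d features
        self.head = nn.Linear(64 + 32, 1)

    def forward(self, text_x, tab_x):
        x = torch.cat([self.text_body(text_x), self.tab_body(tab_x)], dim=1)
        return self.head(x)

model = ConcatModel(text_model, tab_model)
out = model(torch.randn(4, 100), torch.randn(4, 10))
print(out.shape)  # torch.Size([4, 1])
```

Starting from two already-trained bodies may explain the better score: each branch begins with useful features instead of learning both from scratch inside one new learner.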



Hi, silly question, as I’m a bit new to this API: how do I train a classification model instead of a regression model using your code? I can’t seem to figure it out; I’ve tried changing the label_cls to CategoryList and using CrossEntropyFlat, but to no avail.