So I’ve been searching for a way to combine and train on tabular + text data with all the good stuff from fastai (the DataBunch API, 1-cycle training, callbacks …) for some Kaggle competitions, and I came across this awesome blog post by @wgpubs in which he created a tabular + text databunch using the DataBunch API. I changed a few lines of code to get it to work on fastai 1.0.51+ and then built a custom model that combines an RNN/LSTM and an MLP (plain fully connected layers) for end-to-end training.
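For readers who just want the gist of the architecture, here is a minimal sketch of the idea, not the actual code from the repo (layer sizes and names are illustrative): run the text through an embedding + LSTM, run the categorical/continuous columns through embeddings + an MLP, then concatenate the two representations and feed them to a shared head.

```python
import torch
import torch.nn as nn

class TabularTextModel(nn.Module):
    """Illustrative sketch: concatenate a text encoder with a tabular MLP."""
    def __init__(self, vocab_sz, emb_szs, n_cont, out_sz):
        super().__init__()
        # Text branch: token embedding + LSTM, keep the last hidden state
        self.text_emb = nn.Embedding(vocab_sz, 100)
        self.lstm = nn.LSTM(100, 128, batch_first=True)
        # Tabular branch: one embedding per categorical column + continuous features
        self.cat_embs = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        n_emb = sum(nf for _, nf in emb_szs)
        self.tab = nn.Sequential(nn.Linear(n_emb + n_cont, 200), nn.ReLU())
        # Head: combine both branches into one prediction
        self.head = nn.Linear(128 + 200, out_sz)

    def forward(self, cats, conts, text):
        x_cat = torch.cat([e(cats[:, i]) for i, e in enumerate(self.cat_embs)], dim=1)
        x_tab = self.tab(torch.cat([x_cat, conts], dim=1))
        _, (h, _) = self.lstm(self.text_emb(text))
        return self.head(torch.cat([x_tab, h[-1]], dim=1))
```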
Tested on the Kaggle Mercari competition, it trained successfully and the loss did go down, but very slowly, and the overall result is kind of underwhelming (middle of the LB). I plan to debug this next week.
Anyway, I think it works and since there are lots of folks asking for a mixed databunch + model, I hope this can be a starting point. Here is the code and notebook: https://github.com/anhquan0412/fastai-tabular-text-demo
I’d love it if someone could test it on another dataset to see whether combining the models the way I did is effective. Thanks!
Hi @quan.tran
Thanks a lot for sharing this. I am currently working on a model that extends ULMFiT to take numerical input features into account. I am a machine learning engineer at reply.ai, and I am working on a product that classifies customer support tickets based on text + metadata (things like the warehouse status of the order). I am absolutely sure that only the combination of these two feature types can lead to a good model, so it is a very good test case.
I can’t share the data with you, but I will let you know how it goes.
Did you compare whether combining text + tabular improves on either one by itself? I.e., does the combined model improve on just the text data?
I think it’s key to have a good test dataset. This is why I am excited about trying it out.
The way my dataset is structured, it is absolutely clear that only the combination of text and numerical features can lead to a good model.
So I had a few hours to spend today and ran a bunch of different models on this Mercari dataset. Here’s what I found:
(all models were run on 2% of the dataset for these experiments)
Tabular-only model: .562 on the val set
Text-only model: .599 on the val set (with gradual-unfreezing training)
The text model is massively underfitting, and I already tried to increase model complexity with 1 hidden layer of size 1000. With more than 1 hidden layer for ULMFiT text regression there is a bug in the fastai code, and my custom fix doesn’t make the model learn anything, so I will come back to this later when I have time.
Tabular + text model: .557 on the val set (with gradual unfreezing)
So combining the tabular + text models is the best option, but it’s still very average on the competition LB. The tabular model on 100% of the dataset only yields .548 on the LB (kaggle link for kernel).
I am thinking of analyzing the outputs of the tabular model and the text model so I can add some weights before combining them.
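One simple way to do that (purely a sketch, not something from the repo) would be a learnable gate that scales the two branch outputs before they are concatenated and passed to the head:

```python
import torch
import torch.nn as nn

class GatedCombine(nn.Module):
    """Weight the tabular and text representations with a learnable scalar gate."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5, an even mix to start

    def forward(self, x_tab, x_text):
        a = torch.sigmoid(self.alpha)              # keep the weight in (0, 1)
        return torch.cat([a * x_tab, (1 - a) * x_text], dim=1)
```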
Just to let you know: I tried out your implementation and it improved my validation accuracy substantially. I trained on text + 1 categorical feature and went from 40% accuracy with the text-only model to 45% accuracy with the text + tabular model.
I will have more categorical features available for training soon and I am confident that the model will further improve with more categorical input data. I will let you know how it goes.
@quan.tran
Update: I added more metadata to the model (categorical and continuous) and got an even better result: 62% accuracy.
I have noticed that learn.export() doesn’t work with your implementation, and I also had problems getting learn.predict() to work. If you like, we can team up and I can work on making your implementation production-ready. I could also take a look at how we can compile the trained model with the PyTorch JIT (torch.jit).
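For the JIT part, a minimal sketch of what tracing a three-input model could look like, assuming `model` is the trained tabular + text nn.Module with a forward(cats, conts, text) signature (the example shapes below are made up):

```python
import torch

# Hypothetical example inputs matching forward(cats, conts, text)
cats = torch.zeros(1, 2, dtype=torch.long)    # batch of 1, 2 categorical columns
conts = torch.zeros(1, 3)                     # 3 continuous columns
text = torch.zeros(1, 50, dtype=torch.long)   # 50 token ids

model.eval()
traced = torch.jit.trace(model, (cats, conts, text))
traced.save('tabular_text_traced.pt')
```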
That sounds great! I’d love to collaborate on this. There are still a few things to improve though: the export and predict functions, my code still runs really slowly compared to the other models, and I left out a small part of the ULMFiT setup (the SortishSampler). I will come back to this soon.
@quan.tran
Wow. Version 2 runs much faster. Before, it took around 28 min to train one epoch of the fully unfrozen model, and now it takes only around 10 min!! I used a slightly different model with more fully connected layers on top. And since my problem is classification, I used cross-entropy and softmax.
Here is my model head before the softmax layer:
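Roughly speaking, a head with a couple of extra fully connected layers feeding a softmax output can be sketched like this (the input width and layer sizes below are made up, not the exact ones used):

```python
import torch.nn as nn

n_classes = 12  # hypothetical number of classes

# Illustrative head only: 328 is an assumed width of the combined tabular + text features.
head = nn.Sequential(
    nn.BatchNorm1d(328), nn.Dropout(0.3), nn.Linear(328, 512), nn.ReLU(),
    nn.BatchNorm1d(512), nn.Dropout(0.3), nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, n_classes),  # raw logits; softmax/cross-entropy is applied by the loss
)
```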
I noticed that some of the fastai databunch default functionality is getting lost. But it shouldn’t be too hard to add stuff to print out data summaries and add a model.data.classes property.
I hope I have time this week to look into extending your code and making model.predict and model.export work.
Hey @quan.tran! It’s me again!
I made some minor tweaks to your code to get TabularTextProcessor.process_one to work.
Do you want me to submit a pull request so you can review the code?
Hi, silly question, as I’m a bit new to this API. How do I train a classification model instead of a regression model using your code? I can’t seem to figure it out; I’ve tried changing the label_cls to CategoryList and using CrossEntropyFlat, but to no avail.
I just tried classification with the petfinder dataset and it still works. Make sure the label column has an integer type: train['target'] = train['target'].astype(np.int8). If you do this, you don’t even need to worry about the label_cls or the loss function, because the fastai library will auto-detect them. For reference, this is how I set up the learner for the tabular-text classification task:
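As a rough sketch of that setup (the column name `target`, the `data` databunch, and the `model` are assumed to exist already; this is illustrative rather than the exact call from the notebook):

```python
import numpy as np
from fastai.basic_train import Learner
from fastai.metrics import accuracy

# Integer labels let fastai infer a classification task when the databunch is built,
# so CategoryList and CrossEntropyFlat() get picked automatically.
train['target'] = train['target'].astype(np.int8)

# `data` (the mixed tabular + text databunch) and `model` are assumed to exist already.
learn = Learner(data, model, metrics=[accuracy])
```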
Hey @quan.tran
How is it going? Are you still interested in working on this?
Currently it is not possible to make a prediction on a single data point with version 2. This means it is impossible to put the model in production. I am very interested in making this work!!!
So, I’ve got a bunch of pricing lists that I have to put into a specific format constantly. Would it be possible to use this to create a training dataset from the finalized version of formatting, to classify dep_var without matching label columns and predict cont and categorical variables based on the trained dataset?
Hey @Andreas_Daiminger, I was busy with other things and haven’t had a chance to look back at it. Do you have a list of the functionality you want (besides single data point prediction)? I have some time this weekend to play around with version 2 a bit (though I am not sure whether I’ll be able to give v2 all the functionality of v1, because v2 is fundamentally different).
So essentially, I’ve got a database of 48,000 e-commerce items. Vendors send me price sheets with updated pricing information, and the columns are never labeled consistently. The only thing that is generally consistent is the SKU or model number, which I have trained as the dep_var from the database. These SKUs share the same row as the pricing information, but since the column labels can change, it makes things a bit tricky. Any insights would be greatly appreciated!
Hey @quan.tran!
I would like to use v2 in production, so everything related to that would be a top priority.
First single data point prediction and then model.export (difficult … I know!!)
I had a look myself but could not come up with a simple way to make single data point prediction work. If you point me in the right direction, I can help you develop a solution.
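For orientation, single-item inference essentially boils down to running one preprocessed row (categorified/normalized tabular values plus numericalized text) through the model in eval mode. A rough sketch, with all names illustrative:

```python
import torch

def predict_one(model, cats, conts, text):
    """Run one already-preprocessed example through the model; returns raw outputs."""
    model.eval()
    with torch.no_grad():
        # add a batch dimension of size 1 to each input
        out = model(cats.unsqueeze(0), conts.unsqueeze(0), text.unsqueeze(0))
    return out.squeeze(0)
```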
Thanks for keeping interest!