Doubts regarding creating a DataBunch for a combination of multiple models

I have a dataset which contains multiple columns of categorical values and a column of text data.

I was thinking of passing the text data through an RNN/LSTM model, similar to the IMDB example, and the categorical values through a NN model, similar to the Rossmann notebook, then combining the outputs of the two and passing them through a final neural network.

I have doubts about how to create the DataBunch. How should I proceed? What should I try?
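One way to picture the data side of this question is a batch that carries both inputs together. The following is a minimal, library-agnostic sketch in plain Python, not fastai's API: `MixedSample`, `collate_mixed`, and the padding id are hypothetical names and values chosen for illustration. It assumes the categorical columns are already integer-encoded and the text is already numericalized.

```python
from typing import List, NamedTuple, Sequence, Tuple


class MixedSample(NamedTuple):
    """One row: hypothetical container, not a fastai class."""
    cat_codes: List[int]   # one integer code per categorical column
    token_ids: List[int]   # numericalized text, variable length
    label: int


PAD_ID = 1  # assumption: placeholder padding index, adjust to your vocab


def collate_mixed(
    samples: Sequence[MixedSample], pad_id: int = PAD_ID
) -> Tuple[Tuple[List[List[int]], List[List[int]]], List[int]]:
    """Group samples into ((cats, texts), labels), padding texts on the right."""
    max_len = max(len(s.token_ids) for s in samples)
    cats = [list(s.cat_codes) for s in samples]
    texts = [
        list(s.token_ids) + [pad_id] * (max_len - len(s.token_ids))
        for s in samples
    ]
    labels = [s.label for s in samples]
    return (cats, texts), labels


batch = [
    MixedSample(cat_codes=[0, 2], token_ids=[5, 6, 7], label=1),
    MixedSample(cat_codes=[1, 0], token_ids=[8], label=0),
]
(cats, texts), labels = collate_mixed(batch)
print(cats, texts, labels)
```

With a batch shaped like this, the model's forward pass can send `cats` to the tabular branch and `texts` to the RNN branch before concatenating their outputs.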


Thanks to @wgpubs for this article and the accompanying code, which show how to create a MixedTabularLine. This should help anyone who wants both categorical/continuous and text data in the same DataBunch get started. I have only skimmed the article and have not started on the code yet; I will do so next week. But I wanted to express my gratitude and put up a link here for others.

Thanks @wgpubs for your contribution!


Thanks @shaun1 for sharing this article and code.

Thank you, @wgpubs, for writing and sharing such a great blog post. Could you also give a complete example of how to use this dataset for training with a DataBunch?

The GitHub repo is linked in the article.

Thanks! Glad it was helpful.

Your use case is exactly what I was thinking of when I put the demo together.


It’s an awesome example. Well done!

Yes, excellent work! May I ask whether the code in the GitHub repo should still work with later versions of the fastai library (I’m using 1.0.46)? I’m getting an unexpected-argument error (for ‘xtra’) in tabular/data.py.

I’m not experienced enough yet to debug this with confidence!

Any pointers gratefully received.

Thanks.

Replace every occurrence of xtra with inner_df; that parameter was renamed to make it clearer. @wgpubs, you might want to update your repo if you want it to be compatible with v1.0.46 and later.
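The rename described above can be done by hand, or with a small script along these lines. This is only a sketch: `rename_xtra` is a hypothetical helper, and the sample line is made up for illustration rather than taken from the repo. The word-boundary pattern is there so that only the exact identifier xtra is touched.

```python
import re

# Match "xtra" only as a whole identifier, so names such as
# "extra_cols" or "xtras" are left untouched.
XTRA = re.compile(r"\bxtra\b")


def rename_xtra(source: str) -> str:
    """Return source code with every standalone `xtra` renamed to `inner_df`."""
    return XTRA.sub("inner_df", source)


# Made-up example line, not copied from the repo.
old_line = "super().__init__(items, xtra=xtra, extra_cols=cols); self.xtra = xtra"
new_line = rename_xtra(old_line)
print(new_line)
```

To apply it to a file, read the file's text, pass it through `rename_xtra`, and write the result back, ideally after committing or backing up the original.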

Ah, yes. Thanks so much :slight_smile:

Thanks for the heads up.

I’m in the midst of writing up and coding a part 2 … everything will be updated to >=1.0.46


Thanks for the sample, it helps a lot. But when I try to use a custom tokenizer, such as the BERT tokenizer, the process and process_one functions do not seem to work for the text columns. Could you help with that? Thanks

I haven’t looked at this in so long that I don’t think I can be much help right now. There are some articles out there on integrating various Hugging Face tokenizers that may help you build a fastai v1-friendly implementation.

If you get something figured out, send me a PR and I’d be glad to update my repo.

Thanks -wg