Mixed Multi-Text + Numerical DataBunch

I have a project where I have a single text vocab but multiple sentences per sample. I.e. I get 2 parts of a conversation (question, reply, also names). I also have some numerical data (age, for instance). So I want to build a Classifier that uses a language model encoder to encode the 2 sentences and adds the numerical information.

This is simple in plain PyTorch: just get the embeddings from the language model encoder (one text part at a time), concatenate them with the numericals, and pass the result to a dense head for classification.
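A minimal sketch of that plain-PyTorch idea, under stated assumptions: `ToyEncoder` is a hypothetical stand-in for the pretrained language model encoder, `MixedClassifier` is an illustrative name, the mean pooling ignores padding for simplicity, and all sizes are made up.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained language model encoder."""

    def __init__(self, vocab_size=100, emb_dim=16, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, tokens):
        out, _ = self.rnn(self.emb(tokens))  # (batch, seq_len, hidden)
        return out.mean(dim=1)               # simple pooling to (batch, hidden)


class MixedClassifier(nn.Module):
    """Encodes two text parts with one shared encoder and adds numericals."""

    def __init__(self, encoder, enc_dim, n_num, n_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.Linear(2 * enc_dim + n_num, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, question, reply, nums):
        q = self.encoder(question)  # one text part at a time
        r = self.encoder(reply)
        # Concatenate both text encodings with the numerical features.
        return self.head(torch.cat([q, r, nums], dim=1))


model = MixedClassifier(ToyEncoder(), enc_dim=32, n_num=1, n_classes=2)
question = torch.randint(1, 100, (4, 10))  # 4 samples, 10 token ids each
reply = torch.randint(1, 100, (4, 7))      # 4 samples, 7 token ids each
age = torch.rand(4, 1)                     # one numerical feature per sample
logits = model(question, reply, age)       # shape (4, 2)
```

Sharing one encoder between the two text parts matches the single-vocab setup; a real encoder would replace `ToyEncoder` and would likely need padding-aware pooling.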

I’m however trying to do this the ‘fastai’ way and want to create a databunch. None of the text databunches support multiple text columns (they just concatenate them into a single list, which is bad for me), and the data_block API does not appear to allow me to create a mixed-mode databunch.

Does anyone have an example or a site I can look at to see how to implement a mixed text (multiple text columns) + numerical model? It would be something like the Tabular databunch/models that handle categoricals and numericals.

Tnx


You will have to build your custom ItemList for this. There is a tutorial here and a great blog post there.


Thanks @sgugger, looking into this now. The first tutorial link is broken, but I assume it's this one:
https://docs.fast.ai/tutorial.itemlist.html

In any case, thanks for the info, looks like it's exactly what I’m after.


Not gonna lie, I’m finding building a mixed-mode item list with some text columns incredibly difficult. By default all columns are preprocessed using a single set of preprocessors, so I assume I need to override the process function. OK, did that, then everything fails when adding the test set. And basically each error leads back to some convoluted internal temporal or otherwise weird dependency. For instance, what is xtra, and why is my item list magically being constructed with this parameter when calling add_test? Why is my item being created with a numpy array when I’m passing a pandas Series into the constructor, and where did that get intercepted? Who is actually calling the process function? (Search does not help due to the convoluted inheritance hierarchy.)

Anyways, I think I will not do this and will just create a Dataset / DataLoader and use fastai for the training loop. Which is a shame, as now I have to worry about all the NLP boilerplate (tokenisation, numericalisation, etc., which has to interact nicely with the fastai language model vocab/encoder).
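A rough sketch of what such a Dataset / DataLoader could look like, assuming the text has already been tokenised and numericalised into lists of token ids with the language model's vocab. `ConversationDataset`, `pad_collate`, the padding index 0 and the toy data are all hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ConversationDataset(Dataset):
    """Holds two numericalised text columns, numerical features and labels."""

    def __init__(self, questions, replies, nums, labels):
        self.questions, self.replies = questions, replies
        self.nums, self.labels = nums, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return (
            torch.tensor(self.questions[i]),
            torch.tensor(self.replies[i]),
            torch.tensor(self.nums[i], dtype=torch.float32),
            torch.tensor(self.labels[i]),
        )


def pad_collate(batch, pad_idx=0):
    """Pads each text column separately so the two stay distinct inputs."""
    qs, rs, nums, ys = zip(*batch)
    q = pad_sequence(list(qs), batch_first=True, padding_value=pad_idx)
    r = pad_sequence(list(rs), batch_first=True, padding_value=pad_idx)
    return (q, r, torch.stack(nums)), torch.stack(ys)


# Toy data: token ids per text part, one numerical (age), one label.
questions = [[5, 8, 2], [7, 3], [4, 9, 6, 1]]
replies = [[3, 3], [2, 5, 7], [8]]
ages = [[31.0], [45.0], [27.0]]
labels = [0, 1, 0]

ds = ConversationDataset(questions, replies, ages, labels)
loader = DataLoader(ds, batch_size=3, collate_fn=pad_collate)
(q, r, nums), y = next(iter(loader))
```

Padding each column on its own keeps the question and reply as separate tensors, so each can be run through the encoder one at a time.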

Thanks for the help regardless.

Hi. Did you ever figure this out? I’m hitting the same problem with ‘xtra’. Thanks.

Hi there. I hope I am not too late, but I figured out a way to combine tabular and text data and train them together (though it only works with one text column; you can work around this by combining all texts into one): Build mixed databunch and train end-to-end model for Tabular (categorical + continuous data) and Text data


See my article here: