Mixed Multi-Text + Numerical DataBunch


(Guido Tapia) #1

I have a project where I have a single text vocab but multiple sentences per sample, i.e. I get two parts of a conversation (question and reply, plus names). I also have some numerical data (age, for instance). So I want to build a classifier that uses a language-model encoder to encode the two sentences and adds in the numerical information.

This is simple in plain PyTorch: get the embeddings from the language-model encoder (one text part at a time), concatenate them with the numericals, and pass the result to a dense head for classification.
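A minimal sketch of that plain-PyTorch idea, assuming the encoder outputs have already been pooled to fixed-size vectors (the class name, dimensions, and head sizes here are illustrative, not from any library):

```python
import torch
import torch.nn as nn

class MixedTextNumClassifier(nn.Module):
    """Hypothetical head: concatenate two pooled sentence encodings
    with the numerical features, then classify with a small MLP."""
    def __init__(self, enc_dim, n_num, n_classes):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * enc_dim + n_num, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, enc_question, enc_reply, numericals):
        # enc_question / enc_reply: (batch, enc_dim) pooled encoder outputs
        # numericals: (batch, n_num) continuous features, e.g. age
        x = torch.cat([enc_question, enc_reply, numericals], dim=1)
        return self.head(x)

model = MixedTextNumClassifier(enc_dim=400, n_num=3, n_classes=2)
out = model(torch.randn(4, 400), torch.randn(4, 400), torch.randn(4, 3))
print(out.shape)  # torch.Size([4, 2])
```

The two text parts would each be run through the (shared) language-model encoder separately before being handed to this head.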

I’m trying to do this the ‘fastai’ way, however, and want to create a DataBunch. None of the text DataBunches support multiple text columns (they just concatenate them into a single list, which is bad for me), and the data_block API does not appear to allow me to create a mixed-mode DataBunch.

Does anyone have an example or site I can look at to see how to implement a mixed text (multiple text columns) + numerical model? It would be something like the Tabular DataBunch/models that handle categoricals and numericals.

Thanks


#2

You will have to build a custom ItemList for this. There is a tutorial here and a great blog post there.


(Guido Tapia) #3

Thanks @sgugger, looking into this now. The first tutorial link is broken, but I assume it’s this one:
https://docs.fast.ai/tutorial.itemlist.html

In any case, thanks for the info; it looks like it’s exactly what I’m after.


(Guido Tapia) #4

Not gonna lie, I’m finding building a mixed-mode ItemList with some text columns incredibly difficult. By default all columns are preprocessed with a single set of pre-processors, so I assume I need to override the process function. OK, I did that, but then everything fails when adding the test set. Basically every error leads back to some convoluted internal state or weird dependency: what is xtra, for instance, and why is my ItemList magically being constructed with that parameter when calling add_test? Why is my item being created with a NumPy array when I’m passing a pandas Series into the constructor; where did that get intercepted? Who actually calls the process function? (Searching doesn’t help due to the convoluted inheritance hierarchy.)

Anyway, I think I will not do this; I’ll just create a Dataset/DataLoader and use fastai for the training loop. Which is a shame, as now I have to worry about all the NLP boilerplate (tokenization, numericalization, etc.) which has to interact nicely with the fastai language-model vocab/encoder.
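The Dataset/DataLoader fallback could look something like this minimal sketch, assuming the texts have already been tokenized and numericalized against the language model’s vocab upstream (names and shapes here are made up for illustration; padding/collation of variable-length sequences is omitted for brevity):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ConversationDataset(Dataset):
    """Hypothetical Dataset yielding (question ids, reply ids, numericals)
    plus a label, so a mixed text+numerical model can consume them."""
    def __init__(self, question_ids, reply_ids, numericals, labels):
        self.question_ids = question_ids  # list of LongTensors of token ids
        self.reply_ids = reply_ids        # list of LongTensors of token ids
        self.numericals = torch.tensor(numericals, dtype=torch.float)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return (self.question_ids[i], self.reply_ids[i], self.numericals[i]), self.labels[i]

# toy usage with fixed-length token ids so the default collate_fn works
ds = ConversationDataset(
    question_ids=[torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6])],
    reply_ids=[torch.tensor([7, 8, 9]), torch.tensor([2, 1, 0])],
    numericals=[[35.0], [42.0]],
    labels=[0, 1],
)
(q, r, num), y = ds[0]
print(q.tolist(), num.tolist(), y.item())  # [1, 2, 3] [35.0] 0
dl = DataLoader(ds, batch_size=2)
```

For real data with variable-length sentences you would add a custom `collate_fn` that pads each text part per batch.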

Thanks for the help regardless.