Predict on new multi-text columns sentiment data

If I build a text classifier containing several text_cols, e.g.

data_clas = TextClasDataBunch.from_csv(path, "data.csv", text_cols=["title", "text"], label_cols="target", vocab=data_lm.train_ds.vocab, bs=32)

learn = text_classifier_learner(data_clas, drop_mult=0.5)
learn.load_encoder('ft_enc')

learn.fit_one_cycle(1, 1e-2)

How can I then use it to predict on new, previously unseen data? (I don’t have access to that data at the time I build the classifier.) There is a learn.predict(), but I don’t know how to use it when I have multiple text columns (title and text in the example above).

So the best I came up with myself was to call " ".join() on the title and the text of the new, previously unseen data and use the result as input to the predict() method. If anyone has a better or cleaner way to do this, please let me know.
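For reference, the join described above could look roughly like the sketch below. The helper name join_fields and the dict-style row are made up for illustration; only learn.predict() and the title/text column names come from the question.

```python
def join_fields(row, text_cols=("title", "text")):
    """Join the text columns of one record into a single string.

    `row` can be a dict or a pandas Series; the column order should match
    the text_cols order used when the classifier was built.
    """
    return " ".join(str(row[col]) for col in text_cols)


# Hypothetical unseen record, for illustration only.
new_row = {"title": "Great film", "text": "I enjoyed every minute of it."}
joined = join_fields(new_row)

# With a trained learner in scope, the joined string would be passed on:
# prediction = learn.predict(joined)
```

Whether this matches what the DataBunch did at training time depends on how the columns were joined there, so it is worth checking against a few rows with known labels.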

predict() takes a lot of time for me. I put all my texts into a list of around 10,000 items, and as of typing this it still hasn’t finished predicting. Did your prediction finish, and how much time did it take?

I have the same issue now, a couple of years later. For anyone reading this thread: I’m not confident that just joining the two text fields will give the same result, because I can see that the dataloaders insert xxfld 1 and xxfld 2 into the text to indicate the fields, but I don’t know how to do that in the prediction function.

For fastai==2.5.3, it looks like this is done by _join_texts, at fastai.text.core:197 or thereabouts. It looks like it really does just add xxfld with an increasing index for each text column (i.e. xxfld 1, xxfld 2, etc.), so that’s what I’m going to do.
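A minimal sketch of that marking step, assuming the marker is the literal string xxfld followed by a 1-based index and placed in front of each field's text. The function name join_with_field_markers is hypothetical, and this is a re-implementation for illustration, not fastai's own code.

```python
FLD = "xxfld"  # field marker token, as seen in the dataloader output


def join_with_field_markers(fields):
    """Prefix each field with 'xxfld <n>' (n starting at 1) and join with spaces."""
    return " ".join(f"{FLD} {i} {text}" for i, text in enumerate(fields, start=1))


# Hypothetical title and text, for illustration only.
marked = join_with_field_markers(["Great film", "I enjoyed every minute of it."])

# With a trained learner in scope, the marked string would be passed on:
# prediction = learn.predict(marked)
```

The field order has to match the text_cols order used at training time, otherwise the indices end up on the wrong fields.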

I’m hoping that these fields won’t be stripped out before numericalisation. I might double-check, either by comparing my numericalised inputs to the dataloader’s or by just checking that my performance matches expectations on the validation set.
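The first check could be sketched as below. It works on plain lists of token ids and a plain list vocab, so how the ids are pulled out of the learner and the dataloader is deliberately left out; the helper names and the toy vocab are invented for illustration.

```python
FLD = "xxfld"


def decode_ids(ids, vocab):
    """Map numericalised ids back to tokens using a list-like vocab."""
    return [vocab[i] for i in ids]


def field_markers(tokens):
    """Return the token that follows each xxfld marker, in order of appearance."""
    return [tokens[i + 1] for i, tok in enumerate(tokens[:-1]) if tok == FLD]


def same_numericalisation(mine, reference):
    """True if two id sequences are identical, element by element."""
    return list(mine) == list(reference)


# Toy vocab and ids, standing in for the real ones.
toy_vocab = ["xxunk", "xxpad", "xxbos", "xxfld", "1", "2", "great", "film"]
my_ids = [2, 3, 4, 6, 3, 5, 7]
dataloader_ids = [2, 3, 4, 6, 3, 5, 7]

tokens = decode_ids(my_ids, toy_vocab)
markers_found = field_markers(tokens)
ids_match = same_numericalisation(my_ids, dataloader_ids)
```

If the markers come back as 1 and 2 and the ids match the dataloader's for the same row, the fields survived numericalisation.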