ULMFiT for Questions and Answers


For a competition, I’m given a set of questions, a set of paragraphs (with associated metadata, such as the chapter each paragraph comes from), and a smallish set of question/paragraph pairs that were handpicked as training data. My goal is to predict, for every question, the top 5 most relevant paragraphs (the test set). The test set contains an entry for every possible question/paragraph combination, so it has 260k entries, and almost all of them should get label 0 (meaning not among the top 5 matches).

As there are only 600 training pairs, all of them with label 1, the training set is as imbalanced as it can be: it contains a single class. To correct that, I’ve added training pairs with label 0. These are generated by pairing each training question with paragraphs from chapters other than the one its known-good paragraph comes from. I’ve added roughly 40x as many label 0 examples as label 1 examples.
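A minimal sketch of the negative sampling described above, in plain Python. The function name, the `paragraph_chapter` mapping and the `seed` parameter are hypothetical names introduced for illustration. Skipping paragraphs that are already known positives for the same question is an extra assumption that the description above does not mention.

```python
import random

def build_training_pairs(positives, paragraph_chapter, neg_per_pos=40, seed=0):
    """Return (question_id, paragraph_id, label) rows.

    positives:         list of hand-picked (question_id, paragraph_id) pairs (label 1)
    paragraph_chapter: dict mapping paragraph_id -> chapter (hypothetical layout)
    """
    rng = random.Random(seed)  # fixed seed so the sampled negatives are reproducible

    # Collect the known-good paragraphs per question so they are never
    # sampled as negatives (an assumption, not stated in the post).
    known = {}
    for qid, pid in positives:
        known.setdefault(qid, set()).add(pid)

    rows = []
    for qid, pid in positives:
        rows.append((qid, pid, 1))
        chapter = paragraph_chapter[pid]
        # Negative candidates: paragraphs from any other chapter.
        candidates = [
            p for p, ch in paragraph_chapter.items()
            if ch != chapter and p not in known[qid]
        ]
        for neg in rng.sample(candidates, min(neg_per_pos, len(candidates))):
            rows.append((qid, neg, 0))
    return rows
```

Note that the sampled pairs are only presumed to be bad matches. A paragraph from another chapter could still answer the question, so some label 0 rows may be noisy.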

For the next part, I’ve followed along with https://docs.fast.ai/text.html.

The main question, however, is: how do you structure the data, in order to make a valid classifier for a Question & Answer task?

My approach: add an additional column containing full_text = question + ' / ' + paragraph (where the paragraph is the candidate answer).
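The concatenation above can be sketched as follows. This is an illustration only: the helper names and the two-column layout (label first, text second) are assumptions. As far as I know, the fastai v1 `from_csv` helpers default to `label_cols=0` and `text_cols=1`, but that is worth checking against the installed version.

```python
import csv
import io

def make_full_text(question, paragraph, sep=" / "):
    """Join a question and a candidate paragraph into one text field."""
    return question + sep + paragraph

def rows_to_csv(rows):
    """Render (label, question, paragraph) rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["label", "full_text"])  # hypothetical column names
    for label, question, paragraph in rows:
        writer.writerow([label, make_full_text(question, paragraph)])
    return buf.getvalue()
```

The returned string can be written to `train.csv`. The `csv` module quotes any field that contains commas or newlines, which matters for paragraph text.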

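    from fastai.text import *  # fastai v1 text API, as used in the linked tutorial

    # Both DataBunches read the same CSV; the classifier reuses the language model's vocab.
    data_lm = TextLMDataBunch.from_csv(path, 'train.csv', min_freq=1)
    data_clas = TextClasDataBunch.from_csv(path, 'train.csv', vocab=data_lm.train_ds.vocab, bs=32)

    # Fine-tune the pretrained AWD_LSTM language model on the competition text.
    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
    learn.fit_one_cycle(1, 1e-2)
    learn.unfreeze()  # as in the tutorial, before the lower-learning-rate pass
    learn.fit_one_cycle(1, 1e-3)
    learn.save_encoder('ft_enc')  # keep the fine-tuned encoder for the classifier

I think the language model learning works pretty well here.

    learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
    learn.load_encoder('ft_enc')  # without this, the fine-tuning above is not used
    learn.fit_one_cycle(1, 1e-2)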


Also the classification learner does pretty well!

Now for the sad part:
Prediction on 260k entries takes forever (15+ hours), and never classifies anything as label 1.


  • Is this a valid approach for constructing a classifier (i.e. concatenating question and answer, and labeling known good and bad pairs as 1 and 0)?
  • Is there a way to speed up prediction? Currently I’m looping through the test set and calling learn.predict on every row.



Alright, I’ve made progress, but am still at the same (terrible) prediction level.

I’ve now been able to add the test set to the DataBunch as well, using:

    data_lm = TextLMDataBunch.from_csv(path, 'train.csv', valid_pct=0.2, min_freq=2)
    data_clas = TextClasDataBunch.from_csv(path, 'train.csv', test='test.csv', header=None, vocab=data_lm.train_ds.vocab, bs=bs)

This speeds up predictions tremendously, using:

    predictions = learn.get_preds(DatasetType.Test, ordered=True)
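Since the goal is the top 5 paragraphs per question, one option is to rank the candidates by predicted probability instead of thresholding each pair on its own. With about 40 label 0 examples for every label 1 example, a per-pair argmax may rarely or never pick label 1, which could be one reason for the all-zero predictions. Below is a minimal sketch in plain Python. It assumes the probability of label 1 has already been extracted for each test row (for example from the second column of the first tensor that get_preds returns, if the class order is [0, 1]), and that each row can be mapped back to its question and paragraph IDs. The function name and the input layout are hypothetical.

```python
from collections import defaultdict
import heapq

def top_k_per_question(scored_pairs, k=5):
    """Pick the k highest-scoring paragraphs for every question.

    scored_pairs: iterable of (question_id, paragraph_id, p_label_1) tuples
    Returns a dict: question_id -> list of paragraph_ids, best first.
    """
    by_question = defaultdict(list)
    for qid, pid, p_pos in scored_pairs:
        by_question[qid].append((p_pos, pid))

    return {
        qid: [pid for _, pid in heapq.nlargest(k, cands, key=lambda c: c[0])]
        for qid, cands in by_question.items()
    }
```

The pairs returned for a question would get label 1 and all others label 0, so every question ends up with at most k positive predictions regardless of how small the raw probabilities are.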

Which leaves me with a more fundamental question:

How do you structure your data from a question and answer dataset into a classifier that makes sense?