This is what I’m doing right now to run text classification predictions on a given .csv file. It works, but it’s slow (about 30 minutes to process 20k rows).
```python
test_df = pd.read_csv(path/'20181114_data.csv')

# Reuse the language model's vocab so token ids line up with the trained model
txt_proc = [
    TokenizeProcessor(tokenizer=None),
    NumericalizeProcessor(vocab=data_lm.vocab)
]
test_txt_list = (TextList
    .from_df(test_df, path=path, cols=['text'], processor=txt_proc)
    .process())

learn.model = learn.model.to('cpu')
learn.model = learn.model.eval()

test_results = []
with torch.no_grad():
    for i, doc in enumerate(test_txt_list.items):
        if i % 10000 == 0: print(i)
        # batch of one: unsqueeze to (seq_len, 1) since the RNN expects (seq, batch)
        probs, raw_outputs, outputs = learn.model(tensor(doc).unsqueeze(1))
        test_results.append({
            'probs': probs[0].tolist(),
            'prediction': np.argmax(probs, axis=1).item()
        })
```
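Most of the slowness above likely comes from feeding documents through the model one at a time. A common fix is to batch the documents, padding each batch's sequences to a common length; sorting by length first keeps the padding overhead small. Here is a minimal sketch of that idea in plain NumPy (the pad id of `1` and the front-padding convention mirror fastai's defaults, but both are assumptions — check your vocab):

```python
import numpy as np

PAD_ID = 1  # fastai's default padding token id (assumption)

def pad_batch(docs, pad_id=PAD_ID):
    """Front-pad a list of token-id sequences to a common length.

    Padding at the front keeps the real end of each document aligned
    with the final hidden state of the RNN (assumption about the model).
    """
    max_len = max(len(d) for d in docs)
    batch = np.full((len(docs), max_len), pad_id, dtype=np.int64)
    for i, d in enumerate(docs):
        batch[i, max_len - len(d):] = d
    return batch

def batches(items, bs=64):
    """Yield (original_indices, padded_batch) pairs, longest-last.

    Sorting by length minimizes wasted padding; the indices let you
    scatter predictions back into the original row order afterwards.
    """
    order = sorted(range(len(items)), key=lambda i: len(items[i]))
    for start in range(0, len(order), bs):
        idxs = order[start:start + bs]
        yield idxs, pad_batch([items[i] for i in idxs])
```

Each padded batch can then be transposed to (seq, batch) and passed to the model in one forward call, e.g. `learn.model(tensor(batch).t())`, writing each prediction back to `results[idx]` so the output lines up with the .csv rows.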
If there is a better way, I’d love to get some feedback. In particular:

- Is there a way to load just the vocab from the LM without loading everything else?
- Is there a way to make this faster?
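On the vocab question: since a fastai v1 `Vocab` is essentially a wrapper around its `itos` id-to-token list, one option is to pickle just that list after training and load it back in the prediction script, skipping the rest of the LM data. A minimal round-trip sketch (the `itos` list here is a stand-in for `data_lm.vocab.itos`, and rebuilding via `Vocab(...)` is an assumption about the fastai v1 API):

```python
import os
import pickle
import tempfile

# Stand-in for data_lm.vocab.itos (assumption: the Vocab is just a
# wrapper around this id->token list).
itos = ['xxunk', 'xxpad', 'the', 'cat']

vocab_path = os.path.join(tempfile.gettempdir(), 'vocab_itos.pkl')

# After training: persist only the itos list.
with open(vocab_path, 'wb') as f:
    pickle.dump(itos, f)

# In the prediction script: load it back without touching the LM data.
with open(vocab_path, 'rb') as f:
    loaded_itos = pickle.load(f)
# You would then rebuild the processor with Vocab(loaded_itos)
# and pass it to NumericalizeProcessor(vocab=...) as before.
```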
Periodically receiving a .csv to make predictions on is pretty much how I imagine a system I’m building for work will operate, so any improvements/recommendations are welcome.
How would you get predictions for your test set (here represented by `test_cls`)?
If I understand things correctly, the code here assigns the test dataset to the DataBunch’s training dataset, which I don’t see a way to get predictions for. Also, that dataset will be shuffled, so we won’t be able to align the predictions with the raw data unless I’m missing something.
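Even if the loader shuffles, the alignment problem goes away if you can record each example's original row index next to its prediction and scatter the results back afterwards. A tiny sketch of the idea (it assumes the loader exposes the visited indices, e.g. via its sampler, which may not hold for the training DataLoader here):

```python
# (original_row_index, prediction) pairs, in the shuffled visit order.
# Assumption: the loader's sampler lets you recover these indices.
visited = [(2, 'pos'), (0, 'neg'), (1, 'pos')]

# Scatter predictions back into the original row order of the .csv.
preds_in_order = [None] * len(visited)
for idx, pred in visited:
    preds_in_order[idx] = pred
```

After this, `preds_in_order[i]` corresponds to row `i` of the raw data, regardless of the visit order.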