This is what I’m doing right now to run text classification predictions on a given .csv file. It works, but it’s slow (about 30 minutes to process 20k rows).
```python
test_df = pd.read_csv(path/'20181114_data.csv')

# Reuse the language model's vocab so token ids line up with the trained model
txt_proc = [
    TokenizeProcessor(tokenizer=None),
    NumericalizeProcessor(vocab=data_lm.vocab)
]
test_txt_list = (TextList
    .from_df(test_df, path=path, cols=['text'], processor=txt_proc)
    .process())

learn.model = learn.model.to('cpu')
learn.model = learn.model.eval()

test_results = []
with torch.no_grad():
    for i, doc in enumerate(test_txt_list.items):
        if i % 10000 == 0: print(i)
        # batch of one: unsqueeze to (seq_len, 1) since the RNN expects (seq, batch)
        probs, raw_outputs, outputs = learn.model(tensor(doc).unsqueeze(1))
        test_results.append({
            'probs': probs[0].tolist(),
            'prediction': np.argmax(probs, axis=1).item()
        })
```
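Most of the slowness above likely comes from feeding documents through the model one at a time. A common fix is to batch the documents, padding each batch's sequences to a common length; sorting by length first keeps the padding overhead small. Here is a minimal sketch of that idea in plain NumPy (the pad id of `1` and the front-padding convention mirror fastai's defaults, but both are assumptions — check your vocab):

```python
import numpy as np

PAD_ID = 1  # fastai's default padding token id (assumption)

def pad_batch(docs, pad_id=PAD_ID):
    """Front-pad a list of token-id sequences to a common length.

    Padding at the front keeps the real end of each document aligned
    with the final hidden state of the RNN (assumption about the model).
    """
    max_len = max(len(d) for d in docs)
    batch = np.full((len(docs), max_len), pad_id, dtype=np.int64)
    for i, d in enumerate(docs):
        batch[i, max_len - len(d):] = d
    return batch

def batches(items, bs=64):
    """Yield (original_indices, padded_batch) pairs, longest-last.

    Sorting by length minimizes wasted padding; the indices let you
    scatter predictions back into the original row order afterwards.
    """
    order = sorted(range(len(items)), key=lambda i: len(items[i]))
    for start in range(0, len(order), bs):
        idxs = order[start:start + bs]
        yield idxs, pad_batch([items[i] for i in idxs])
```

Each padded batch can then be transposed to (seq, batch) and passed to the model in one forward call, e.g. `learn.model(tensor(batch).t())`, writing each prediction back to `results[idx]` so the output lines up with the .csv rows.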
If there is a better way, I’d love to get some feedback. In particular:

- Is there a way to load just the vocab from the LM without loading everything else?
- Is there a way to make this faster?
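On the vocab question: since a fastai v1 `Vocab` is essentially a wrapper around its `itos` id-to-token list, one option is to pickle just that list after training and load it back in the prediction script, skipping the rest of the LM data. A minimal round-trip sketch (the `itos` list here is a stand-in for `data_lm.vocab.itos`, and rebuilding via `Vocab(...)` is an assumption about the fastai v1 API):

```python
import os
import pickle
import tempfile

# Stand-in for data_lm.vocab.itos (assumption: the Vocab is just a
# wrapper around this id->token list).
itos = ['xxunk', 'xxpad', 'the', 'cat']

vocab_path = os.path.join(tempfile.gettempdir(), 'vocab_itos.pkl')

# After training: persist only the itos list.
with open(vocab_path, 'wb') as f:
    pickle.dump(itos, f)

# In the prediction script: load it back without touching the LM data.
with open(vocab_path, 'rb') as f:
    loaded_itos = pickle.load(f)
# You would then rebuild the processor with Vocab(loaded_itos)
# and pass it to NumericalizeProcessor(vocab=...) as before.
```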
Periodically receiving a .csv to make predictions on is pretty much how I imagine a system I’m building for work will operate, so any improvements/recommendations are welcome.
How would you get predictions for your test set (here represented by `test_cls`)?
If I understand things correctly, the code here assigns the test dataset to the DataBunch’s training dataset, which I don’t see a way to get predictions for. Also, that dataset will be shuffled, so we won’t be able to align the predictions with the raw data unless I’m missing something.
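Even if the loader shuffles, the alignment problem goes away if you can record each example's original row index next to its prediction and scatter the results back afterwards. A tiny sketch of the idea (it assumes the loader exposes the visited indices, e.g. via its sampler, which may not hold for the training DataLoader here):

```python
# (original_row_index, prediction) pairs, in the shuffled visit order.
# Assumption: the loader's sampler lets you recover these indices.
visited = [(2, 'pos'), (0, 'neg'), (1, 'pos')]

# Scatter predictions back into the original row order of the .csv.
preds_in_order = [None] * len(visited)
for idx, pred in visited:
    preds_in_order[idx] = pred
```

After this, `preds_in_order[i]` corresponds to row `i` of the raw data, regardless of the visit order.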