Fastai text - get_preds() vs predict() in inference


I have a classifier model trained with fastai text in v1.0.39 and I’m trying to run inference on a large number of sentences (> 1M). I’m not using a GPU, and I noticed that if I just loop through all sentences and get the model output with predict(), it is way slower than using get_preds() on large batches of data (~30-50k at a time). The problem I’m facing is that if I cross-check the output from get_preds() for a given sentence against the corresponding output from predict(), the classification is often different.
In my case the model is a binary classifier where I expect the large majority of the data to be negative. The positive samples from get_preds() often get classified as negative by a single call to predict().

I’m using get_preds() in the following way:

# build the DataBunch for the test set, reusing the training vocab
data_test = TextClasDataBunch.from_csv(path, 'train.csv', test='test.csv', label_cols='label', vocab=vocab)
ll = text_classifier_learner(data_test)

# ordered=True returns predictions in the same order as the input
preds = ll.get_preds(ds_type=DatasetType.Test, ordered=True)

I then extract the positive samples:

preds_conv = [v.numpy() for v in preds[0]]
index_positive = [i for i, v in enumerate(preds_conv) if v[1] >= 0.5]

and run predict() on those as a cross-check. Often, they now get classified as negative.
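A minimal sketch of that cross-check, with the fastai-specific call factored out: `single_pred_fn` is a hypothetical wrapper that would call `learn.predict()` on one sentence and return its class probabilities, so the comparison logic itself can be tested in isolation.

```python
def cross_check(batch_probs, single_pred_fn, threshold=0.5):
    """Compare batch probabilities (as returned by get_preds) against a
    per-item predictor (as returned by predict).

    batch_probs    : list of [p_negative, p_positive] pairs
    single_pred_fn : callable taking an index, returning the same pair
                     for that sample (e.g. wrapping learn.predict())
    Returns the indices that the batch pass calls positive but the
    per-item pass calls negative.
    """
    positives = [i for i, p in enumerate(batch_probs) if p[1] >= threshold]
    return [i for i in positives if single_pred_fn(i)[1] < threshold]
```

In the real setting `single_pred_fn` would look up the raw sentence and run `learn.predict()` on it; any nonempty return value indicates the two inference paths disagree.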

Am I doing anything wrong here?



It looks like you’ve been careful about the order, but I’d definitely want to rule it out.

How do you do that?

If you score all the test data both ways, what percentage comes out positive each way?

I keep a separate copy of the test data set and run predict() on the indexed elements that get_preds() returns as positive. About 50% of the time, predict() confirms the classification as positive.
Regarding the second question, since predict() is really slow for me, I ran a similar test on 100k samples: get_preds() returned 2 positives, 1 of which was also classified as positive by predict(). The other one came out as a discrepancy, and there was one different sample classified as positive by predict() (but not by get_preds()). So in this small test case, 2 positives in each direction, but not fully overlapping. I can repeat the test on more data.
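A small sketch of that both-ways comparison, assuming you have already collected the probability arrays from each inference path (e.g. stacking `get_preds()` output and per-sentence `predict()` output into two NumPy arrays of shape `(n, 2)`; the function name is hypothetical):

```python
import numpy as np

def positive_rates(probs_a, probs_b, threshold=0.5):
    """Compare two (n, 2) probability arrays for the same samples.

    Returns the % of samples classified positive by each path and the
    number of samples both paths agree are positive."""
    pos_a = probs_a[:, 1] >= threshold  # boolean mask, path A positives
    pos_b = probs_b[:, 1] >= threshold  # boolean mask, path B positives
    return {
        "pct_a": pos_a.mean() * 100,
        "pct_b": pos_b.mean() * 100,
        "overlap": int(np.logical_and(pos_a, pos_b).sum()),
    }
```

If the two paths were consistent, the two percentages would match and the overlap would equal the positive count; a low overlap like the one described above points at the inputs being processed differently, not at threshold noise.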



In case you are still facing this issue, have a look at this thread.
Copying or reloading the test data before running predict() seems to align the prediction categories.
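A sketch of that workaround idea: keep an untouched copy of the raw test sentences and feed predict() from that copy, rather than from rows that have already passed through the DataBunch pipeline. The sentences and the commented-out learner call are placeholders; only the copying step is shown.

```python
import copy

# Hypothetical raw test sentences, kept separate from the DataBunch
raw_sentences = ["first test sentence", "second test sentence"]

# Deep-copy so predict() sees data untouched by any prior processing
fresh = copy.deepcopy(raw_sentences)

# for s in fresh:
#     learn.predict(s)  # fastai call, requires a trained learner
```

The copy is cheap relative to inference time, and per the linked thread it reportedly makes the predict() categories line up with get_preds().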

Have you figured it out? I am currently facing the same issue: predict() runs extremely slowly (I have 805,000 predictions to check) and get_preds() returns nonsense.