What's the canonical way to yield predictions from the RNNLearner in the IMDB ipynb?

dgaff · October 9, 2018, 10:11pm

Hey all! I’ve been working through the materials all day, trying to set up a learner for text classification problems. In particular, I’ve been focused on the tutorial notebook that shows how to build an RNNLearner classifier on the IMDB corpus: https://github.com/fastai/fastai/blob/master/examples/text.ipynb. I was surprised to see that the documentation consistently stops short of showing how to take a model trained in fastai, apply it to unseen strings, and get predictions for those strings. As a practitioner, that’s the only thing I’m actually interested in. I opened a ticket (https://github.com/fastai/fastai/issues/873) asking whether a particular setup for classifying new texts, cribbed from https://github.com/fastai/fastai/blob/75e8ae03466b55e204ef8b2a314bd58ca7c7f96e/courses/dl2/imdb_scripts/predict_with_classifier.py, is “canonical” in some sense. Does anyone have a better solution than mine?

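from fastai.text import *
from torch.autograd import Variable

def softmax(x):
    # Numerically stable row-wise softmax; a 1-D input is treated as a single row.
    if x.ndim == 1:
        x = x.reshape((1, -1))
    # Subtract each row's max before exponentiating to avoid overflow in np.exp.
    max_x = np.max(x, axis=1).reshape((-1, 1))
    exp_x = np.exp(x - max_x)
    return exp_x / np.sum(exp_x, axis=1).reshape((-1, 1))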

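# Build the classifier and put it in inference mode.
model = learn.classifier(data_clas).model
model.reset()  # clear the RNN hidden state
model.eval()   # disable dropout
df = pd.read_csv(IMDB_PATH/'valid.csv', header=None)
stoi = data_lm.train_ds.vocab.stoi  # token -> integer id mapping
tokenizer = Tokenizer()  # create once instead of once per text
results = []
for text in df[1]:
    # Tokenize a single text; process_all expects a list.
    tok = tokenizer.process_all([text])
    encoded = [stoi[p] for p in tok[0]]
    # Shape the ids as (sequence_length, batch_size=1), as in the original script.
    ary = np.reshape(np.array(encoded), (-1, 1))
    tensor = torch.from_numpy(ary)
    with torch.no_grad():  # no gradients needed for inference
        predictions = model(tensor)
    # The first element of the model output holds the class scores.
    numpy_preds = predictions[0].data.numpy()
    score = softmax(numpy_preds[0])[0].tolist()
    print(score)
    results.append(score)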
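
For anyone who wants to sanity-check the non-fastai parts of this in isolation, here is a minimal sketch of the two NumPy steps: turning tokens into a `(sequence_length, 1)` array of ids, and converting raw scores into probabilities. The `toy_stoi` vocabulary, the `encode` helper, and the example scores are made up for illustration and are not part of fastai. The real `stoi` comes from the data object, and I am assuming it falls back to a default id for unknown tokens the way this `defaultdict` does.

```python
from collections import defaultdict

import numpy as np


def softmax(x):
    # Numerically stable row-wise softmax; a 1-D input is treated as one row.
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x.reshape((1, -1))
    max_x = np.max(x, axis=1).reshape((-1, 1))
    exp_x = np.exp(x - max_x)
    return exp_x / np.sum(exp_x, axis=1).reshape((-1, 1))


# Hypothetical toy vocabulary: unknown tokens map to id 0.
toy_stoi = defaultdict(int, {"this": 1, "movie": 2, "was": 3, "great": 4})


def encode(tokens, stoi):
    # Hypothetical helper: map tokens to ids, shaped (sequence_length, 1).
    return np.reshape(np.array([stoi[t] for t in tokens]), (-1, 1))


ids = encode(["this", "movie", "was", "surprisingly", "great"], toy_stoi)
print(ids.shape)  # (5, 1)

# Large scores would overflow a naive exp(); subtracting the max avoids that.
probs = softmax(np.array([1000.0, 1001.0]))
print(probs)
```

The softmax output always has one row per input row, which is why the loop above indexes `[0]` after calling it on a single prediction.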