How to interpret IMDB sentiment predictions?

jeremy · November 26, 2017, 9:01pm

FYI as I’m sure you’ve noticed, I haven’t used a test set with this class before - sorry about the shuffling thing! I’m working on tomorrow’s class at the moment so won’t be able to debug right away, but if you want to do so, try looking at how torchtext is handling this. I’m not sure if the issue is in torchtext, or just how I’m calling it.

Both torchtext and fastai are pretty simple code to read - hopefully it’ll be reasonably clear what’s going on. Let me know if I can help clarify anything!

rob · November 26, 2017, 9:06pm

@KevinB was able to get a submission into the Happiness competition, so I think he has something that works, and I don’t think it’s as complicated as we are making it. Perhaps he can enlighten us when he has a chance.

wgpubs · November 26, 2017, 9:07pm

I think the issue is in torchtext.

If you look at the source code for BucketIterator here, you can see that it always sorts (even if you set sort=False). That setting simply shifts the sorting to happen within the batches.
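To illustrate why the sorting matters for test-set predictions, here is a minimal sketch in plain Python (not torchtext's actual code). It assumes examples are reordered by length before prediction, and shows that keeping each example's original index is enough to put the predictions back in dataset order. `fake_predict` and the sample texts are hypothetical stand-ins.

```python
# Hypothetical stand-in for a model: "predicts" the number of tokens in the text.
def fake_predict(text):
    return len(text.split())

texts = ["a b c", "a", "a b c d", "a b"]

# Sort example indices by length, roughly as a length-bucketing iterator might.
order = sorted(range(len(texts)), key=lambda i: len(texts[i].split()))

# Predictions come back in sorted order, not dataset order.
sorted_preds = [fake_predict(texts[i]) for i in order]

# Undo the sort: place each prediction back at its original index.
restored = [None] * len(texts)
for pos, i in enumerate(order):
    restored[i] = sorted_preds[pos]

print(sorted_preds)  # [1, 2, 3, 4]
print(restored)      # [3, 1, 4, 2]
```

The same idea applies if the iterator exposes the original indices, or if you carry an id column through the dataset.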

KevinB · November 26, 2017, 9:18pm

So all I did is use what Jeremy did in his lesson 4 notebook to predict what the sentence will be. I set my batchsize to 1 and I pulled the text from the CSV file directly. Then I just looped through those one at a time and tied them to a file. Then I just chose the top prediction and converted it from the index to the actual word. Is there any specific code/questions you are wondering about?

rob · November 26, 2017, 9:40pm

Can you share the code for how you loop through examples to do prediction one by one?

KevinB · November 26, 2017, 9:47pm

m = m3.model
m[0].bs = 1  # predict one example at a time
m.eval()     # turn off dropout
for i in range(tst.values[:,1].shape[0]):
    ss = tst["Description"][i]  # actual review text
    s = [spacy_tok(ss)]
    t = TEXT.numericalize(s)

    m.reset()  # reset hidden state before each example
    res,*_ = m(t)
    prediction = PH_LABEL.vocab.itos[to_np(torch.topk(res[-1], 1)[1])[0]]

rob · November 26, 2017, 10:08pm

Great thanks! I got it working.

I had previously missed the first line where you have to do m = m3.model before calling m(t). I guess that is easy to miss since m3 is the result of a call to get_model().

Anyway, good on you for getting this to work. It must mean you understand the code quite well.

runze · November 26, 2017, 11:05pm

This is much easier than trying to modify the TextData class! I also like that training/validation is completely decoupled with testing this way.

wgpubs · November 27, 2017, 1:38am

I think I’m getting close to solving this by replacing BucketIterator with Iterator.

I’m getting predictions from my test dataset, BUT there are fewer predictions than there are examples in the test dataset for some reason (8 fewer, to be specific). See below:

Any ideas about why I don’t have a prediction for every example?

jeremy · November 27, 2017, 2:25am

BucketIterator is important for sorting by length. Otherwise it can be very inefficient if you have stuff of different lengths (which is v common).

You’re missing some predictions because it always uses an integer multiple of batch size. So I guess bs=1 is important for that reason too!
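A quick arithmetic sketch of that effect, assuming the iterator simply drops the final partial batch. The dataset size and batch size here are illustrative numbers, not the ones from this thread.

```python
def predictions_returned(n_examples, bs):
    """Number of examples covered when only full batches are used."""
    return (n_examples // bs) * bs

n_examples = 1000  # illustrative dataset size (assumption)

for bs in (64, 1):
    covered = predictions_returned(n_examples, bs)
    print(bs, covered, n_examples - covered)
# 64 960 40
# 1 1000 0
```

With bs=1 every example falls in a full batch, so under this assumption nothing is dropped.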

wgpubs · November 27, 2017, 2:34am

Hmmm … ok.

So basically, we need BucketIterator for performance reasons … and BucketIterator is going to do some kind of sorting (either over the entire dataset or within each minibatch). So for validation and test datasets, we’re going to have to change our bs=1 in order for the results to come back in order.

Does that sound about right?

wgpubs · November 27, 2017, 3:11am

Changed bs=1 and that gets me 8,391 predictions (still missing 1)

jeremy · November 27, 2017, 4:37am

That missing 1 sounds like a bug, maybe?

I think doing bs=1 sounds like a pretty good solution really…

wgpubs · November 27, 2017, 5:33am

Yah I ultimately went with @KevinB’s approach above.

I was trying to modify the framework to make the behavior more similar to image classification … got close, but not close enough. I will at least have a PR coming your way to include an optional test dataset to TextData.from_splits() that works.

kevindewalt · April 16, 2018, 10:12pm

FWIW, this worked for me:

F.softmax(Variable(torch.from_numpy(pre_preds)), dim=1)
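For readers unsure what the softmax step does to the raw outputs, here is a plain-Python sketch of softmax along dim=1. It assumes the raw predictions are per-class scores with one row per example; the numbers are made up.

```python
import math

def softmax_rows(rows):
    """Apply a numerically stable softmax to each row (i.e. along dim=1)."""
    out = []
    for row in rows:
        m = max(row)  # subtract the row max to avoid overflow in exp
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Hypothetical raw scores: 2 examples x 2 classes.
raw = [[2.0, 0.0], [-1.0, 1.0]]
probs = softmax_rows(raw)

for row in probs:
    print([round(p, 3) for p in row], "sum =", round(sum(row), 3))
```

Each row now sums to 1, so the entries can be read as class probabilities.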

kevindewalt · April 17, 2018, 1:24am

I’m struggling to see how this approach works.

t = TEXT.numericalize(s)

always gives me Variable with all 0s.

Looking at the source code, TEXT.numericalize seems to expect a list of strings (or tokens), such as

['my',
 'students',
 'are',
 'wonderful',
 'young',
 'learners',
 '.',
 ' ',
 'they',
 'are',
 'entering',
 'school',
... ]

I get the same error in Jeremy’s IMDB lesson 4 notebook.
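A toy sketch of how an all-zero result can arise. This is not torchtext's implementation; it only assumes the common convention that unknown tokens map to index 0. If the input is not split into the token strings the vocabulary was built from, every lookup misses.

```python
# Toy vocabulary: index 0 is reserved for unknown tokens (an assumption that
# mirrors the common "<unk>" convention, not torchtext's actual code).
itos = ["<unk>", "<pad>", "my", "students", "are", "wonderful"]
stoi = {tok: i for i, tok in enumerate(itos)}

def numericalize(batch):
    """Map a batch (a list of token lists) to integer ids; unknown tokens -> 0."""
    return [[stoi.get(tok, 0) for tok in example] for example in batch]

tokens = ["my", "students", "are", "wonderful"]

good = numericalize([tokens])                        # one example, token strings
bad = numericalize([["my students are wonderful"]])  # whole sentence as one "token"

print(good)  # [[2, 3, 4, 5]]
print(bad)   # [[0]]
```

The same miss happens if the items are token objects rather than plain strings, since they are not keys in the string-to-index mapping.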

kevindewalt · April 17, 2018, 11:25am

Ok, just realizing Jeremy updated all of the NLP stuff a few months ago. Suggest waiting for Part 2 to be released …

mingray · April 20, 2018, 6:03am

FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

After building our ModelData object, it automatically fills the TEXT object with a very important attribute: TEXT.vocab. This is a vocabulary, which stores which words (or tokens) have been seen in the text, and how each word will be mapped to a unique integer id. We’ll need to use this information again later, so we save it.

You have to load the TEXT object first.

mingray · April 20, 2018, 6:04am

This is what I did to make it work

m=m3.model

test_str ="bitcoin is bad"
tokenized_str = spacy_tok(test_str)
token_lst = [tok.text.strip() for tok in tokenized_str]  # plain strings, one per token

t = TEXT.numericalize([token_lst])
print(t)

# Set batch size to 1
m[0].bs=1

# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
num_res = to_np(torch.topk(res[-1], 1)[1])[0]

print(num_res)

output:

Variable containing:
  0
  9
 93
[torch.cuda.LongTensor of size 3x1 (GPU 0)]

1
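To read that final number: it is the index of the highest-scoring class at the last time step, which still needs to be looked up in the label vocabulary (as in the itos lookup earlier in the thread). A minimal sketch in plain Python, where the scores and the label names are hypothetical placeholders:

```python
def top1(scores):
    """Return the index of the largest score (what topk(..., 1) selects)."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical scores for the last time step of one example, one entry per class.
last_step_scores = [-1.2, 0.7]

# Hypothetical label vocabulary; the real mapping lives in the label field's vocab.
label_itos = ["label_0", "label_1"]

idx = top1(last_step_scores)
print(idx, label_itos[idx])  # 1 label_1
```

Which sentiment each index stands for depends on how the label vocabulary was built, so check it rather than assuming an order.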

kevindewalt · April 20, 2018, 11:38am

Thanks, this is the missing step: