Fast.ai predict_array() for NLP classification

I am trying to classify IMDB comments of my own. I have tokenized my comments and converted them into IDs:

df_pred = pd.read_csv(LM_PATH/'predict.csv', header=None, chunksize=chunksize)
tok_pred, pred_labels = get_all(df_pred, 1)
pred_ids = np.array([[stoi[o] for o in p] for p in tok_pred])
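For context, stoi maps tokens to IDs; in the IMDB notebook it is built as a defaultdict so tokens missing from the training vocabulary fall back to 0 instead of raising a KeyError. A toy sketch with a made-up vocabulary (the itos list here is an assumption, not my real vocab):

```python
import collections

# toy vocabulary; a real stoi would come from the training corpus
itos = ['_unk_', '_pad_', 'the', 'movie', 'was', 'great']
stoi = collections.defaultdict(int, {s: i for i, s in enumerate(itos)})

tok_pred = [['the', 'movie', 'was', 'great'],
            ['the', 'movie', 'was', 'terrible']]  # 'terrible' is out of vocab

# unknown tokens fall back to index 0 ('_unk_') instead of raising KeyError
pred_ids = [[stoi[o] for o in p] for p in tok_pred]
```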

I’m trying to figure out the exact data structure/type required by the fast.ai predict_array() function.

I’ve tried multiple different types of arrays without any luck. Is there something I’m missing?

Always a good idea to post your code here so folks can see what is what.

My guess is that you are missing a dimension (right now you have a one-dimensional np array of lists … evaluate pred_ids and you’ll see what I’m talking about). You need to add an extra dimension:

np.expand_dims(pred_ids, axis=1)

Take a look at the output for one of your batches when you train. You’ll notice that a document isn’t numericalized across rows, but rather columns.
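To feed predict-time documents in that same layout, you’d pad them to a common length and transpose so each column is one document. A rough sketch (the pad index of 1 is an assumption based on the default fastai vocab, and the token IDs are shortened):

```python
import numpy as np

docs = [[40, 41, 42, 537, 637, 3],
        [40, 41, 42, 3]]
pad_idx = 1  # assumed padding index

max_len = max(len(d) for d in docs)
# left-pad each document to a common length
padded = np.array([[pad_idx] * (max_len - len(d)) + d for d in docs])

batch = padded.T  # shape (max_len, n_docs): one document per column
```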

Thanks @wgpubs

Below is pred_ids. It is two arrays of IDs (two IMDB comments) within an array:

[[40,
  41,
  42,
  ...
  537,
  637,
  3],
 [40,
  41,
  42,
  ...
  7,
  74,
  3]] 

I tried np.expand_dims(pred_ids, axis=1) to add a dimension as you suggested, but still no luck (see below):

array([[list([40, 41, 42, 39, 11, 6, ... 607, 23307, 3021, 637, 537, 637, 3])],
       [list([40, 41, 42, 39, 13, ... , 10181, 5, 55, 6, 616, 477, 7, 74, 3])]],
      dtype=object)

prediction = learn.predict_array(np.expand_dims(pred_ids, axis=1))

The error I receive is NotImplementedError, raised because the array’s dtype is object rather than the np.float required by the predict_array function in learner.py (and then core.py).

Take a look at the output for one of your batches when you train. You’ll notice that a document isn’t numericalized across rows, but rather columns.

Would you mind clarifying what you mean by this?

Sure.

After you create your ModelData object … take a look at one of your batches like so:

x, y = next(iter(trn_dl))
print(x.size(), x.type(), y.size(), y.type())

You can also just print(x) and print(y) if you like.

This is a helpful technique to a) ensure your ModelData object is working, b) verify that your dataloaders are returning the data in the correct format and type, and c) understand the data format your pytorch model is expecting.

I include that code in most of my notebooks after creating my dataloaders, as a check that all is well before proceeding to training.

  1. Make sure that all p’s in your tok_pred are the same length: pred_ids = np.array([[stoi[o] for o in p] for p in tok_pred])

  2. The output of pred_ids = np.array([[stoi[o] for o in p] for p in tok_pred]) should be an np.array of shape [num_sentences, num_tok_in_sentence], but as far as I can see you have an array of lists.
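That second point is easy to verify: np.array over lists of unequal length produces a one-dimensional object array, while equal lengths give a proper 2-D integer array. A quick check (newer NumPy requires dtype=object explicitly for ragged input):

```python
import numpy as np

ragged = np.array([[1, 2, 3], [4, 5]], dtype=object)  # unequal lengths
even = np.array([[1, 2, 3], [4, 5, 6]])               # equal lengths

print(ragged.shape, ragged.dtype)  # (2,) object -- an array of lists
print(even.shape, even.dtype)      # (2, 3) -- a real 2-D integer array
```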


@wgpubs thanks for the clarification. Your suggestion makes sense to me but I’m still having trouble:

x, y = next(iter(trn_dl))
print(x.size(), x.type(), y.size(), y.type())

returns:

torch.Size([482, 24]) torch.cuda.LongTensor torch.Size([24]) torch.cuda.LongTensor

Every iteration of trn_dl returns a x pytorch long tensor of torch.Size([ N, 24]) which I confirmed is what the pytorch model is expecting.

I also used the same code as in the fast.ai predict_array function and was able to confirm that the resulting data type and form are correct:

to_gpu(V(T(x)))

returns:

torch.cuda.LongTensor of size 129x24 (GPU 0)

When I try the prediction:

learn.model.eval()
to_np(learn.model(to_gpu(V(T(x)))))

However, the resulting list of arrays does not look like the classification I am expecting:

[array([[ 4.76909, -5.10486],
    [-7.02031,  7.0272 ],
    [ 4.62994, -4.94379],
    ...
    [ 4.42602, -4.64515],
    [-6.98176,  6.9849 ],  
    [ 6.46633, -6.73317],
    [ 5.12761, -5.3733 ],
    [-4.51221,  4.46605]], dtype=float32),
[array([[[-0.01546,  0.00402,  0.00247, ...,  0.21298, -0.02448,  0.00726],
      [-0.01546,  0.00402,  0.00247, ...,  0.21298, -0.02448,  0.00726],
      [-0.01546,  0.00402,  0.00247, ...,  0.21298, -0.02448,  0.00726],
      ...,
     [[-0.00816,  0.00167,  0.21515, ...,  0.00561,  0.06843,  0.15961],
      [-0.08275,  0.03771, -0.45695, ...,  0.15028,  0.08362,  0.02582],
      [ 0.0016 ,  0.00206,  0.01634, ...,  0.01263,  0.03239,  0.00477],
      ...,
      [-0.06923,  0.02377,  0.00858, ...,  0.01588,  0.04878,  0.03017],
      [-0.0011 ,  0.01422, -0.02462, ...,  0.02862,  0.08323,  0.00347],
      [-0.06961, -0.00985, -0.23971, ...,  0.33725,  0.01575,  0.00276]],

     [[ 0.00032, -0.00368,  0.00575, ...,  0.02081,  0.05157,  0.06115],
      [-0.05437,  0.01115, -0.10385, ...,  0.15507,  0.10765,  0.00231],
      [ 0.00508,  0.0113 ,  0.01106, ...,  0.02161,  0.06531,  0.00207],
      ...,
      [-0.01544,  0.03345,  0.02451, ...,  0.02918,  0.09549,  0.00268],
      [-0.00228,  0.01238, -0.00676, ...,  0.02629,  0.04236,  0.00015],
      [-0.02925, -0.00423, -0.25921, ...,  0.45078,  0.14508, -0.00259]]], dtype=float32)]]

The IMDB classification should result in an array of size two: one number reflecting the likelihood of a POS review and another reflecting a NEG review, right?

Hi @urmas.pitsi, why would pred_ids be np.array[num_sentences, num_tok_in_sentence]?

Shouldn’t pred_ids include the data that needs to be classified? (In this case some array of tokens representing the comment I want to have classified as POS or NEG)

As I understand it, we want to predict multiple samples at a time, so we should give the model an array of examples, where each row represents a sentence we would like to predict. If we send all sentences directly to the GPU for prediction, they need to be the same size: a tensor’s size is fixed, e.g. 100x50.
If our sample consists of variable-length sentences, then we could either loop over them one by one and send each to the predictor as-is, or pad/truncate them to a fixed size and predict in one go on that fixed-size array.
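The loop-one-by-one option can be sketched roughly like this. The model here is a stand-in so the sketch runs; with fastai 0.7 you would call learn.model.eval() and then to_np(learn.model(to_gpu(V(T(x))))) instead. The (seq_len, 1) reshape is the important part:

```python
import numpy as np

# stand-in for learn.model: returns fake (batch, n_classes) logits
def model(x):
    return np.zeros((1, 2))

sentence_ids = [[40, 41, 42, 3], [40, 41, 7, 74, 3]]

preds = []
for ids in sentence_ids:
    x = np.array(ids)[:, None]  # shape (seq_len, 1): one document per column
    preds.append(model(x)[0])

preds = np.stack(preds)         # (n_sentences, n_classes)
```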

I tried giving the model a tensor with one array to work around the problem you describe for the time being, but I’m still having issues: the model returns a list of arrays (see post above) instead of an array of size 2 (# of labels).

This doesn’t sound like something that can be solved with the fast.ai predict_array() function. It sounds like I need to use the predict function, but on a sample set instead of the test set? Would someone be able to confirm this?

Actually, your result could very well be what you’re looking for. I don’t know exactly what you’re doing, but my guess is that the first element of the result is the predictions: nr_samples x 2. If you softmaxed these by rows, you’d end up with 2 probabilities for each row: pos and neg. Could that be the case here?
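To check that interpretation, softmaxing each row turns the logit pairs into pos/neg probabilities. A sketch using the first two rows of the output above:

```python
import numpy as np

l_x = np.array([[ 4.76909, -5.10486],
                [-7.02031,  7.0272 ]])

# row-wise softmax (subtract the row max for numerical stability)
exp = np.exp(l_x - l_x.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)
# each row now sums to 1: the first row is almost entirely class 0,
# the second almost entirely class 1
```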

Hmm, it’s possible, but I don’t think so. Shouldn’t the returned prediction (Y) be two numbers (pos & neg) and nothing more, just like the Y labels used to train the model?

Hi @echan00,

I have adapted the IMDB notebook to make sentiment predictions on Twitter data and published it here: german2vec

I am not sure about the predict_array() function, but I have included some code to perform manual predictions on sentences (one sentence at a time).

#sentence to be classified
sen = "guten morgen schon novemberlicht da mag man gar nicht glauben was gestern wetter war"
idxs = np.array([[stoi[p] for p in sen.strip().split(" ")]])

#converting sentence into numerical representation
print('numeric representation: ' + str(idxs))
idxs = np.transpose(idxs)

#get predictions from model
p = m(VV(idxs))
print(sen + ': ' + str(to_np(torch.topk(p[0],1)[1])[0]))

Please have a look at the last section of the notebook (titled Experiments) for further information.

Hope this helps!

Matthias


Isn’t that what you are getting in the first array?

Take a look at the code for PoolingLinearClassifier. It returns 3 things (the first is the predictions): return l_x, raw_outputs, outputs
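So when calling the model directly you’d unpack the tuple and keep only the first element. A runnable sketch with a stand-in classifier (the real call would be learn.model(...) after learn.model.eval()):

```python
import numpy as np

# stand-in with the same return signature as fastai's PoolingLinearClassifier:
# (class logits, raw RNN activations, activations after dropout)
def fake_classifier(x):
    return np.array([[5.06974, -5.39295]]), None, None

l_x, raw_outputs, outputs = fake_classifier(None)  # keep only l_x
pred_class = int(l_x[0].argmax())                  # index of the top logit
```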


This makes a lot of sense.

To test, I predicted one comment with one sentence:

"This movie sucks. I would never come back to see this movie again. It is distasteful and just a complete waste of time."

And the first output l_x of the prediction is:

array([[    5.06974,    -5.39295],
       [    5.54907,    -5.61736],
       [-4907.2866 , -6237.5317 ],
       [   51.54634,    26.35349],
       [   11.07813,   -11.46861],
       [    8.99281,    -9.36761],
       [  156.28255,   134.86494],
       [    3.09114,    -3.43671],
       [   11.61319,   -11.29998],
       [   21.02145,    -8.87303],
       [   14.00674,   -13.49925],
       [   32.99422,     7.12977],
       [-3234.554  , -3277.3706 ],
       [    7.04145,    -6.41294],
       [ -696.333  ,  -696.00507],
       [   11.07813,   -11.46861],
       [    8.99281,    -9.36761],
       [    6.17078,    -6.43465],
       [    3.09114,    -3.43671],
       [    6.9118 ,    -7.50732],
       [   -5.05259,     5.08622],
       [    9.96617,   -10.12175],
       [    5.08797,    -5.19022],
       [ 1658.1726 ,  1640.629  ],
       [   17.19005,     0.4159 ],
       [ 3276.3281 ,  3045.2246 ],
       [  784.67566,   752.30774],
       [ -192.18279,  -197.83821],
       [-6243.359  , -6347.466  ],
       [    3.09114,    -3.43671]], dtype=float32)

Is it not a bit presumptuous to assume the first row [ 5.06974, -5.39295] is the prediction for that comment/sentence?

On another note, I really appreciate @wgpubs @urmas.pitsi working with me on this.


That would make perfect sense if you softmax this.
You’re welcome! I’m constantly in your position myself, struggling to interpret the results and fighting errors.


Thanks for sharing @MatthiasBachfischer. Awesome stuff.

I’m actually hoping to eventually build a classification model for entire texts. For example whether the document mentions/discusses a particular item. And if so, which category is that particular excerpt about. I imagine this a bit different compared to sentiment.

Would you mind sharing an example of what the variables idxs and p from your code look like? Interested in seeing the data type and size.

Here you go :wink:


Thanks. Your p[0] is a torch.FloatTensor of size 1x3, which looks like what I should be expecting also.

Great! This is exactly what I needed.
Thanks!

I built a multi-class (29 classes) classifier following the IMDB example, and it worked just like the binary classifier.