Fast.ai predict_array() for NLP classification

I am trying to classify IMDB comments of my own. I have tokenized my comments and converted them into IDs:

df_pred = pd.read_csv(LM_PATH/'predict.csv', header=None, chunksize=chunksize)
tok_pred, pred_labels = get_all(df_pred, 1)
pred_ids = np.array([[stoi[o] for o in p] for p in tok_pred])
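For context, stoi maps tokens to IDs; in the IMDB notebook it is built as a defaultdict so tokens missing from the training vocabulary fall back to 0 instead of raising a KeyError. A toy sketch with a made-up vocabulary (the itos list here is an assumption, not my real vocab):

```python
import collections

# toy vocabulary; a real stoi would come from the training corpus
itos = ['_unk_', '_pad_', 'the', 'movie', 'was', 'great']
stoi = collections.defaultdict(int, {s: i for i, s in enumerate(itos)})

tok_pred = [['the', 'movie', 'was', 'great'],
            ['the', 'movie', 'was', 'terrible']]  # 'terrible' is out of vocab

# unknown tokens fall back to index 0 ('_unk_') instead of raising KeyError
pred_ids = [[stoi[o] for o in p] for p in tok_pred]
```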

I’m trying to figure out the exact data structure/type required by the fast.ai predict_array() function.

I’ve tried multiple different types of arrays without any luck. Is there something I’m missing?

Always a good idea to post your code here so folks can see what is what.

My guess is that you are missing a dimension (right now you have a one-dimensional np array of lists … evaluate pred_ids and you’ll see what I’m talking about). You need to add an extra dimension:

np.expand_dims(pred_ids, axis=1)

Take a look at the output for one of your batches when you train. You’ll notice that a document isn’t numericalized across rows, but rather columns.
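To feed predict-time documents in that same layout, you’d pad them to a common length and transpose so each column is one document. A rough sketch (the pad index of 1 is an assumption based on the default fastai vocab, and the token IDs are shortened):

```python
import numpy as np

docs = [[40, 41, 42, 537, 637, 3],
        [40, 41, 42, 3]]
pad_idx = 1  # assumed padding index

max_len = max(len(d) for d in docs)
# left-pad each document to a common length
padded = np.array([[pad_idx] * (max_len - len(d)) + d for d in docs])

batch = padded.T  # shape (max_len, n_docs): one document per column
```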

Thanks @wgpubs

Below is pred_ids. It is two arrays of IDs (two IMDB comments) within an array:

[[40,
  41,
  42,
  ...
  537,
  637,
  3],
 [40,
  41,
  42,
  ...
  7,
  74,
  3]] 

I tried np.expand_dims(pred_ids, axis=1) to add a dimension as you suggested, but still no luck (see below):

array([[list([40, 41, 42, 39, 11, 6, ... 607, 23307, 3021, 637, 537, 637, 3])],
       [list([40, 41, 42, 39, 13, ... , 10181, 5, 55, 6, 616, 477, 7, 74, 3])]],
      dtype=object)

prediction = learn.predict_array(np.expand_dims(pred_ids, axis=1))

The error I receive is NotImplementedError, raised because the array’s dtype is object rather than the np.float required by the predict_array function in learner.py (and then core.py).

Take a look at the output for one of your batches when you train. You’ll notice that a document isn’t numericalized across rows, but rather columns.

Would you mind clarifying what you mean by this?

Sure.

After you create your ModelData object … take a look at one of your batches like so:

x, y = next(iter(trn_dl))
print(x.size(), x.type(), y.size(), y.type())

You can also just print(x) and print(y) if you like.

This is a helpful technique to a) ensure your ModelData object is working, b) verify that your dataloaders are returning the data in the correct format and type, and c) understand the data format your pytorch model is expecting.

I include that code in most of my notebooks after creating my dataloaders, as a check that all is well before proceeding to training.

  1. Make sure that all p’s in your tok_pred are the same length: pred_ids = np.array([[stoi[o] for o in p] for p in tok_pred])

  2. The output of pred_ids = np.array([[stoi[o] for o in p] for p in tok_pred]) should be an np.array of shape [num_sentences, num_tok_in_sentence], but as far as I can see you have an array of lists.
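That second point is easy to verify: np.array over lists of unequal length produces a one-dimensional object array, while equal lengths give a proper 2-D integer array. A quick check (newer NumPy requires dtype=object explicitly for ragged input):

```python
import numpy as np

ragged = np.array([[1, 2, 3], [4, 5]], dtype=object)  # unequal lengths
even = np.array([[1, 2, 3], [4, 5, 6]])               # equal lengths

print(ragged.shape, ragged.dtype)  # (2,) object -- an array of lists
print(even.shape, even.dtype)      # (2, 3) -- a real 2-D integer array
```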


@wgpubs thanks for the clarification. Your suggestion makes sense to me but I’m still having trouble:

x, y = next(iter(trn_dl))
print(x.size(), x.type(), y.size(), y.type())

returns:

torch.Size([482, 24]) torch.cuda.LongTensor torch.Size([24]) torch.cuda.LongTensor

Every iteration of trn_dl returns a x pytorch long tensor of torch.Size([ N, 24]) which I confirmed is what the pytorch model is expecting.

I also used the same code as in the fast.ai predict_array function and was able to confirm that the resulting data type and form are correct:

to_gpu(V(T(x)))

returns:

torch.cuda.LongTensor of size 129x24 (GPU 0)

When I try the prediction:

learn.model.eval()
to_np(learn.model(to_gpu(V(T(x)))))

However, the resulting list of arrays does not look like the classification I am expecting:

[array([[ 4.76909, -5.10486],
    [-7.02031,  7.0272 ],
    [ 4.62994, -4.94379],
    ...
    [ 4.42602, -4.64515],
    [-6.98176,  6.9849 ],  
    [ 6.46633, -6.73317],
    [ 5.12761, -5.3733 ],
    [-4.51221,  4.46605]], dtype=float32),
[array([[[-0.01546,  0.00402,  0.00247, ...,  0.21298, -0.02448,  0.00726],
      [-0.01546,  0.00402,  0.00247, ...,  0.21298, -0.02448,  0.00726],
      [-0.01546,  0.00402,  0.00247, ...,  0.21298, -0.02448,  0.00726],
      ...,
     [[-0.00816,  0.00167,  0.21515, ...,  0.00561,  0.06843,  0.15961],
      [-0.08275,  0.03771, -0.45695, ...,  0.15028,  0.08362,  0.02582],
      [ 0.0016 ,  0.00206,  0.01634, ...,  0.01263,  0.03239,  0.00477],
      ...,
      [-0.06923,  0.02377,  0.00858, ...,  0.01588,  0.04878,  0.03017],
      [-0.0011 ,  0.01422, -0.02462, ...,  0.02862,  0.08323,  0.00347],
      [-0.06961, -0.00985, -0.23971, ...,  0.33725,  0.01575,  0.00276]],

     [[ 0.00032, -0.00368,  0.00575, ...,  0.02081,  0.05157,  0.06115],
      [-0.05437,  0.01115, -0.10385, ...,  0.15507,  0.10765,  0.00231],
      [ 0.00508,  0.0113 ,  0.01106, ...,  0.02161,  0.06531,  0.00207],
      ...,
      [-0.01544,  0.03345,  0.02451, ...,  0.02918,  0.09549,  0.00268],
      [-0.00228,  0.01238, -0.00676, ...,  0.02629,  0.04236,  0.00015],
      [-0.02925, -0.00423, -0.25921, ...,  0.45078,  0.14508, -0.00259]]], dtype=float32)]]

The IMDB classification should result in an array of size two: one number reflecting the likelihood of a POS review and another reflecting a NEG review, right?

Hi @urmas.pitsi, why would pred_ids be np.array[num_sentences, num_tok_in_sentence]?

Shouldn’t pred_ids include the data that needs to be classified? (In this case some array of tokens representing the comment I want to have classified as POS or NEG)

As I understand it, we want to predict multiple samples at a time, so we should give the model an array of examples, where each row represents a sentence we would like to predict. If we send all sentences directly to the GPU for prediction, they need to be the same size: a tensor’s size is fixed, e.g. 100x50.
If our sample consists of variable-length sentences, then we could either loop over them one by one and send each to the predictor as-is, or pad/truncate them to a fixed size and predict in one go on that fixed-size array.
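The loop-one-by-one option can be sketched roughly like this. The model here is a stand-in so the sketch runs; with fastai 0.7 you would call learn.model.eval() and then to_np(learn.model(to_gpu(V(T(x))))) instead. The (seq_len, 1) reshape is the important part:

```python
import numpy as np

# stand-in for learn.model: returns fake (batch, n_classes) logits
def model(x):
    return np.zeros((1, 2))

sentence_ids = [[40, 41, 42, 3], [40, 41, 7, 74, 3]]

preds = []
for ids in sentence_ids:
    x = np.array(ids)[:, None]  # shape (seq_len, 1): one document per column
    preds.append(model(x)[0])

preds = np.stack(preds)         # (n_sentences, n_classes)
```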

I tried giving the model a tensor with one array to work around the problem you describe for the time being, but I’m still having issues: the model returns a list of arrays (see post above) instead of an array of size 2 (# of labels).

This doesn’t sound like something that can be solved with the fast.ai predict_array() function. It sounds like I need to use the predict function, but on a sample set instead of the test set? Would someone be able to confirm this?

Actually, your result could very well be what you’re looking for. I don’t know exactly what you’re doing, but my guess is that the first element of the result is the predictions: nr_samples x 2. If you softmaxed these by rows, you’d end up with 2 probabilities for each row: pos and neg. Could that be the case here?
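To check that interpretation, softmaxing each row turns the logit pairs into pos/neg probabilities. A sketch using the first two rows of the output above:

```python
import numpy as np

l_x = np.array([[ 4.76909, -5.10486],
                [-7.02031,  7.0272 ]])

# row-wise softmax (subtract the row max for numerical stability)
exp = np.exp(l_x - l_x.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)
# each row now sums to 1: the first row is almost entirely class 0,
# the second almost entirely class 1
```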

Hmm, it’s possible, but I don’t think so. Shouldn’t the returned prediction (Y) be two numbers (pos & neg) and nothing more, just like the Y labels used to train the model?

Hi @echan00,

I have adapted the IMDB notebook to make sentiment predictions on Twitter data and published it here: german2vec

I am not sure about the predict_array() function, but I have included some code to perform manual predictions on sentences (one sentence at a time).

#sentence to be classified
sen = "guten morgen schon novemberlicht da mag man gar nicht glauben was gestern wetter war"
idxs = np.array([[stoi[p] for p in sen.strip().split(" ")]])

#converting sentence into numerical representation
print('numeric representation: ' + str(idxs))
idxs = np.transpose(idxs)

#get predictions from model
p = m(VV(idxs))
print(sen + ': ' + str(to_np(torch.topk(p[0],1)[1])[0]))

Please have a look at the last section of the notebook (titled Experiments) for further information.

Hope this helps!

Matthias


Isn’t that what you are getting in the first array?

Take a look at the code for PoolingLinearClassifier. It returns 3 things (the first is the predictions): return l_x, raw_outputs, outputs
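So when calling the model directly you’d unpack the tuple and keep only the first element. A runnable sketch with a stand-in classifier (the real call would be learn.model(...) after learn.model.eval()):

```python
import numpy as np

# stand-in with the same return signature as fastai's PoolingLinearClassifier:
# (class logits, raw RNN activations, activations after dropout)
def fake_classifier(x):
    return np.array([[5.06974, -5.39295]]), None, None

l_x, raw_outputs, outputs = fake_classifier(None)  # keep only l_x
pred_class = int(l_x[0].argmax())                  # index of the top logit
```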


This makes a lot of sense.

To test, I predicted one comment with one sentence:

"This movie sucks. I would never come back to see this movie again. It is distasteful and just a complete waste of time."

And the first output l_x of the prediction is:

array([[    5.06974,    -5.39295],
       [    5.54907,    -5.61736],
       [-4907.2866 , -6237.5317 ],
       [   51.54634,    26.35349],
       [   11.07813,   -11.46861],
       [    8.99281,    -9.36761],
       [  156.28255,   134.86494],
       [    3.09114,    -3.43671],
       [   11.61319,   -11.29998],
       [   21.02145,    -8.87303],
       [   14.00674,   -13.49925],
       [   32.99422,     7.12977],
       [-3234.554  , -3277.3706 ],
       [    7.04145,    -6.41294],
       [ -696.333  ,  -696.00507],
       [   11.07813,   -11.46861],
       [    8.99281,    -9.36761],
       [    6.17078,    -6.43465],
       [    3.09114,    -3.43671],
       [    6.9118 ,    -7.50732],
       [   -5.05259,     5.08622],
       [    9.96617,   -10.12175],
       [    5.08797,    -5.19022],
       [ 1658.1726 ,  1640.629  ],
       [   17.19005,     0.4159 ],
       [ 3276.3281 ,  3045.2246 ],
       [  784.67566,   752.30774],
       [ -192.18279,  -197.83821],
       [-6243.359  , -6347.466  ],
       [    3.09114,    -3.43671]], dtype=float32)

Is it not a bit presumptuous to assume the first row [ 5.06974, -5.39295] is the prediction for that comment/sentence?

On another note, I really appreciate @wgpubs @urmas.pitsi working with me on this.


That would make perfect sense if you softmax this.
You’re welcome! I’m constantly in your position myself, struggling to interpret the results and fighting errors.


Thanks for sharing @MatthiasBachfischer. Awesome stuff.

I’m actually hoping to eventually build a classification model for entire texts. For example whether the document mentions/discusses a particular item. And if so, which category is that particular excerpt about. I imagine this a bit different compared to sentiment.

Would you mind sharing an example of what the variables idxs and p from your code look like? Interested in seeing the data type and size.

Here you go :wink:


Thanks. Your p[0] is a torch.FloatTensor of size 1x3, which looks like what I should be expecting also.

Great! This is exactly what I needed.
Thanks!

I built a multi-class (29 classes) classifier following the IMDB example, and it worked just like the binary classifier.