Error with creating test dataloaders from dataframe

I’m using fastai v2. I have trained a text classification model and want to use this model to predict on a new test dataset. The dataset is a dataframe with only one column ‘Text’ and no label column.

How do I create a dataloaders object from this dataframe so I can predict on this dataset? I always encounter an error when using the TextDataLoaders.from_df() method. The code and error message are below. Any help would be appreciated, thank you!
@muellerzr Not sure if I’m allowed to tag you here, but I would really appreciate any advice you could give me. Thank you.

dls = TextDataLoaders.from_df(df, text_col='Text')

/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order)

IndexError Traceback (most recent call last)
in ()
----> 1 dls = TextDataLoaders.from_df(bacteria, text_col = 'Text')

24 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/series.py in __getitem__(self, key)
877
878 if is_integer(key) and self.index._should_fallback_to_positional():
--> 879 return self._values[key]
880
881 elif key_is_scalar:

IndexError: index 1 is out of bounds for axis 0 with size 1

@anshu1
Since you have already trained the model, these are the steps you need (a more complete example is in [1]):

  • load the learner
  • tokenize the text column in your test dataframe
  • initialize the test dataloader in the learner with the tokenized dataframe
from fastai.text.all import *

learn = load_learner('SAVED_MODEL')  # load the exported learner
tokenized_df = tokenize_df(test_df, text_cols='Text', tok_text_col='text')  # returns a tuple; the tokenized dataframe is the first element
test_dl = learn.dls.test_dl(tokenized_df[0], with_labels=False)  # initialize the test dataloader
learn.get_preds(dl=test_dl)  # get the predictions on your test dataloader using [2]

[1] [Knowledge Base] Adding test dataloader with multiple columns to learner with SentencePiece
[2] https://docs.fast.ai/learner.html#Learner.get_preds


@msivanes
Happy New Year and thank you SO much for your response - your solution worked!! I did have a follow-up question about getting predictions using get_preds(). I’m using the code below to save my predictions for the test dataset.

preds,targs = learn.get_preds(dl=test_dl)
output = pd.DataFrame(columns=['predictions'])
output['predictions'] = preds.numpy()[:, 0]

How do I get the actual class label using fastai? (In my case, it’s a binary classification problem with 0 and 1 as my target classes).

dls.vocab contains your class labels, so something like this should work.

out = pd.DataFrame(columns=['predictions'])
out['predictions'] = preds.argmax(dim=1).numpy()
out['predictions'] = out['predictions'].apply(lambda x: dls.vocab[x])

That last line could even be made a little cleaner:

out['predictions'] = out['predictions'].apply(dls.vocab.__getitem__)

@muellerzr Thank you so much for your answer - the preds.argmax worked. However, I am still getting an error when I run the code with “dls.vocab” (‘dls’ is not defined).

I just wanted to clarify if ‘dls’ here refers to the test dataloaders object or the model itself.

This would be learn.dls.vocab. You may also be able to use dl.vocab (if dl is your generated test_dl).
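
To make the index-to-label step concrete, here is a minimal sketch in plain Python with no fastai dependency. The `vocab` list and the `preds` rows are made-up stand-ins for `learn.dls.vocab` and the probabilities returned by `get_preds`, and the `argmax` helper is only for illustration.

```python
# Hypothetical stand-ins: in the thread these would come from
# learn.dls.vocab and learn.get_preds(dl=test_dl).
vocab = ['neg', 'pos']
preds = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
]

def argmax(row):
    """Return the index of the largest probability in one row."""
    return max(range(len(row)), key=row.__getitem__)

# Step 1: predicted class index for every row.
pred_idxs = [argmax(row) for row in preds]

# Step 2: look each index up in the vocab to get the class label.
pred_labels = [vocab[i] for i in pred_idxs]

print(pred_idxs)
print(pred_labels)
```

If the vocab is simply [0, 1], as it likely is for a binary problem with 0 and 1 as targets, the indices and the labels would coincide, so the lookup only changes the output when the labels are something other than their own positions.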

Thank you, I understand it now. Also, I wanted to confirm what the argument ‘reorder = True’ means when generating predictions?

I set it to False in get_preds. Is that the correct way of getting predicted labels for your test data?
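
On the reorder question, my understanding (an assumption, not something confirmed in this thread) is that fastai's text dataloaders group texts of similar length together, so predictions can come back in a different order than the dataframe rows, and reorder=True puts them back in the original order. A toy sketch of that idea in plain Python follows. Every name in it is hypothetical and it is not fastai's actual implementation.

```python
# Toy test set: four texts of different lengths (hypothetical data).
texts = ["a b c d", "a", "a b c", "a b"]

# A length-sorted loader would visit the longest texts first.
sorted_idxs = sorted(
    range(len(texts)),
    key=lambda i: len(texts[i].split()),
    reverse=True,
)

# Hypothetical "model": the prediction is just the token count,
# produced in the order the loader yields the texts.
preds_in_loader_order = [len(texts[i].split()) for i in sorted_idxs]

# Undo the sort: position k in loader order belongs to row sorted_idxs[k].
preds_in_row_order = [None] * len(texts)
for k, row_idx in enumerate(sorted_idxs):
    preds_in_row_order[row_idx] = preds_in_loader_order[k]

print(preds_in_loader_order)
print(preds_in_row_order)
```

If that reading is right, leaving reorder at True is what keeps each prediction aligned with its row in the test dataframe, while reorder=False would return them in the loader's order.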