Error with creating test dataloaders from dataframe

anshu1 · December 31, 2020, 3:06pm

I’m using fastai v2. I have trained a text classification model and want to use this model to predict on a new test dataset. The dataset is a dataframe with only one column ‘Text’ and no label column.

How do I create a DataLoaders object from this dataframe so I can predict on this dataset? I always encounter an error when using the TextDataLoaders.from_df() method. The code and error message are below. Any help would be appreciated, thank you!
@muellerzr Not sure if I’m allowed to tag you here, but I would really appreciate any advice you could give me. Thank you.

dls = TextDataLoaders.from_df(df, text_col='Text')

/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order)

IndexError Traceback (most recent call last)
in ()
----> 1 dls = TextDataLoaders.from_df(bacteria, text_col='Text')

24 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/series.py in __getitem__(self, key)
877
878 if is_integer(key) and self.index._should_fallback_to_positional():
--> 879 return self._values[key]
880
881 elif key_is_scalar:

IndexError: index 1 is out of bounds for axis 0 with size 1

msivanes · December 31, 2020, 4:54pm

@anshu1
Since you have already trained the model, the steps you need are as follows (a more complete example is in [1]):

1. Load the learner.
2. Tokenize the text columns in your test dataframe.
3. Initialize the test dataloader in the learner with the tokenized dataframe.

learn = load_learner('SAVED_MODEL')
tokenized_df = tokenize_df(test_df, text_cols='Text', tok_text_col='text') #returns a tuple
test_dl = learn.dls.test_dl(tokenized_df[0], with_labels=False)  #initialize the test dataloader
learn.get_preds(dl=test_dl) # Get the predictions on your test data loader using [2]

[1] [Knowledge Base] Adding test dataloader with multiple columns to learner with SentencePiece
[2] https://docs.fast.ai/learner.html#Learner.get_preds

anshu1 · January 1, 2021, 4:01pm

@msivanes
Happy New Year and thank you SO much for your response - your solution worked!! I did have a follow-up question about getting predictions using get_preds(). I’m using the code below to save my predictions for the test dataset.

preds,targs = learn.get_preds(dl=test_dl)
output = pd.DataFrame(columns=['predictions'])
output['predictions'] = preds.numpy()[:, 0]

How do I get the actual class label using fastai? (In my case, it’s a binary classification problem with 0 and 1 as my target classes).

muellerzr · January 1, 2021, 4:24pm

dls.vocab contains your class labels, so something like this should work.

out = pd.DataFrame(columns=['predictions'])
out['predictions'] = preds.argmax(dim=1).numpy()
out['predictions'] = out['predictions'].apply(lambda x: dls.vocab[x])

That last line could even be a little cleaner:

out['predictions'] = out['predictions'].apply(dls.vocab.__getitem__)
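For anyone following along without a trained learner, the argmax-then-lookup step can be sketched in plain Python. Here `preds` is a hand-written list of per-class probabilities standing in for the tensor returned by `learn.get_preds`, and `vocab` stands in for `learn.dls.vocab`. Both the values and the label names are made up for illustration.

```python
# Stand-in for the (n_samples, n_classes) output of learn.get_preds (made-up values)
preds = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
]
# Stand-in for learn.dls.vocab: position i holds the label of class index i
# (hypothetical label names, not from the original post)
vocab = ["negative", "positive"]

def argmax(row):
    """Return the index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

# Index of the most probable class for each sample
class_idxs = [argmax(row) for row in preds]
# Look each index up in the vocab to get the actual class label
labels = [vocab[i] for i in class_idxs]
print(class_idxs)
print(labels)
```

With targets that are already 0 and 1, the vocab lookup would likely return the same numbers as the argmax, so the lookup matters most when the labels are strings.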

anshu1 · January 4, 2021, 12:34pm

@muellerzr Thank you so much for your answer - the preds.argmax worked. However, I am still getting an error when I run the code with “dls.vocab” (‘dls’ is not defined).

I just wanted to clarify if ‘dls’ here refers to the test dataloaders object or the model itself.

muellerzr · January 6, 2021, 5:50pm

This would be learn.dls.vocab. You may be able to do dl.vocab (if dl is your generated test_dl)

anshu1 · January 22, 2021, 5:40am

Thank you, I understand it now. Also, I wanted to confirm what the argument ‘reorder = True’ means when generating predictions?

I set it to False in get_preds, is that the correct way of getting predicted labels for your test data?
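As far as I understand it, fastai's text classification dataloaders serve validation and test items sorted by length to reduce padding, and `reorder=True` (the default in `get_preds`) puts the predictions back into the original row order. The sketch below imitates that idea with plain Python lists. The texts and the fake "prediction" are invented, and this is not fastai's actual code.

```python
# Invented test rows, in dataframe order
texts = ["a b c", "a", "a b c d e", "a b"]

# Order in which a length-sorted dataloader would serve the rows (longest first)
served = sorted(range(len(texts)), key=lambda i: len(texts[i].split()), reverse=True)

# Fake "prediction" for each served item: its token count
preds_served = [len(texts[i].split()) for i in served]

# Undo the sort: the prediction for served[k] belongs at row served[k]
preds_in_df_order = [None] * len(texts)
for k, original_idx in enumerate(served):
    preds_in_df_order[original_idx] = preds_served[k]

print(preds_served)       # predictions in served (sorted) order
print(preds_in_df_order)  # predictions aligned with the dataframe rows
```

If that reading is right, `reorder=False` would leave the predictions in the served order, so lining them up with the dataframe rows would call for `reorder=True`.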