Bug in TextClasDataBunch.from_df?

tarvoc · November 14, 2018, 10:07pm

Say we have a binary text classification task with labels ‘a’ and ‘b’. If we use TextClasDataBunch.from_df, and the first label encountered in train_df is ‘a’, ‘a’ will get mapped to 0 and ‘b’ will get mapped to 1. If the first label encountered in valid_df is ‘b’, then in the validation set ‘b’ will get mapped to 0 and ‘a’ will get mapped to 1. The validation labels are therefore flipped relative to the training labels, so during training the accuracy on the validation set will be <0.5 and decreasing.

Following is a minimal example for illustration:

from fastai import *
from fastai.text import *
train_df = pd.DataFrame({'label': ['a', 'b'] * 400, 'text': ['foo', 'bar'] * 400})
valid_df = pd.DataFrame({'label': ['b', 'a'] * 100, 'text': ['bar', 'foo'] * 100})
data = TextClasDataBunch.from_df(path=".", train_df=train_df, valid_df=valid_df)
learn = text_classifier_learner(data)
learn.fit_one_cycle(4)
probs, y_correct = learn.get_preds()
probs[:2] # Output: tensor([[0.0035, 0.9965], [0.9963, 0.0037]])
y_correct[:2] # Output: tensor([0, 1])

sgugger · November 14, 2018, 10:51pm

What version do you have? I just ran your code and it seems to be ok for me.
The main test is to check if data.train_ds.classes and data.valid_ds.classes are the same.
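
The mismatch described in the first post can be reproduced without fastai. The following is a minimal sketch, assuming labels are indexed in order of first appearance, as the original post describes. `classes_in_order` and `encode` are hypothetical helpers written for illustration, not fastai functions.

```python
def classes_in_order(labels):
    """Return the distinct labels in order of first appearance."""
    seen = []
    for label in labels:
        if label not in seen:
            seen.append(label)
    return seen


def encode(labels, classes):
    """Map each label to its index in `classes`."""
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels]


train_labels = ["a", "b"] * 3
valid_labels = ["b", "a"] * 3

train_classes = classes_in_order(train_labels)
valid_classes = classes_in_order(valid_labels)

# The check suggested above: both class lists should be identical.
classes_match = train_classes == valid_classes

# Encoding each split with its own class list flips the validation labels.
valid_ids_buggy = encode(valid_labels, valid_classes)
# Reusing the training class list keeps the mapping consistent.
valid_ids_fixed = encode(valid_labels, train_classes)

print(train_classes, valid_classes, classes_match)
print(valid_ids_buggy[:2], valid_ids_fixed[:2])
```

On a real data bunch, the equivalent check is comparing data.train_ds.classes with data.valid_ds.classes.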

howkhang · November 15, 2018, 1:19am

I faced the same problem 5 days ago on v1.0.22. I did not have time to investigate, so for the time being I just saved my data in folders and used the load-from-folder method. I haven’t gone back to look into it since.

https://forums.fast.ai/t/incorrect-mapping-of-class-ids-in-textsplitdatasets-valid-ds-textdataset-from-df/29883

tarvoc · November 15, 2018, 6:51am

I was on version 1.0.22. Updating to 1.0.24 solved the issue above for me.

Another issue (still present in version 1.0.24) seems to be that TextClasDataBunch.from_df ignores the column names and simply assumes that the first column describes the classes. For example:

df = pd.DataFrame({'label': ['a', 'b'], 'text': ['0', '1']})
data = TextClasDataBunch.from_df(path=".", train_df=df, valid_df=df)
data.train_ds.classes # gives ['a', 'b']

df = pd.DataFrame({'text': ['0', '1'], 'label': ['a', 'b']})
data = TextClasDataBunch.from_df(path=".", train_df=df, valid_df=df)
data.train_ds.classes # gives ['0', '1']

sgugger · November 15, 2018, 2:28pm

That’s not an issue. There are arguments for specifying which columns hold the text and the labels; if you don’t pass them, it assumes column 0 for the labels and column 1 for the text.
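
To illustrate the positional default, here is a library-free sketch. `pick_columns` is a hypothetical helper, not part of fastai, and its parameter names are assumptions. It only mimics the behavior described in this thread: labels are read from column 0 and text from column 1 unless the columns are given explicitly.

```python
def pick_columns(columns, rows, label_col=0, text_col=1):
    """Return (labels, texts) from tabular rows.

    Columns may be given by position (int) or by name (str).
    The defaults are positional, so column names are ignored
    unless they are passed explicitly.
    """
    def position(col):
        return col if isinstance(col, int) else columns.index(col)

    label_idx, text_idx = position(label_col), position(text_col)
    labels = [row[label_idx] for row in rows]
    texts = [row[text_idx] for row in rows]
    return labels, texts


# Same layout as the second DataFrame above: text first, label second.
columns = ["text", "label"]
rows = [("0", "a"), ("1", "b")]

# Positional defaults: the first column is treated as the label.
default_labels, default_texts = pick_columns(columns, rows)

# Naming the columns explicitly gives the intended result.
named_labels, named_texts = pick_columns(
    columns, rows, label_col="label", text_col="text"
)

print(default_labels)  # ['0', '1']
print(named_labels)    # ['a', 'b']
```

The actual argument names differ between fastai releases, so check the signature of TextClasDataBunch.from_df in your installed version before relying on them.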