Say we have a binary text classification task with labels ‘a’ and ‘b’. If we use TextClasDataBunch.from_df and the first label encountered in train_df is ‘a’, then ‘a’ gets mapped to 0 and ‘b’ to 1. If the first label encountered in valid_df is ‘b’, then ‘b’ gets mapped to 0 and ‘a’ to 1. The two sets end up with opposite encodings, so as the model learns the training labels, the accuracy on the validation set drops below 0.5 and keeps decreasing.
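To make the effect concrete, here is a minimal pure-Python sketch of the behaviour described above. It does not call fastai at all: `first_seen_mapping` is a hypothetical helper that only imitates an "order of first appearance" encoding, and the label lists are made up for illustration.

```python
def first_seen_mapping(labels):
    """Map each label to an integer in order of first appearance (hypothetical helper)."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return mapping


# Toy data: the training set starts with 'a', the validation set with 'b'.
train_labels = ["a", "b", "a", "b"]
valid_labels = ["b", "a", "b", "a"]

# Each split is encoded independently, so the encodings are flipped.
train_map = first_seen_mapping(train_labels)
valid_map = first_seen_mapping(valid_labels)
print(train_map)  # {'a': 0, 'b': 1}
print(valid_map)  # {'b': 0, 'a': 1}

# A model that has perfectly learned the training encoding...
predictions = [train_map[label] for label in valid_labels]
# ...is scored against targets that use the validation encoding.
targets = [valid_map[label] for label in valid_labels]

accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)
print(accuracy)  # 0.0
```

The better the model fits the training encoding, the closer the validation accuracy gets to 0, which matches the "<0.5 and decreasing" pattern.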
What version do you have? I just ran your code and it seems to be ok for me.
The main test is to check if data.train_ds.classes and data.valid_ds.classes are the same.
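Something like the following could be used for that check. It is only a sketch: `check_classes_match` is a made-up helper name, and it works on plain lists so it can be tried without fastai installed.

```python
def check_classes_match(train_classes, valid_classes):
    """Return True if both splits list the same classes in the same order."""
    return list(train_classes) == list(valid_classes)


# Same classes, same order: the label indices agree across splits.
print(check_classes_match(["a", "b"], ["a", "b"]))  # True

# Same classes, different order: index 0 means 'a' in one split and 'b' in the other.
print(check_classes_match(["a", "b"], ["b", "a"]))  # False
```

With a real DataBunch you would pass data.train_ds.classes and data.valid_ds.classes as the two arguments. Order matters here, because the position in the list is what determines the integer label.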
I faced the same problem 5 days ago on v1.0.22. I did not have time to investigate, so for the time being I just saved my data in folders and used the load-from-folder method. I haven’t gone back to look into it since.
I was on version 1.0.22. Updating to 1.0.24 solved the issue above for me.
Another issue (still present in version 1.0.24) seems to be that TextClasDataBunch.from_df ignores the column names and simply assumes that the first column describes the classes. For example