TextList vs TextDataBunch: is one preferred over the other?

dhiraj · November 28, 2019, 2:21pm

I would like to know the difference between creating a DataBunch using the APIs of TextList vs. TextDataBunch.

Is one preferred over the other?

sgugger · November 28, 2019, 2:42pm

The factory methods of TextClasDataBunch and TextLMDataBunch (note that TextDataBunch in itself is useless as it doesn’t deal with the targets appropriately) are for beginners and get your data in one line of code if it’s in a standard format.
TextList is part of the data block API which is more flexible and powerful.

dhiraj · November 28, 2019, 3:04pm

Thanks for the explanation.

The motivation behind this question: I am working on a text classification problem, and when I use the TextDataBunch factory classes TextLMDataBunch and TextClasDataBunch, I get an accuracy of ~30% and ~74% for the language model and the classifier respectively.
However, when I use TextList to create the DataBunch for the language model and the classifier, the accuracy shoots up to ~80% and ~99%.

I am not sure the latter results are correct, and I would appreciate some guidance in finding the reason behind this.

Some code for reference

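import pandas as pd
from fastai.text import *  # fastai v1; bs and bptt are defined elsewhere (values not shown)

df_train = pd.read_csv('train.csv')
df_valid = pd.read_csv('test.csv')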

# Add an "is_valid" column marking whether an example belongs to the validation set or not
df_train['is_valid'] = 0
df_valid['is_valid'] = 1
# Combine the two dataframes into one
df_combine = pd.concat([df_train, df_valid], axis=0)

data_lm_1 = TextLMDataBunch.from_df(path='/', train_df=df_train, valid_df=df_valid, bs=bs, bptt=bptt)
learn_lm_1 = language_model_learner(data_lm_1, AWD_LSTM, drop_mult=0.5).to_fp16(clip=0.1)

data_lm_2 = (TextList.from_df(df_combine)
        .split_from_df(col='is_valid')
        .label_for_lm()
        .databunch(bs=bs, bptt=bptt))
learn_lm_2 = language_model_learner(data_lm_2, AWD_LSTM).to_fp16(clip=0.1)

One more observation: learn_lm_1 takes more time, while learn_lm_2 takes comparatively less.

sgugger · November 28, 2019, 3:14pm

This seems a bit too good to be true, so there is probably some sneaky bug. It’s hard to say where without seeing any code however.

dhiraj · November 28, 2019, 3:20pm

Yes, that is why I don't believe it and am trying to find the reason.

I have updated my earlier reply to include some code.
Please let me know if I can share the code in some better way.

sgugger · November 28, 2019, 3:37pm

At first glance it seems like they should be the same. Did you check the first elements of each dataset (data_lm_2.train_ds[0]/data_lm_2.valid_ds[0] for instance)?

dhiraj · November 28, 2019, 3:45pm

I checked them; here they are:

print(data_lm_2.train_ds[0])
print(data_lm_2.valid_ds[0])

print(data_lm_1.train_ds[0])
print(data_lm_1.valid_ds[0])

(Text xxbos xxmaj left xxmaj ventricular xxmaj dysfunction, EmptyLabel ) 
(Text xxbos xxmaj depression, EmptyLabel )

(Text xxbos " xxmaj it has no side effect , i take it in combination of xxmaj bystolic 5 xxmaj mg and xxmaj fish xxmaj oil ", EmptyLabel ) 
(Text xxbos " i 've tried a few antidepressants over the years ( citalopram , fluoxetine , amitriptyline ) , but none of those helped with my depression , insomnia & anxiety . xxmaj my doctor suggested and changed me onto 45 mg mirtazapine and this medicine has saved my life . xxmaj thankfully i have had no side effects especially the most common - weight gain , i 've actually lost alot of weight . i still have suicidal thoughts but mirtazapine has saved me . ", EmptyLabel )

sgugger · November 28, 2019, 5:00pm

Ok, then the problem is clear: you didn’t pass a text column when building your data_lm_2, so it picked the label column instead of the text column (hence the classification being so accurate!).
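
To make the mechanism concrete, here is a minimal plain-Python sketch (no fastai required) of how a positional column default can silently select the labels. It assumes rows laid out as (label, text), which matches the outputs printed above; the sample rows and the helper names pick_column and mean_token_count are hypothetical and only for illustration. As far as I recall, in fastai v1 TextList.from_df reads column 0 unless told otherwise via its cols argument, while TextLMDataBunch.from_df defaults to text_cols=1; treat those defaults as an assumption to check against your installed version.

```python
# Plain-Python illustration of a positional column default picking the
# label column instead of the text column. All data here is made up.

rows = [
    ("Depression", "This medicine has helped with my depression and insomnia ."),
    ("Pain", "It has no side effect , I take it with fish oil ."),
    ("Depression", "I have tried a few antidepressants over the years ."),
]


def pick_column(rows, col):
    """Return the values stored at position `col` of every row."""
    return [row[col] for row in rows]


def mean_token_count(values):
    """Average number of whitespace-separated tokens per value."""
    return sum(len(v.split()) for v in values) / len(values)


col0 = pick_column(rows, 0)  # what a loader defaulting to column 0 would read
col1 = pick_column(rows, 1)  # what a loader defaulting to column 1 would read

# Very short items with repeated values suggest labels rather than free text.
looks_like_labels = mean_token_count(col0) < 3 and len(set(col0)) < len(col0)

print("column 0 mean tokens:", mean_token_count(col0))
print("column 1 mean tokens:", mean_token_count(col1))
print("column 0 looks like labels:", looks_like_labels)
```

A language model fed only a handful of short, repeated label strings has very little to predict, which would be consistent with both the inflated accuracy and the shorter training time. In the data block code, the likely fix is to name the text column explicitly, for example TextList.from_df(df_combine, cols='text'), where 'text' stands in for whatever the column is actually called in your dataframe.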

dhiraj · November 29, 2019, 5:08am

Thanks sgugger, I appreciate you taking the time to help me out.
I made the changes and now I am back to a believable accuracy for the language models.