Training a Text Classifier

youcefjd · April 2, 2021, 12:13am

Hi, I am trying to do basic training on a relatively clean DataFrame. For some reason, the language model trains normally, but the classifier doesn’t.

Here’s my code:
dbunch_lm = TextDataLoaders.from_df(df_small, text_col='TEXT', label_col = 'ID', bs = 64)

The error:
KeyError: "Label '1112193811203' was not included in the training dataset"

Note that 1112193811203 is one of my labels.

Help please!

youcefjd · April 2, 2021, 11:43am

Any idea? Does the label need to be a string type?

BobMcDear · April 2, 2021, 11:56am

The error is saying there’s a label (ID) in your (I think) validation set not present in the training set, which is not surprising since IDs are usually unique to each row.

Are you sure you’re labelling your data properly? Maybe you meant some other column?
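To see why an ID-like column triggers this error, here is a minimal pure-Python sketch. The helper name `singleton_labels` and the sample label values are made up for illustration and are not taken from the poster's data. A label that occurs only once can land in the training split or the validation split, never both, so a random split will often leave validation labels that the training set has never seen.

```python
from collections import Counter

def singleton_labels(labels):
    """Return the labels that occur exactly once, in first-seen order."""
    counts = Counter(labels)
    return [label for label in labels if counts[label] == 1]

# Hypothetical label column: two real classes plus two row-unique IDs.
labels = ["spam", "ham", "spam", 1112193811203, "ham", 1112193811204]
unique_only = singleton_labels(labels)
print(unique_only)
print(f"{len(unique_only)} of {len(set(labels))} distinct labels appear only once")
```

If most labels in the real column show up in a list like this, the column is probably an identifier rather than a class label.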

youcefjd · April 2, 2021, 12:07pm

I am not using a validation set at all. Is_valid is None

BobMcDear · April 2, 2021, 12:14pm

TextDataLoaders.from_df automatically creates a validation set for you, unless otherwise specified.

youcefjd · April 2, 2021, 12:16pm

Yes, but how do I create a training set without a validation set? Do I have to create a validation column with boolean values?

BobMcDear · April 2, 2021, 12:25pm

There are multiple ways to achieve that, but I usually pass valid_pct=1e-10 because #samples * 1e-10 is basically zero.
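The arithmetic behind that trick, as a rough sketch: a random splitter typically sizes the validation set by truncating `valid_pct * n` to an integer, so a tiny fraction yields zero validation rows. The function `random_split` below is a hypothetical stand-in written in plain Python, not fastai's actual code.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle indices 0..n_items-1 and use the first int(valid_pct * n_items) as validation."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)  # int() truncates toward zero
    return idxs[cut:], idxs[:cut]   # (train indices, validation indices)

train, valid = random_split(1000, valid_pct=0.2, seed=0)
print(len(train), len(valid))   # 800 200

train, valid = random_split(1000, valid_pct=1e-10, seed=0)
print(len(train), len(valid))   # 1000 0
```

With an empty validation set, there is nothing for metrics such as accuracy to be computed on, which is worth keeping in mind when reading the training output.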

youcefjd · April 2, 2021, 1:51pm

For some reason, it keeps telling me that one of the labels is not included in the training set - although I haven’t trained anything yet. I am just trying to create a data bunch before training the model.

BobMcDear · April 2, 2021, 3:56pm

Does it say that when creating the DataLoaders? Would you mind sharing your code please?

youcefjd · April 2, 2021, 4:36pm

I managed to bypass the error by including the valid_pct parameter. Thank you!
Now, for the classifier, when I run: learn.lr_find() after executing the two lines below:

learn = text_classifier_learner(dbunch_cl, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()
learn.load_encoder('fine_tuned_enc')

I get this error: TypeError: forward() missing 1 required positional argument: 'input'
Any idea?

youcefjd · April 2, 2021, 4:56pm

Now getting this error: IndexError: index 3 is out of bounds for dimension 0 with size 3

BobMcDear · April 2, 2021, 5:16pm

What is dbunch_cl? The error means there’s likely a mistake regarding the number of classes somewhere in your code.

youcefjd · April 2, 2021, 6:00pm

It’s this line:

dbunch_cl = TextDataLoaders.from_df(df_epsilon, text_col='TEXT', label_col = 'ID', is_lm = False, bs = 32, valid_pct=1e-10)

youcefjd · April 2, 2021, 6:40pm

I am getting these two errors

youcefjd · April 2, 2021, 6:42pm

My code is:

dbunch_lm = TextDataLoaders.from_df(df_epsilon, text_col='TEXT', label_col = 'ID', is_lm = True, bs = 64)

learn = language_model_learner(dbunch_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]).to_fp16()

learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7,0.8))

learn.save_encoder('fine_tuned_enc')

dbunch_cl = TextDataLoaders.from_df(df_epsilon, text_col='TEXT', label_col = 'ID', is_lm = False, bs = 32, valid_pct=1e-10)

learn = text_classifier_learner(dbunch_cl, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()
learn.load_encoder('fine_tuned_enc')

learn.lr_find()

BobMcDear · April 2, 2021, 8:02pm

The snippet you’ve posted seems fine. What is inside your DataFrame? Also, what version of fastai are you using?

P.S: When creating dbunch_cl, you need to pass in dbunch_lm.vocab to make sure both DataLoaders have the same vocabulary.
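Why the shared vocabulary matters, as a toy illustration: the encoder's embedding rows are indexed by token id, so if the classifier's DataLoaders build their own vocabulary, the same word can map to a different id than the one the fine-tuned encoder learned. The `build_vocab` helper and the two tiny corpora are invented for this example. In fastai, I believe the vocabulary is supplied through the `text_vocab` argument of TextDataLoaders.from_df, but that is worth checking against the docs for your version.

```python
def build_vocab(texts):
    """Assign ids to tokens in first-seen order (a toy stand-in for a real tokenizer and vocab)."""
    vocab = {}
    for text in texts:
        for token in text.split():
            vocab.setdefault(token, len(vocab))
    return vocab

lm_texts = ["the movie was great", "the plot was thin"]
clf_texts = ["great plot", "thin movie"]

lm_vocab = build_vocab(lm_texts)    # vocabulary the encoder was fine-tuned with
clf_vocab = build_vocab(clf_texts)  # vocabulary rebuilt independently

# Same token, different ids: the encoder would look up the wrong embedding rows.
mismatched = [t for t in clf_vocab if clf_vocab[t] != lm_vocab[t]]
print(mismatched)
```

Reusing the language model's vocabulary for the classifier avoids this mismatch, because both then map each token to the same id.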

youcefjd · April 2, 2021, 8:10pm

Now, when I set bs = 128 for both dbunch_lm and dbunch_cl, I only get the last error, and only when I run learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7,0.8))
Error: IndexError: index 3 is out of bounds for dimension 0 with size 3

When I pass in the vocab, the error becomes: forward() missing 1 required positional argument: 'input'

My DataFrame has a text column (a string) and a label column (a number).

Version is ‘2.3.0’

youcefjd · April 2, 2021, 8:11pm

It looks like this:

BobMcDear · April 2, 2021, 8:16pm

I can’t review the code right now, but I will do so as soon as possible after the weekend. Hopefully your problem is solved by then!

youcefjd · April 2, 2021, 8:17pm

Thanks man! I appreciate the help.

Error when I run learn.fine_tune(1) is: IndexError: index 3 is out of bounds for dimension 0 with size 3

Hopefully someone can help me out with this. Have a good weekend!