Training a Text Classifier

Hi, I am trying to do basic training on a relatively clean DataFrame. For some reason the language model trains normally, but the classifier doesn’t.

Here’s my code:
dbunch_lm = TextDataLoaders.from_df(df_small, text_col='TEXT', label_col = 'ID', bs = 64)

The error:
KeyError: "Label '1112193811203' was not included in the training dataset"

Note that 1112193811203 is one of my labels.

Help please!

Any idea? Does the label need to be a string type?

The error is saying there’s a label (ID) in your (I think) validation set not present in the training set, which is not surprising since IDs are usually unique to each row.

Are you sure you’re labelling your data properly? Maybe you meant some other column?
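One quick way to test that theory, sketched in plain Python rather than fastai: split the labels at random and see which validation labels never show up in training. The sample lists and the 20% split below are made-up stand-ins for your ID column and the default valid_pct, not your actual data.

```python
import random

def labels_missing_from_train(labels, valid_pct=0.2, seed=0):
    """Randomly split `labels` and return validation labels never seen in training."""
    rng = random.Random(seed)
    idxs = list(range(len(labels)))
    rng.shuffle(idxs)
    cut = int(valid_pct * len(labels))
    valid_idxs, train_idxs = idxs[:cut], idxs[cut:]
    train_labels = {labels[i] for i in train_idxs}
    return sorted({labels[i] for i in valid_idxs} - train_labels)

# Hypothetical data: every row has its own ID, like a primary-key column.
unique_ids = [1112193811203 + i for i in range(10)]
# Hypothetical data: three real classes, each repeated many times.
class_labels = ["a", "b", "c"] * 20

missing_unique = labels_missing_from_train(unique_ids)
missing_classes = labels_missing_from_train(class_labels)
print(len(missing_unique), "validation labels unseen in training (unique IDs)")
print(len(missing_classes), "validation labels unseen in training (repeated classes)")
```

If almost every validation label comes back as missing on your real column, the column is probably an identifier rather than a class label.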

I am not using a validation set at all. Is_valid is None

TextDataLoaders.from_df automatically creates a validation set for you, unless otherwise specified.

Yes, but how do I create a training set without a validation set? Do I have to create a validation column with boolean values?

There are multiple ways to achieve that, but I usually pass valid_pct=1e-10, because the number of samples times 1e-10 rounds down to zero, so the validation set ends up empty.
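To see why that trick works, here is the arithmetic as a small sketch. It assumes the random splitter computes the validation size as int(valid_pct * n_samples), which I believe is what fastai does, but check your installed version. split_sizes is a hypothetical helper, not a fastai function.

```python
def split_sizes(n_samples, valid_pct):
    """Return (n_train, n_valid), assuming n_valid = int(valid_pct * n_samples)."""
    n_valid = int(valid_pct * n_samples)
    return n_samples - n_valid, n_valid

for n in (1_000, 1_000_000):
    print(n, "default:", split_sizes(n, 0.2), "tiny pct:", split_sizes(n, 1e-10))
```

Keep in mind that with an empty validation set there is nothing to compute a metric such as accuracy on.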

For some reason, it keeps telling me that one of the labels is not included in the training set - although I haven’t trained anything yet. I am just trying to create a data bunch before training the model.

Does it say that when creating the DataLoaders? Would you mind sharing your code please?

I managed to bypass the error by including the valid_pct parameter. Thank you!
Now, for the classifier, when I run: learn.lr_find() after executing the two lines below:

learn = text_classifier_learner(dbunch_cl, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()
learn.load_encoder('fine_tuned_enc')

I get this error: TypeError: forward() missing 1 required positional argument: 'input'
Any idea?

Now getting this error: IndexError: index 3 is out of bounds for dimension 0 with size 3

What is dbunch_cl? The error means there’s likely a mistake regarding the number of classes somewhere in your code.
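As a sketch of what that kind of mismatch looks like, independent of fastai: if the output layer was sized for 3 classes but some target is encoded as index 3, the lookup fails in just this way. The helpers and sample labels below are hypothetical and only illustrate the check.

```python
def encode_labels(labels, classes):
    """Map each label to its position in `classes` (the label vocabulary)."""
    index_of = {c: i for i, c in enumerate(classes)}
    return [index_of[label] for label in labels]

def out_of_range_targets(targets, n_outputs):
    """Return target indices that a model with `n_outputs` outputs cannot represent."""
    return sorted({t for t in targets if not 0 <= t < n_outputs})

# Hypothetical labels: four distinct IDs, but a model head built for three classes.
labels = [101, 102, 103, 104, 101]
classes = sorted(set(labels))
targets = encode_labels(labels, classes)
bad = out_of_range_targets(targets, n_outputs=3)
print("targets:", targets, "| out of range for 3 outputs:", bad)
```

Comparing the number of distinct labels in the DataFrame with the number of classes the DataLoaders report is a cheap first check.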

It’s this line:

dbunch_cl = TextDataLoaders.from_df(df_epsilon, text_col='TEXT', label_col = 'ID', is_lm = False, bs = 32, valid_pct=1e-10)

I am getting these two errors

My code is:

dbunch_lm = TextDataLoaders.from_df(df_epsilon, text_col='TEXT', label_col = 'ID', is_lm = True, bs = 64)

learn = language_model_learner(dbunch_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]).to_fp16()

learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7,0.8))

learn.save_encoder('fine_tuned_enc')

dbunch_cl = TextDataLoaders.from_df(df_epsilon, text_col='TEXT', label_col = 'ID', is_lm = False, bs = 32, valid_pct=1e-10)

learn = text_classifier_learner(dbunch_cl, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()
learn.load_encoder('fine_tuned_enc')

learn.lr_find()

The snippet you’ve posted seems fine. What is inside your DataFrame? Also, what version of fastai are you using?

P.S: When creating dbunch_cl, you need to pass in dbunch_lm.vocab to make sure both DataLoaders have the same vocabulary.
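The reason this matters, as a toy sketch in plain Python: if the two vocabularies are built independently, the same token gets a different integer id in each, so the embeddings loaded from the fine-tuned encoder no longer line up with the classifier's inputs. build_vocab and the sample sentences are hypothetical. In fastai I believe the vocabulary is passed as text_vocab=dbunch_lm.vocab to TextDataLoaders.from_df, but treat that as an assumption to verify against your version.

```python
def build_vocab(texts):
    """Assign each token an integer id in order of first appearance."""
    vocab = {}
    for text in texts:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

lm_texts = ["the movie was great", "the plot was thin"]
clf_texts = ["great plot", "thin movie"]

lm_vocab = build_vocab(lm_texts)    # vocabulary the encoder was trained with
clf_vocab = build_vocab(clf_texts)  # vocabulary built separately for the classifier

# Tokens whose ids disagree between the two vocabularies.
mismatched = {tok for tok in clf_vocab if clf_vocab[tok] != lm_vocab[tok]}
print("tokens with different ids:", sorted(mismatched))

# Reusing the language-model vocabulary keeps ids consistent with the encoder.
shared_ids = [lm_vocab[tok] for tok in "great plot".split()]
print("ids under the shared vocab:", shared_ids)
```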

Now, when I set bs = 128 for both dbunch_lm and dbunch_cl, I only get the last error, and it appears when I run learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7,0.8))
Error: IndexError: index 3 is out of bounds for dimension 0 with size 3

When I pass in the vocab, the error becomes: forward() missing 1 required positional argument: 'input'

My DataFrame has a text column (a string) and a label column (a number).

Version is ‘2.3.0’

It looks like this: (screenshot of the DataFrame not preserved)

I can’t review the code right now, but I will do so as soon as possible after the weekend. Hopefully your problem is solved by then!

Thanks man! I appreciate the help.

Error when I run learn.fine_tune(1) is: IndexError: index 3 is out of bounds for dimension 0 with size 3

Hopefully someone can help me out with this. Have a good weekend!