KeyError with text classifier dataloaders

Hi all,

I’m trying to build a text classifier using the two-step process: fine-tune a pre-trained language model, then build the classifier on top of it. My problem: I get a KeyError when building the dataloaders for the classifier if I use more than about 1000 texts. With a small set of texts I get no error, and everything looks like the examples I’m following in the book. But with my real data set, it fails.

Any ideas where I should look? The data is in a pandas dataframe. Here’s some basic info and code I’m trying to run.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 4003 to 5002
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   PackingListID   1000 non-null   int64 
 1   Classification  1000 non-null   int64 
 2   Chemicals       1000 non-null   object
 3   is_valid        1000 non-null   bool  
dtypes: bool(1), int64(2), object(1)
memory usage: 32.2+ KB

**** Language Model Loader *****
dls_lm= DataBlock(

***** Classifier Loader **** This one makes the KeyError

Hey, I solved my problem.

The problem was with the CategoryBlock: a number of my categories had only one example, or just a few. Once I removed the low-frequency categories, I could make the block with 50,000 examples and no errors.
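For anyone wanting to try the same thing, here is a minimal sketch of that filtering step in pandas. It is not the exact code I ran: the helper name `drop_rare_classes` and the `min_count` threshold are made up for illustration, and the column names are just the ones from the dataframe info above.

```python
import pandas as pd

def drop_rare_classes(df, label_col="Classification", min_count=5):
    """Keep only rows whose label occurs at least `min_count` times.

    `min_count` is an arbitrary, hypothetical threshold; tune it to your data.
    """
    counts = df[label_col].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[label_col].isin(keep)].copy()

# Toy data standing in for the real dataframe (assumed column names).
toy = pd.DataFrame({
    "Chemicals": [f"text {i}" for i in range(8)],
    "Classification": [1, 1, 1, 2, 2, 2, 3, 4],
})

# Classes 3 and 4 have a single example each, so they are dropped.
filtered = drop_rare_classes(toy, min_count=3)
```

Running `toy["Classification"].value_counts()` before filtering is a quick way to see how many classes fall under whatever threshold you pick.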

Hi, I’m running into a similar issue (same use case: a text classifier). In my case, I have many labels with just a few samples. Like you, I’ve tried removing the samples from under-represented categories, but that also means the model will never be able to predict those labels.

One thing I have done is to make sure that every label has samples in both the training and validation sets (using train_test_split with the stratify option). Still, I ran into the issue when I decreased the size of the training set. I’m guessing the way the data is fed to the learner in batches can lead to a validation label not being present in the training set, but I’m not sure about that.
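A quick way to test that guess is to list the labels that appear in the validation rows but not in the training rows. This is only a sketch: the helper name `labels_missing_from_train` is made up, and it assumes a boolean `is_valid` column and a `Classification` label column like the dataframe in the first post.

```python
import pandas as pd

def labels_missing_from_train(df, label_col="Classification", valid_col="is_valid"):
    """Return the labels that occur in validation rows but never in training rows."""
    train_labels = set(df.loc[~df[valid_col], label_col])
    valid_labels = set(df.loc[df[valid_col], label_col])
    return sorted(valid_labels - train_labels)

# Toy data: label 3 only shows up in the validation split.
toy = pd.DataFrame({
    "Classification": [1, 1, 2, 3],
    "is_valid": [False, True, False, True],
})

missing = labels_missing_from_train(toy)
```

If the list is non-empty on the real data, those labels are the first place I would look for the source of the KeyError.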

I’m thinking another way to fix this might be to oversample the under-represented classes.

I think any kind of classifier wouldn’t work well if there weren’t class examples in both the training and validation set.

If by “oversampling” you mean adding more examples from the infrequent classes, even if that’s disproportionate to the actual population, that sounds like the best plan. If you mean using multiple copies of the same example, I think you might end up fooling yourself.
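If you do go the duplicate-copies route anyway, one way to limit the self-deception is to duplicate only training rows, so that no copy ends up in the validation set. A rough sketch, with a made-up helper name (`oversample_train`), an arbitrary `min_count`, and the same assumed `Classification` and `is_valid` columns as above:

```python
import pandas as pd

def oversample_train(df, label_col="Classification", valid_col="is_valid",
                     min_count=3, seed=0):
    """Duplicate training rows of rare classes until each has `min_count` rows.

    Validation rows are passed through untouched, so duplicates never
    appear on the validation side.
    """
    train = df[~df[valid_col]]
    valid = df[df[valid_col]]
    parts = [train]
    for _, group in train.groupby(label_col):
        shortfall = min_count - len(group)
        if shortfall > 0:
            # Sample with replacement, since the group may hold a single row.
            parts.append(group.sample(n=shortfall, replace=True, random_state=seed))
    return pd.concat(parts + [valid], ignore_index=True)

# Toy data: class 2 has only one training example.
toy = pd.DataFrame({
    "Classification": [1, 1, 1, 1, 2, 1, 2],
    "is_valid": [False, False, False, False, False, True, True],
})

balanced = oversample_train(toy, min_count=3)
```

The copies add no new information, so validation metrics on the rare classes still deserve some skepticism.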

Two other things you might try:

  1. Augment your data with slightly modified versions of the low-frequency class examples. You’d have to be the judge of how to do that with your texts while maintaining class membership.

  2. Combine your low-frequency classes into a “misc” class and use another means to classify them if they show up in a prediction.
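The second idea can be sketched in pandas roughly like this. The helper name `merge_rare_classes`, the `min_count` threshold, and the `"misc"` label are all placeholders, and `Classification` is assumed to be the label column from the first post.

```python
import pandas as pd

def merge_rare_classes(df, label_col="Classification", min_count=5, misc_label="misc"):
    """Relabel every class with fewer than `min_count` rows as `misc_label`."""
    counts = df[label_col].value_counts()
    rare = counts[counts < min_count].index
    is_rare = df[label_col].isin(rare)
    out = df.copy()
    # Cast labels to strings so the column holds a single type after relabeling.
    out[label_col] = out[label_col].astype(str).where(~is_rare, misc_label)
    return out

# Toy data: classes 2 and 3 have one example each.
toy = pd.DataFrame({"Classification": [1, 1, 1, 2, 3]})

merged = merge_rare_classes(toy, min_count=2)
```

Anything the classifier then predicts as "misc" would still need that other means of sorting it into its original class.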

I’m not an expert; these are just some ideas.