Text Classification - unexpected very low performance?

Hello. I’m trying to create a classifier that categorizes different Yelp reviews about career coaching businesses.

I’ve manually classified about 100 reviews into 5 relatively even classes: career change, current job, interview, lost job, and resume.

I trained the language model, loaded the encoder, and created the classifier data with the data block API using the following code:

data_clas = (TextList.from_df(df=clas_df, path=path)

Split by idx was just to try to get a nice, even validation set. I tried it with split_by_rand_pct as well.
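For anyone wanting the same kind of even validation set, here is a minimal sketch of picking validation indices per class in plain Python. The function name stratified_valid_idx, the 30% fraction, the seed, and the toy labels (20 per class) are all assumptions for illustration, not the actual data; the resulting list is the kind of index list that split_by_idx expects.

```python
import random
from collections import defaultdict

def stratified_valid_idx(labels, valid_pct=0.3, seed=9):
    """Return sorted validation indices holding roughly valid_pct of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    valid_idx = []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)  # random pick within each class
        n_valid = max(1, round(len(idxs) * valid_pct))
        valid_idx.extend(idxs[:n_valid])
    return sorted(valid_idx)

# Hypothetical labels: 5 classes with 20 reviews each (not the real dataset).
labels = (["career change"] * 20 + ["current job"] * 20 + ["interview"] * 20
          + ["lost job"] * 20 + ["resume"] * 20)
valid_idx = stratified_valid_idx(labels)
print(len(valid_idx))
```

With only 15-20 reviews per class, a purely random split can leave a class with very few validation examples, which is the main reason to split per class.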

For reference, here’s what the LR finder looks like: (plot not shown)

I then trained the classifier using the code and hyperparameters from the IMDB notebook. The results were very poor. It would often start out by making all its predictions in a single class (and the train_loss would often already be lower than the valid_loss on the very first training cycle).

Beyond that, the train_loss continued to decrease while the valid_loss stayed the same or slightly increased/decreased and the accuracy stayed at around 15%-30%. A lot of the time, when the accuracy increased, the valid_loss had increased as well.
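For context on the 15%-30% figure: with 5 roughly even classes, random guessing lands near 20%, and a small validation set makes the measured accuracy very noisy. A rough back-of-the-envelope sketch, assuming a validation set of about 30 reviews (30% of roughly 100; the exact size is an assumption) and a simple binomial model:

```python
import math

n_classes = 5   # career change, current job, interview, lost job, resume
n_valid = 30    # assumed validation size: about 30% of ~100 reviews

chance = 1 / n_classes       # expected accuracy of random guessing
per_example = 1 / n_valid    # accuracy change from a single review

# Binomial standard error of accuracy for a model guessing at chance level
std_err = math.sqrt(chance * (1 - chance) / n_valid)
low, high = chance - 2 * std_err, chance + 2 * std_err

print(f"chance accuracy: {chance:.0%}")
print(f"one review moves accuracy by: {per_example:.1%}")
print(f"range within two standard errors of chance: {low:.0%} to {high:.0%}")
```

Under those assumptions, anything between roughly 5% and 35% is within two standard errors of chance, so an accuracy of 15%-30% is consistent with a model that has not learned anything useful yet.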

Is there something that I’m doing wrong here? Or is this just the result of having a small amount of data? I tried splitting each review in half to effectively double the sample size but was still only able to get up to 40%.

There’s obviously the possibility that there’s just not enough of a correlation between the text and the classes, but intuitively I’d think the model would do better than it’s doing. Career change reviews mention “career change”, “new career”, and “transition”; resume reviews have the word “resume” many times; interview reviews have “interview” a lot; etc.

Here are some examples of out-of-sample reviews that would be classified as resume:

this company spent a lot of time with me throughout the resume drafting process. my resume hadn't been updated since my first job as a teacher so it needed a lot of help! they asked quite a few questions to determine which content was most important for my resume. i was very pleased with the draft they sent for my review. they were able to update my resume to a modern look-and-feel and highlight my top attributes. i will definitely use them again for other services!

Prediction: career change, 0.8088 (resume 0.0996)

after getting free resume critiques from several companies i went with executive drafts as theirs was spot on. i was with my first job after college for 15 years and had never had to do a resume before so i needed a lot of help. also, as the first employee of a startup my role was often not defined. hazel had her work cut out for her and she did not disappoint. my new resume got noticed immediately and i was able to land a job not long after. i have also referred several friends who had success with them as well.

Prediction: lost job, 0.4415 (resume 0.2103)

I feel like there has to be a way to make this better; these predictions seem way off.

Any advice would be much appreciated.

One side note: interp.plot_top_losses throws the following error for some reason:
AttributeError: 'ClassificationInterpretation' object has no attribute 'plot_top_losses'

Hi Ryan,

You need to use the vocab from your language model when creating the classification databunch. You can simply pass it in like this: TextList.from_df(df=clas_df, path=path, vocab=data_lm.vocab) where data_lm is the databunch you used for language model fine-tuning.


Here’s the full code I used when creating the classifier:

data_clas = (TextList.from_df(df=clas_df, path=path)
         .split_by_rand_pct(0.3, seed=9)

data_clas.vocab.itos = data_lm.vocab.itos
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)

I tried removing the data_clas.vocab.itos = data_lm.vocab.itos line and instead passed the vocab in when creating the databunch, like you suggested. I got much more normal results: 45% accuracy when each review was split in half, and 60% with the reviews left whole.

Split data: (confusion matrix not shown)

Normal data: (confusion matrix not shown)

These confusion matrices look much more normal and less erratic than the ones I was getting before. Just curious, why doesn’t the code that I was using accomplish the same thing?

Additionally, do you have any insight into how to improve the accuracy of the current job and lost job classes? I think they’re the harder ones to identify, so I’m not surprised they have lower accuracy, and there are only 15-16 records in each of those classes compared to ~20 for the rest.

Finally, do you know how to fix interp.plot_top_losses() not working? It would be helpful in identifying what the classifier is confusing most heavily, so I can try improving data quality.

Thank you for your help.


I’m not entirely sure what’s going on exactly behind the scenes when you pass the vocab to the new databunch but here is what I suspect is going on.
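A toy illustration of that suspicion, in plain Python rather than fastai: if the texts are converted to ids with the classifier's own vocab and the itos list is only swapped for the language model's afterwards, the stored ids end up pointing at different tokens. The helper names (build_itos, numericalize, textify) and the tiny corpora are hypothetical, and fastai's real tokenization and vocab building are more involved than this.

```python
from collections import Counter

def build_itos(texts, specials=("xxunk", "xxpad")):
    """Build an id-to-token list: specials first, then tokens by descending frequency."""
    counts = Counter(tok for text in texts for tok in text.split())
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(specials) + [tok for tok, _ in ordered]

def numericalize(text, itos):
    """Map tokens to ids; unknown tokens get id 0 (xxunk)."""
    stoi = {tok: i for i, tok in enumerate(itos)}
    return [stoi.get(tok, 0) for tok in text.split()]

def textify(ids, itos):
    """Map ids back to tokens using the given id-to-token list."""
    return " ".join(itos[i] if i < len(itos) else "xxunk" for i in ids)

# Hypothetical corpora standing in for the language model and classifier data.
lm_texts = ["great coach helped my career change",
            "my resume got noticed",
            "the interview coach was great"]
clas_texts = ["my resume needed help", "interview prep helped"]

lm_itos = build_itos(lm_texts)
clas_itos = build_itos(clas_texts)

review = "my resume needed help"

# Case 1: ids computed with the classifier's own vocab, itos replaced afterwards.
ids_own = numericalize(review, clas_itos)
scrambled = textify(ids_own, lm_itos)    # what the pretrained encoder effectively reads

# Case 2: ids computed with the language model's vocab from the start.
ids_shared = numericalize(review, lm_itos)
consistent = textify(ids_shared, lm_itos)

print(scrambled)
print(consistent)
```

In the first case the encoder's embeddings are looked up with ids that belong to a different ordering, so the model reads the wrong words. Passing vocab=data_lm.vocab when the data is created avoids this, since the ids are computed with the language model's vocab from the start.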

Glad that it helped :)

It’s always a good sign if your intuition about the data is in line with the model’s behavior. I would go ahead and label some more data, especially since you currently have fewer examples for the harder classes. Then you can re-train the model and see if it improves.

Nope, sorry. Maybe someone else knows. But it would help to post the error message.

Make sure you are running this line before:
interp = Interpretation.from_learner(learn)