I'm trying to use this Kaggle dataset of natural disaster tweets for classification

Hi everyone! I'm currently trying to use what I learned in lesson 8 to solve a different classification problem.

Just for learning purposes, I decided to rewrite the target values to “pos” and “neg”, mapping 1 to pos and 0 to neg.
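Something like this does the remapping (a minimal sketch, assuming the train.csv from the Kaggle competition and the target column shown in the table below):

import pandas as pd

df = pd.read_csv('train.csv')
# Map the numeric labels to string labels (1 -> "pos", 0 -> "neg")
df['target'] = df['target'].map({1: 'pos', 0: 'neg'})

The first few rows then look like this: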

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all | pos |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | pos |
| 2 | 5 | NaN | NaN | All residents asked to ‘shelter in place’ are being notified by officers. No other evacuation or shelter in place orders are expected | pos |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation orders in California | pos |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school | pos |

Creating the dataloaders

dls = TextDataLoaders.from_df(df, text_col='text', is_lm=True)

dls.show_batch(max_n=3)

example below:
xxbos i entered to # win the xxup entire set of xxunk xxmaj lip xxmaj xxunk via xxunk . - xxmaj go enter ! # xxunk http : / / t.co / xxunk xxbos xxmaj to fight bioterrorism sir . xxbos xxmaj emergency xxmaj response and xxmaj hazardous xxmaj chemical xxmaj management : xxmaj xxunk and xxmaj xxunk http : / / t.co / xxunk http : / / t.co / xxunk

I notice a lot of xxunk tokens; I'm not sure if there is a way to reduce the amount. In the movie-reviews example, it was much lower.
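If I understand the tokenizer correctly, xxunk replaces words that fall below the min_freq cutoff or outside max_vocab (defaults of 3 and 60,000 in fastai). A minimal sketch of loosening those limits through the DataBlock API; the values here are just guesses and may or may not help much with tweets:

from fastai.text.all import *

# LM dataloaders with a lower min_freq so rarer words keep their own token
lm_block = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True, min_freq=2, max_vocab=60000),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.1))
dls = lm_block.dataloaders(df, bs=64)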
Then I create the learner

learn = language_model_learner(
    dls, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()

and call fine_tune for one epoch:

learn.fine_tune(1, 2e-2)
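As I understand it, fine_tune(1, 2e-2) roughly expands to one frozen epoch followed by one unfrozen epoch with discriminative learning rates; this is my approximation based on fastai's defaults (freeze_epochs=1, lr_mult=100), so the exact numbers may differ between versions:

# Roughly equivalent to learn.fine_tune(1, 2e-2) (approximation)
learn.freeze()
learn.fit_one_cycle(1, 2e-2)                     # train only the new head
learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-2/100, 1e-2))    # whole model, base_lr halved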

The result was pretty good

| accuracy |
|----------|
| 0.445406 |
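(For what it's worth, a quick way to sanity-check the language model, similar to the movie-review example, is to ask it to continue a prompt; the prompt string below is just an example.)

# Generate a few words after a prompt with the fine-tuned LM
print(learn.predict("Forest fire near", n_words=10, temperature=0.75))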

Then I create the DataBlock using this language model:

class_data = DataBlock(
    blocks=(TextBlock.from_df('text', vocab=dls.vocab), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('label'),
    splitter=RandomSplitter(0.2))

Now I only need to create the dataloaders

dls_class = class_data.dataloaders(df)

Create a new learner

learn2 = text_classifier_learner(dls_class, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()

And fit one cycle

learn2.fit_one_cycle(1, 0.01)

This gave me an accuracy of 0.772668

So I decided to continue and gradually unfreeze the model, one layer group at a time (roughly the pattern sketched below).
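The usual pattern for that looks roughly like this (a sketch; the learning rates are the fastbook defaults, not necessarily the exact ones I used):

# Unfreeze one more layer group at a time, lowering the learning rate as we go
learn2.freeze_to(-2)
learn2.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))

learn2.freeze_to(-3)
learn2.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))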

Once I had unfrozen down to the third-to-last layer group, I decided to unfreeze the whole thing:

learn2.unfreeze()
learn2.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))

The accuracy also improved, to 0.806833.

But the problem began when I tried to test it on unseen data:

texts = list(df_test["text"])

preds = []
for text in texts:
    preds.append(learn2.predict([text]))

[('neg', tensor(0), tensor([0.9354, 0.0646])),
 ('neg', tensor(0), tensor([0.9354, 0.0646])),
 ('neg', tensor(0), tensor([0.9354, 0.0646])),
 ('neg', tensor(0), tensor([0.9354, 0.0646])),
 ('neg', tensor(0), tensor([0.9354, 0.0646])),
 ...]

All the preds give me the same thing… I'm not sure where I'm making the mistake.

Hi, I just ran a sanity check on the data you mentioned with the code below.

df = pd.read_csv('train.csv')
dls = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
dls.show_batch(max_n=3)
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.3,metrics=[accuracy, Perplexity()]).to_fp16()
learn.fine_tune(1, 2e-2)
learn.fine_tune(5, 2e-2)
class_data = DataBlock(blocks=(TextBlock.from_df('text', vocab=dls.vocab), CategoryBlock),get_x=ColReader('text'), get_y=ColReader('target'), splitter=RandomSplitter(0.2))
dls_class = class_data.dataloaders(df)
learn2 = text_classifier_learner(dls_class, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()
learn2.fit_one_cycle(1, 0.01)

After this, I tried it on the test data; it does not give random predictions:

test_df = pd.read_csv('test.csv')

preds = [learn2.predict(o) for o in list(test_df.text.values)[:100]]
[('0', tensor(0), tensor([0.5898, 0.4102])),
 ('1', tensor(1), tensor([0.4133, 0.5867])),
 ('1', tensor(1), tensor([0.2071, 0.7929])),
 ('1', tensor(1), tensor([0.4660, 0.5340])),
 ('1', tensor(1), tensor([0.0378, 0.9622])),
 ('0', tensor(0), tensor([0.5982, 0.4018])),
 ('0', tensor(0), tensor([0.7736, 0.2264])),
 ('0', tensor(0), tensor([0.8643, 0.1357])),
 ('0', tensor(0), tensor([0.8536, 0.1464])),
 ('0', tensor(0), tensor([0.7704, 0.2296])),
 ('0', tensor(0), tensor([0.9391, 0.0609])),

I would recommend a few things to try:

  1. You are not actually using the LM (learn) that you fine-tuned.
  2. You have not mentioned how you got your predictions (for a batch approach, see the sketch after this list).
  3. Try running predictions without unfreezing the model.
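For predicting on a whole test set, the usual fastai pattern is a test dataloader plus get_preds rather than calling predict row by row. A minimal sketch, reusing the df_test name from your post and assuming it has a text column:

# Build a test DataLoader from the raw tweets and predict in batches
test_dl = learn2.dls.test_dl(df_test['text'].tolist())
probs, _ = learn2.get_preds(dl=test_dl)

# One row of class probabilities per tweet; the argmax index follows the
# classifier's category ordering ('neg' = 0, 'pos' = 1 in your output above)
pred_idx = probs.argmax(dim=1)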

I hope this helps solve your issue.


Ohh thanks! I don’t know why in my case it gives only 0 as a prediction.

Responding to the points listed above:

  1. Why am I not using the LM that I created? When I built the DataBlock I passed “vocab=dls.vocab”.
  2. Sorry, my bad. This is how I got the predictions:
     preds = []
     for text in texts:
         preds.append(learn2.predict([text]))
  3. I will try it without unfreezing the model.

Thank you so much!

To use the LM, you also have to save the encoder of the LM and load it for the text classifier. I am not sure if you have done that too.

learn.save_encoder('finetuned')   # <-- save the fine-tuned LM encoder

class_data = DataBlock(blocks=(TextBlock.from_df('text', vocab=dls.vocab), CategoryBlock), get_x=ColReader('text'), get_y=ColReader('target'), splitter=RandomSplitter(0.2))
dls_class = class_data.dataloaders(df)
learn2 = text_classifier_learner(dls_class, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()

learn2 = learn2.load_encoder('finetuned')   # <-- load it into the classifier

:thinking: Why did you load the encoder after creating the text_classifier_learner? Could you do it before?

I assume you are talking about this line.

learn2.load_encoder('finetuned')

You have to do this after creating the learner, because you are just modifying the encoder of an existing model. It is not a declaration; it is more like updating the weights of the encoder. The model has two primary components, the encoder and the decoder (the classification head here), and we are just borrowing the encoder from the LM.
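One way to see the two components is to print the classifier's model; with AWD_LSTM it should show a sentence encoder (the part load_encoder overwrites) followed by a pooling classification head, though the exact module names may vary by fastai version:

# Inspect the classifier: the first child is the encoder the LM weights go into,
# the second is the classification head that stays newly initialised.
print(learn2.model)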


Awesome, thank you so much!
