NLP challenge project

KevinB · June 14, 2019, 4:28pm

I wrote a blog post to discuss some of the lessons learned from the Haha 2019 challenge.

bfarzin · June 14, 2019, 6:07pm

I love the autodating idea. I have that problem also where I put a date in the name 0609 then run it for 5 days. I got saved a few times by checking in nearly everything I did with git and then being able to roll back or checkout old versions and see what changed.
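A minimal sketch of what auto-dating could look like, assuming it means stamping saved artifacts with the time of the run automatically instead of typing a date like 0609 by hand. The helper name `dated_name` and the file-name pattern are hypothetical and not taken from the blog post.

```python
from datetime import datetime

def dated_name(stem, ext="pth", now=None):
    """Build a name such as 'clas_20190614_1628.pth' from the time of the run."""
    now = now or datetime.now()
    # The format spec is applied by datetime.__format__ (same codes as strftime).
    return f"{stem}_{now:%Y%m%d_%H%M}.{ext}"

print(dated_name("clas"))
```

Calling it at save time means a run that spans several days still gets the date of each save rather than the date the notebook was first named.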

Thanks for posting this blog. Really helpful for others to see!

KevinB · June 14, 2019, 6:12pm

Thanks for the feedback @bfarzin. Hopefully it helps you with future projects.

bfarzin · June 28, 2019, 8:58pm

I followed @kevinb's and @hiromi's lead and posted my own blog about the work done on this challenge. Open to any and all feedback. DM me.

wgpubs · July 2, 2019, 6:31pm

Great job Bobak! Followed everyone’s progress on this thread and was really impressed by the results the fastai folks were able to achieve.

bfarzin · July 6, 2019, 8:43pm

In an effort to make what I learned even more distilled, I prepared this slide-deck. Feedback always welcome about how I could make it better!

heye0507 · September 17, 2019, 6:38pm

Hi Bfarzin,

Thank you very much for explaining what you did for the competition. I am currently working on something similar (tweet sentiment analysis at the individual user level, which means the company doesn’t just want a universal model; they want user A to have model A, user B to have model B).

As you can imagine, the dataset for each user will be highly imbalanced, so I am also looking for ways to handle the class imbalance problem. The one I am working on now is over-sampling.

I read your paper and dug a bit into your repo, but I couldn’t find the SMOTE sampling implementation. Would you mind sharing the sampling algorithm so I can give it a shot?

Thanks in advance. I will definitely try the SentencePiece tokenizer later to see if I can get a better result (and also follow what you did with tweepy).

Best,

bfarzin · September 18, 2019, 3:46pm

The code snippet is right here. In words:

I assume that the negative case is less frequent.
Split the DataFrame into positive and negative cases.
Resample the negative cases (with replacement) up to the length of the positive cases.
Concat the two DataFrames together into one dataset with 50% positive/negative split.
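The steps above can be sketched with pandas as follows. This is a sketch under assumptions, not the snippet from the repo: the column name `label`, the 1/0 encoding and the helper name `oversample_negatives` are hypothetical. Note that this is plain random oversampling with replacement; it duplicates existing rows and does not synthesize new ones.

```python
import pandas as pd

def oversample_negatives(df, label_col="label", seed=0):
    """Resample the (rarer) negative rows up to the number of positive rows."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    # Sampling with replacement lets the small class grow past its own size.
    neg_up = neg.sample(n=len(pos), replace=True, random_state=seed)
    balanced = pd.concat([pos, neg_up], ignore_index=True)
    # Shuffle so batches do not see all positives first.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data: 8 positive and 2 negative rows.
df = pd.DataFrame({"text": [f"tweet {i}" for i in range(10)],
                   "label": [1] * 8 + [0] * 2})
balanced = oversample_negatives(df)
print(balanced["label"].value_counts())
```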

There are other ways to approach this, including class weights on the loss function and an oversampling callback described here and implemented in the library here.
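For the class-weight option, here is a small pure-Python sketch of the idea, assuming inverse-frequency weights (one common choice, not necessarily what was used in the competition). The function names are hypothetical. In PyTorch the same weights would typically be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the sum of weights."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)

# 96 likes (class 1) and 4 dislikes (class 0), as in a tiny per-user dataset.
labels = [1] * 96 + [0] * 4
weights = inverse_frequency_weights(labels)

# A model that always says "like" with probability 0.9.
probs = [[0.1, 0.9]] * len(labels)
plain = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(probs, labels, weights)
print(weights, plain, weighted)
```

With the weights applied, the four dislikes contribute as much to the loss as the 96 likes, so always predicting "like" is no longer cheap.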

Lots of good options! Let us know what works for you and how you progress on your project.

heye0507 · September 18, 2019, 4:22pm

Hi Bfarzin,

Thanks for your reply.

They all look good! Just to make sure, this is oversampling of the minority class, right? I previously thought SMOTE oversampling is applied in feature space (based on my research online yesterday).

Weighted class is on my TODO list for today.

The project I have here is something like a toy project for company recruiting, but the more I dig into the idea of building an individual-level sentiment classifier, the more I feel it won’t work…

My approach is to fine-tune a language model with both awd_lstm and BERT (two different encoders), then build a general classifier with 70% of the data, then feed the user-level datasets one by one to fine-tune the classifier (the idea is that the decoder is warmed up, so I only need 1 epoch to fine-tune at the user level).

However, each user dataset is so small, with only 100 messages, and most of them have 96 likes and 4 dislikes. As you can tell, the individual-level classifier accuracy jumps between 50% and 100%…

All the tricks I’m applying seem to help only the general classifier, not the individual ones (again, tiny data…).

So I am starting to think that maybe the real-world project they have is not an individual classifier; they are building a feature extractor for each user…

Again, I really appreciate your input.

Best,

bfarzin · September 18, 2019, 4:53pm

I don’t understand how this would work in the feature space. My understanding is that SMOTE is used with the raw data (or at the batch level if you want to get really cute with it) so that you get balanced training. Maybe I have misunderstood it.

heye0507 · September 18, 2019, 5:04pm

I read this article; you can Ctrl+F for the “SMOTE” section (sorry, I don’t know how to create a hyperlink that jumps directly to the topic).

I also read what is written in this kernel:
https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets

But I might be wrong, as I only switched to NLP last week. Most of my understanding comes from fastai.vision…

Even though Jeremy said that all the tricks we did in part 2 also apply to NLP, I had a very hard time understanding the NLP input/output in the beginning. Also, thinking through LSTM gates is not as easy as thinking about the receptive field in CV models…

heye0507 · September 18, 2019, 5:08pm

But I really don’t feel very comfortable using oversampling in NLP.

You have to do it after you split your data, otherwise you might end up with training data in the validation set. (Yes, this happens in CV as well, but there we have augmentation.)
You can’t really change oversampled text, but in computer vision you can rotate, flip, change lighting, etc., so the inputs end up looking different. My feeling is that if you oversample in NLP, you are prone to overfit.
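The first point can be illustrated with a small sketch: hold out the validation rows first, then oversample only what is left for training. The function name `split_then_oversample` and the `(text, label)` tuple layout are assumptions made for the example.

```python
import random

def split_then_oversample(rows, valid_frac=0.2, seed=0):
    """Split first, then duplicate minority rows in the training part only."""
    rng = random.Random(seed)
    rows = rows[:]                      # do not mutate the caller's list
    rng.shuffle(rows)
    n_valid = int(len(rows) * valid_frac)
    valid, train = rows[:n_valid], rows[n_valid:]

    pos = [r for r in train if r[1] == 1]
    neg = [r for r in train if r[1] == 0]
    minority, majority = (neg, pos) if len(neg) < len(pos) else (pos, neg)
    # Duplicates are drawn from training rows only, so nothing leaks into valid.
    extra = rng.choices(minority, k=len(majority) - len(minority)) if minority else []
    return train + extra, valid

# Toy data: 80 positive and 20 negative messages with unique texts.
rows = [(f"msg {i}", 1 if i < 80 else 0) for i in range(100)]
train, valid = split_then_oversample(rows)
print(len(train), len(valid))
```

Oversampling before the split would let copies of the same message land on both sides, which inflates the validation score.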

bfarzin · September 18, 2019, 5:34pm

I had not seen SMOTE used that way in the past with changing the feature vectors. That is an interesting approach.
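For reference, the feature-space version works by interpolating between a minority sample and one of its nearest minority neighbours, which creates new points rather than copies. Below is a minimal NumPy sketch of that idea; the function name `smote_like` and its parameters are hypothetical, and it is a simplification rather than a library implementation. For text it would need numeric features (for example TF-IDF vectors or sentence embeddings), since raw strings cannot be interpolated.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))         # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Four minority samples in a 2-d feature space.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_min, n_new=10)
print(synthetic.shape)
```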

You could be confusing SMOTE and transformations of data to boost your data set. (or maybe I am confusing the two!)

SMOTE is re-sampling from the original data to balance classes. If you imagine a case where you have 75% one class, 25% second class your prediction can be just class 0 and you will get 75% correct and you might not get a gradient that can push you to even look at the second class. If you re-sample from the second class so you present the training data with a 50/50 split, you can then get meaningful gradients and learn a model that can discriminate the two cases out-of-sample.
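The 75/25 case can be checked in a few lines. This toy example uses made-up labels and a "model" that always predicts class 0:

```python
# 75 samples of class 0 and 25 samples of class 1.
labels = [0] * 75 + [1] * 25
preds = [0] * len(labels)            # always predict the majority class

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
hits_on_minority = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
minority_recall = hits_on_minority / labels.count(1)

print(accuracy)         # 0.75
print(minority_recall)  # 0.0
```

Accuracy looks acceptable even though the second class is never found, which is why a balanced training set (or a different metric) matters here.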

Transformations will boost your training data. In computer vision, these are clear flip/rotate transforms and more. They will take your small amount of data and create a “bigger” data set.

There are ways to increase your data set with “back-translation” but I have yet to get that working myself. Maybe you have been thinking about transformations of the data all along?

heye0507 · September 18, 2019, 5:45pm

Thanks for all the inputs.

I will report back on how things go in my case, for both awd_lstm and BERT in fastai.

In the meantime, I will study the v2 walk-thrus as much as possible.

ilovescience · September 19, 2019, 5:04am

Reading this thread, I want to comment on a couple things.

The Oversampling Callback in the library oversamples all the classes to the same level as the majority class.

The oversampling callback does this automatically.

There are augmentations for NLP also. For example, you can think about replacing certain words with synonyms. See here. Some of the image augmentations might also work on the embeddings as well. For example, mixup can be used for NLP (here, and also mentioned in the forums). So if overfitting is a worry, there are options!
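As a rough illustration of the mixup idea on text, here is a NumPy sketch that mixes sentence-level embeddings and their one-hot labels. It assumes each message has already been turned into a fixed-size vector; the function name `mixup`, the shapes and `alpha=0.4` are choices made for the example, not the fastai implementation.

```python
import numpy as np

def mixup(emb, y_onehot, alpha=0.4, seed=0):
    """Blend each example with a randomly chosen partner from the same batch."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)             # mixing coefficient in [0, 1]
    perm = rng.permutation(len(emb))         # partner index for every example
    mixed_emb = lam * emb + (1 - lam) * emb[perm]
    mixed_y = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return mixed_emb, mixed_y, lam

# A batch of 6 fake sentence embeddings (dimension 4) with binary labels.
rng = np.random.default_rng(1)
emb = rng.normal(size=(6, 4))
y_onehot = np.eye(2)[[1, 1, 1, 1, 0, 0]]
mixed_emb, mixed_y, lam = mixup(emb, y_onehot)
print(mixed_emb.shape, mixed_y.sum(axis=1))
```

Because the mixing happens on vectors rather than tokens, it sidesteps the problem that raw text cannot be rotated or flipped the way an image can.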

heye0507 · September 19, 2019, 5:11am

Wow, I didn’t know that mixup can be applied to NLP.

Thank you so much! I will take a look.

Currently I don’t have a way to show whether my approach is going in the right direction, since my model can easily reach accuracy above 90% (yes, if you are around me you will hear me complain that I don’t have a LB to check…).

My crazy idea is: grab my LSTM language model and the message I need to classify. Instead of reporting accuracy, like what they showed me, do a PCA and map the embedding to 2D space. Grab the other people’s embedding and map it to 2D space as well. Then compare the distance between the two. (I was going to run a similarity check on the two embeddings, but they have different sizes… I think PCA should work fine.)

But thanks for the input

ilovescience · September 19, 2019, 5:26am

But why would you expect the embeddings to be close in distance? I would think they would only be close if you are using the same architecture, very similar training, etc. Otherwise, your model could pick up different things and generate different embeddings. Sorry, I am not an expert in NLP, so maybe it’s true that there are some features of the English language that might lead to generally similar embeddings.

heye0507 · September 19, 2019, 5:41am

I am not an expert in NLP either; I actually picked up NLP last week. Before that, I had only trained on IMDB and written RNNs to predict the next number… (plus watching the seq2seq lesson twice, you know what I am talking about :))

Here is the question: this is not Kaggle, so I don’t have defined metrics and I don’t have a LB. How can I prove that my model is doing well? Based on accuracy, the BERT model they used is about as accurate as mine (but with such small and highly imbalanced datasets, accuracy is really not a good metric to use). But they use BERT… How do you convince people that an LSTM is doing the same thing as BERT?

Therefore, I have this idea: pass the small dataset through BERT and grab its output embedding. Pass the same dataset through the LSTM I built and grab its output embedding. Now I have emb_A and emb_B for the same sentence. If I do PCA, can I prove these two embeddings are similar? (Maybe my understanding of PCA is wrong.) If yes, can I now say the LSTM is at least as good as BERT?

The idea comes from this: if I pass one image through resnet 50 and through resnet 101 and draw the heat maps, can I see the two heat maps focusing on similar areas?
If yes, my resnet 50 is capturing at least as much as resnet 101. Yes?

ilovescience · September 19, 2019, 5:50am

Then do the same for text! Show that the LSTM is focusing on similar parts of the text as the BERT model. fastai already has this implemented for the AWD-LSTM over here so you might be able to use the source code for inspiration. Unfortunately they mention in the docs:

This was designed for AWD-LSTM only for the moment, because Transformer already has its own attentional model.

So you will have to write separate code for the transformer. Does this seem like something reasonable for your usecase?

heye0507 · September 19, 2019, 6:01am

Wow, I can’t say how much I have learned just by talking to you guys. I really appreciate the input.

And yes, I still struggle to understand attention and the transformer (they are blockers for me because I have not yet implemented seq2seq myself…). Therefore I can’t think through both of them right now… But I get the idea.

My question is: if text_1 goes through both BERT and the LSTM and I grab the final hidden state before pooling (maybe I shouldn’t have said embedding in the first place, or maybe it is called an embedding…), and I just do a PCA analysis on these two things, will I see that they are actually close? PCA captures the most important directions (vectors?), right? Anyway, people also pointed out that I should know how to plot TF-IDF, word2vec, or PCA for data exploration…

Or I should say this,

If I pass dataset_1 through BERT, grab each of the feature embeddings (hidden state before the classification head), and do PCA on them (say PCA_result_1),
then pass the same dataset_1 through the LSTM, grab each of the feature embeddings (hidden state before the classification head), and do PCA on them (say PCA_result_2),
and compare these two PCA results: if the BERT PCA distribution is similar to the LSTM distribution, should the classifiers perform similarly?
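One hedged way to sketch that comparison in NumPy: project each model's hidden states to 2D with PCA, then compare the pairwise distances between messages rather than the raw coordinates, because the PCA axes of two different models are not aligned (and each axis has an arbitrary sign). The distance-correlation score is a swapped-in choice for this example, the function names are hypothetical, and the "BERT" and "LSTM" arrays below are synthetic stand-ins, not real model outputs. A high score only says both models arrange the messages similarly; it does not by itself show that the classifiers perform equally well.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

def pairwise_dist(Z):
    """Upper-triangle vector of Euclidean distances between rows of Z."""
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    return d[np.triu_indices(len(Z), k=1)]

def geometry_agreement(emb_a, emb_b):
    """Correlation between the 2-d pairwise distances of two embeddings."""
    da = pairwise_dist(pca_2d(emb_a))
    db = pairwise_dist(pca_2d(emb_b))
    return np.corrcoef(da, db)[0, 1]

# Synthetic stand-ins: the same 50 "messages" embedded in 32 and 16 dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 8)) * np.array([5.0, 3.0, 1, 1, 1, 1, 1, 1])
Qa, _ = np.linalg.qr(rng.normal(size=(32, 8)))   # orthonormal columns
Qb, _ = np.linalg.qr(rng.normal(size=(16, 8)))
emb_bert_like = latent @ Qa.T                    # shape (50, 32)
emb_lstm_like = latent @ Qb.T                    # shape (50, 16)

print(geometry_agreement(emb_bert_like, emb_lstm_like))
```

Working with distances also removes the problem of the two hidden states having different sizes, since both are reduced to one number per pair of messages.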