Thanks for the feedback @bfarzin. Hopefully it helps you with future projects
Great job Bobak! Followed everyone’s progress on this thread and was really impressed by the results the fastai folks were able to achieve.
In an effort to make what I learned even more distilled, I prepared this slide-deck. Feedback always welcome about how I could make it better!
Thank you very much for explaining what you did for the competition. I am currently working on something similar: tweet sentiment analysis at the individual user level, which means the company doesn’t just want one universal model; they want user A to have model A, user B to have model B, and so on.
As you can imagine, the dataset for each user will be highly imbalanced, so I am also looking for ways to handle the class imbalance problem. The approach I am working on right now is over-sampling.
I read your paper and dug a bit into your repo, but I couldn’t find the SMOTE sampling implementation. Would you mind sharing the sampling algorithm so I can give it a shot?
Thanks in advance. I will definitely try the SentencePiece tokenizer later to see if I can get a better result (and also follow what you did with tweepy).
The code snippet is right here. In words:
- I assume that the negative case is less frequent.
- Split the DataFrame into positive and negative cases.
- Resample the negative cases (with replacement) up to the length of the positive cases.
- Concat the two DataFrames together into one dataset with 50% positive/negative split.
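For anyone reading without the linked snippet, the four steps above can be sketched roughly as follows. This is a minimal sketch, not the original code: the function name `oversample_negative`, the `label` column, and the 1 = positive / 0 = negative encoding are all assumptions made for illustration.

```python
import pandas as pd

def oversample_negative(df: pd.DataFrame, label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Balance a binary dataset by resampling the rarer negative class with replacement."""
    # Assumption: 1 marks the positive (frequent) class, 0 the negative (rare) class.
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    # Draw negative rows with replacement until there are as many as positives.
    neg_up = neg.sample(n=len(pos), replace=True, random_state=seed)
    # Stack both halves, then shuffle so the classes are interleaved.
    balanced = pd.concat([pos, neg_up], ignore_index=True)
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data: 8 positive and 2 negative rows.
toy = pd.DataFrame({"text": [f"tweet {i}" for i in range(10)], "label": [1] * 8 + [0] * 2})
balanced = oversample_negative(toy)
```

Note that every negative row in `balanced` is an exact copy of one of the original negatives; nothing new is synthesized.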
Lots of good options! Let us know what works for you and how you progress on your project.
Thanks for your reply.
They all look good! Just to make sure, this is oversampling of the minority class, right? I previously thought SMOTE oversampling was applied in feature space (based on my research online yesterday).
Weighted classes are on my TODO list for today.
The project I have here is a kind of toy project for company recruiting, but the more I dig into the idea of building an individual-level sentiment classifier, the more I feel it won’t work…
My approach is to fine-tune a language model with both awd_lstm and BERT (two different encoders), build a general classifier with 70% of the data, and then feed in the user-level datasets one by one to fine-tune the classifier (the idea is that the decoder is already warmed up, so I only need 1 epoch to fine-tune at the user level).
However, each user dataset is tiny, with only 100 messages, and most of them have 96 likes and 4 dislikes. As you can guess, the individual-level classifier accuracy jumps between 50% and 100%…
All the tricks I’m applying seem to help only the general classifier, not the individual ones (again, tiny data…).
So I am starting to think that maybe the real-world project they have is not an individual classifier; maybe they are building a feature extractor for each user…
Again, I really appreciate your input.
I don’t understand how this would work in the feature space. My understanding is that SMOTE is used with the raw data (or at the batch level if you want to get really cute with it) so that you get balanced training. Maybe I have misunderstood it.
I read this article; you can Ctrl+F search for the “SMOTE” section (sorry, I don’t know how to create a hyperlink that jumps directly to the topic).
I also read what is written in this kernel.
But I might be wrong, as I only switched to NLP last week. Most of my understanding comes from fastai.vision…
Even though Jeremy said that all the tricks we did for part 2 also apply to NLP, I had a very hard time understanding the NLP input/output in the beginning. Also, thinking through LSTM gates is not as easy as thinking about the receptive field in CV models…
But I really don’t feel very comfortable using oversampling in NLP.
You have to do it after you split your data, otherwise you might end up with training data in the validation set. (Yes, this happens in CV as well, but there we have augmentation.)
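To make the ordering concrete, here is a rough sketch of "split first, oversample second" using only the standard library. The function name `split_then_oversample`, the `(text, label)` row format, and the 20% validation fraction are assumptions for illustration only.

```python
import random

def split_then_oversample(rows, valid_frac=0.2, seed=0):
    """rows: list of (text, label) pairs with labels 0/1. Returns (train, valid)."""
    rng = random.Random(seed)
    rows = rows[:]  # copy so the caller's list is left alone
    rng.shuffle(rows)
    n_valid = int(len(rows) * valid_frac)
    valid, train = rows[:n_valid], rows[n_valid:]

    # Group the training rows by label; the validation rows are never touched.
    by_label = {0: [], 1: []}
    for row in train:
        by_label[row[1]].append(row)
    minority, majority = sorted(by_label.values(), key=len)
    if not minority:
        raise ValueError("one class is missing from the training split")

    # Duplicate random minority rows until both classes have the same count.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    train = train + extra
    rng.shuffle(train)
    return train, valid

# Toy data: 80 positive and 30 negative messages, all with unique text.
rows = [(f"msg {i}", 1) for i in range(80)] + [(f"msg {i}", 0) for i in range(80, 110)]
train, valid = split_then_oversample(rows)
```

Because the duplicates are drawn only from `train`, no copied message can leak into `valid`.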
You can’t really change oversampled text, but in computer vision you can rotate, flip, change lighting, etc., so the inputs end up looking different. My feeling is that if you oversample in NLP, you are prone to overfitting.
I had not seen SMOTE used that way in the past with changing the feature vectors. That is an interesting approach.
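For reference, SMOTE as it is usually described (Synthetic Minority Over-sampling Technique) does work on feature vectors: it creates new minority points by interpolating between a minority sample and one of its nearest minority neighbours, whereas duplicating rows is plain random over-sampling. A minimal NumPy sketch of that interpolation idea, with the function name `smote_like` and its parameters made up for illustration (it is not a full or tuned implementation):

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows
    and one of their k nearest minority neighbours (needs more than k rows)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: four points in a 2-D feature space.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synth = smote_like(X_min, n_new=6)
```

This only makes sense once the text has been turned into numeric features; there is no obvious way to interpolate between two raw tweets.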
You could be confusing SMOTE and transformations of data to boost your data set. (or maybe I am confusing the two!)
SMOTE, as I understand it, is re-sampling from the original data to balance classes. If you imagine a case where you have 75% one class and 25% a second class, your prediction can be just class 0 and you will get 75% correct, and you might not get a gradient that can push you to even look at the second class. If you re-sample from the second class so that you present the training data with a 50/50 split, you can then get meaningful gradients and learn a model that can discriminate the two cases out-of-sample.
Transformations will boost your training data. In computer vision, these are clear flip/rotate transforms and more. They will take your small amount of data and create a “bigger” data set.
There are ways to increase your data set with “back-translation” but I have yet to get that working myself. Maybe you have been thinking about transformations of the data all along?
Thanks for all the inputs.
I will report back on how things go in my case, both for awd_lstm and BERT in fastai.
In the meantime, I will study the v2 walk-thrus as much as possible.
Reading this thread, I want to comment on a couple things.
The Oversampling Callback in the library oversamples all the classes to the same level as the majority class.
The oversampling callback does this automatically.
There are augmentations for NLP also. For example, you can think about replacing certain words with synonyms. See here. Some of the image augmentations might also work on the embeddings as well. For example, mixup can be used for NLP (here, and also mentioned in the forums). So if overfitting is a worry, there are options!
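As a rough illustration of the mixup idea on text, here is a small NumPy sketch that mixes a batch of already-embedded (and padded to equal length) sequences with a shuffled copy of itself. The function name `mixup_batch`, the `alpha` value, and the choice to mix at the embedding level are assumptions for this sketch; it is not the fastai callback.

```python
import numpy as np

def mixup_batch(emb, y_onehot, alpha=0.4, seed=0):
    """Blend each example with a randomly chosen partner from the same batch.

    emb:      float array of shape (batch, seq_len, dim), already padded.
    y_onehot: float array of shape (batch, n_classes).
    """
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)       # mixing weight in [0, 1]
    perm = rng.permutation(len(emb))   # partner index for every example
    mixed_emb = lam * emb + (1 - lam) * emb[perm]
    mixed_y = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return mixed_emb, mixed_y, lam

# Toy batch: 4 sequences of 5 tokens with 8-dimensional embeddings, 2 classes.
rng = np.random.default_rng(1)
emb = rng.normal(size=(4, 5, 8))
y = np.eye(2)[[0, 1, 1, 0]]
mixed_emb, mixed_y, lam = mixup_batch(emb, y)
```

The labels are mixed with the same weight as the inputs, so the loss has to accept soft targets.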
Wow, I didn’t know that mixup could be applied to NLP.
Thank you so much! I will take a look.
Currently I don’t have a way to tell whether my approach is heading in the right direction, since my model can easily reach accuracy above 90% (yes, if you are around me you will hear me complain that I don’t have an LB to check…).
My crazy idea is: grab my LSTM language model and the message I need to classify. Instead of reporting accuracy, like what they showed me, do a PCA and map the embedding to 2D space. Grab other people’s embedding and map it to 2D space too. Compare the distance between the two. (I was going to run a similarity check on the two embeddings, but they have different sizes… I think PCA should work fine.)
But thanks for the input.
But why would you expect the embeddings to be close in distance? I would think they would only be close if you are using the same architecture, training very similarly, etc. Otherwise, your model could pick up different things and generate different embeddings. Sorry I am not an expert in NLP so maybe it’s true that there are some features of the English language that might lead to generally similar embeddings.
I am not an expert in NLP either; I actually picked up NLP last week. Before that, I had only trained IMDB and written RNNs to predict the next number… (plus watching the seq2seq material twice, you know what I am talking about :))
Here is the question: this is not Kaggle, I don’t have a defined metric, and I don’t have an LB. How can I prove that my model is doing well? Based on accuracy, the BERT model they used has similar accuracy to mine (but with such small and highly imbalanced datasets, accuracy is really not a good metric to use). But they use BERT… How do you convince people that the LSTM is doing the same thing as BERT?
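To show why accuracy is misleading here, take the 96 likes / 4 dislikes split mentioned earlier and a "model" that always predicts like. A small standard-library sketch (the helper name `binary_report` is made up for illustration):

```python
def binary_report(y_true, y_pred, positive):
    """Accuracy, balanced accuracy and F1, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    recall_pos = tp / (tp + fn) if tp + fn else 0.0
    recall_neg = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall_pos
    f1 = 2 * precision * recall_pos / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "balanced_accuracy": (recall_pos + recall_neg) / 2,
        "f1": f1,
    }

# One user: 96 likes (1) and 4 dislikes (0); the model always predicts "like".
y_true = [1] * 96 + [0] * 4
y_pred = [1] * 100
report = binary_report(y_true, y_pred, positive=0)  # score the rare "dislike" class
# accuracy is 0.96, balanced accuracy is 0.5, F1 for "dislike" is 0.0
```

A metric that looks at the minority class separately (balanced accuracy, per-class recall, F1) makes the two models comparable in a way plain accuracy does not.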
Therefore, I have this idea: if I pass the small dataset through BERT, I can grab its output embedding. If I pass the same dataset through the LSTM I built, I can grab its output embedding too. Now I have emb_A and emb_B for the same sentence. If I do PCA, can I show that these two embeddings are similar? (Maybe my understanding of PCA is wrong.) If yes, can I then say the LSTM is at least as good as BERT?
The idea comes from this: if I pass one image through resnet 50 and through resnet 101 and draw the heat maps, can I see the two heat maps focusing on similar areas?
If yes, my resnet 50 is capturing things at least as well as resnet 101. Yes?
Then do the same for text! Show that the LSTM is focusing on similar parts of the text as the BERT model. fastai already has this implemented for the AWD-LSTM over here so you might be able to use the source code for inspiration. Unfortunately they mention in the docs:
This was designed for AWD-LSTM only for the moment, because Transformer already has its own attentional model.
So you will have to write separate code for the transformer. Does this seem like something reasonable for your use case?
Wow, I can’t say how much I learned just by talking to you guys. I really appreciate the input.
And yes, I still struggle to understand attention and transformers (they are my blockers because I have not yet implemented seq2seq myself…). So I can’t think it through for both of them right now… But I get the idea.
My question is: if text_1 goes through both BERT and the LSTM and I grab the final hidden state before pooling (maybe I shouldn’t have called it an embedding in the first place, or maybe it is called an embedding…), and I just do a PCA analysis on these two things, will I see that they are actually close? PCA captures the most important directions (vectors?), right? Anyway, people have also pointed out that I should know how to plot TF-IDF, word2vec, or PCA for data exploration…
Or I should put it this way:
If I pass dataset_1 through BERT, grab each of the feature embeddings (hidden state before the classification head), and do PCA on them (say PCA_result_1),
then pass the same dataset_1 through the LSTM, grab each of the feature embeddings (hidden state before the classification head), and do PCA on them (say PCA_result_2),
and compare these two PCA results: if the BERT PCA distribution is similar to the LSTM distribution, should the classifiers perform similarly?
Thinking about this further, if you generate embeddings for the words in the dataset with your different models and plot them with a dimensionality reduction technique, we should see decent clustering in both cases. I think this is probably most true for simple problems like sentiment classification, where there are only a couple of classes and we know what types of words are positive and negative. So your embedding approach might also work.
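A rough sketch of that comparison, assuming you already have one feature matrix per model for the same labelled messages. The random arrays below are stand-ins for real BERT and LSTM features, the widths 768 and 400 are placeholders, and `pca_2d` and `class_separation` are hypothetical helper names:

```python
import numpy as np

def pca_2d(features):
    """Project rows onto their first two principal components."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

def class_separation(points, labels):
    """Distance between the two class centroids divided by the average within-class spread."""
    points, labels = np.asarray(points), np.asarray(labels)
    a, b = points[labels == 0], points[labels == 1]
    between = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    within = (a.std(axis=0).mean() + b.std(axis=0).mean()) / 2
    return between / within

# Stand-in features: 100 messages, two classes, different hidden sizes per model.
rng = np.random.default_rng(0)
labels = np.array([0, 1] * 50)
feats_bert = rng.normal(size=(100, 768)) + labels[:, None] * 2.0
feats_lstm = rng.normal(size=(100, 400)) + labels[:, None] * 2.0

proj_bert, proj_lstm = pca_2d(feats_bert), pca_2d(feats_lstm)
sep_bert = class_separation(proj_bert, labels)
sep_lstm = class_separation(proj_lstm, labels)
```

Since each model gets its own PCA axes, the 2D coordinates themselves are not comparable across models; what can be compared is the structure, for example how well the classes separate in each plot. Similar plots are suggestive, not proof that the two classifiers perform the same.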
Yes… That’s exactly what I am trying to prove…
I got feedback that the information I give out is too dense, so I am trying to compare things in the domain they are most comfortable with…
As you mentioned, in the simple 0/1 case we will probably just see a very decent clustering distribution, which might not extend to any further use cases… But if I can use it to get people to start accepting what I said, it might be worth a try?
Explaining things to non-technical people is something I have heard about a lot… and clearly I failed last Friday when I was asked what AWD_LSTM is and what AWD stands for… (Now I know…)