Hi, using lesson3-IMDB notebook as a reference, I am solving the Quora insincere questions challenge on Kaggle. Right now I just want to solve the problem without really worrying about the rules.
As far as I can tell, I have done everything right, and I do get a high accuracy (~96%). However, when I use the model for a single prediction on a question taken from the training data, it fails miserably.
For example:
In: learn.predict("Why are men selective?")
Out: (Category 0, tensor(0), tensor([0.6251, 0.3749]))
The complete notebook can be seen here
I haven't done any feature engineering yet.
What could be the issue here? Please share your thoughts/ideas.
As I wrote this, it occurred to me that one thing I need to check is the presence of insincere questions in the validation set.
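A quick way to run that check, and to see why ~96% accuracy may not mean much here, is to count the labels and compute what a model would score by always predicting the majority class. This is only a minimal sketch: `majority_baseline` is a hypothetical helper, and the 94/6 split below is made-up illustration data, not the real competition ratio.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_class, accuracy obtained by always predicting it)."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Hypothetical validation labels: 94 sincere (0) and 6 insincere (1).
valid_labels = [0] * 94 + [1] * 6

cls, acc = majority_baseline(valid_labels)
print(f"always predicting {cls} gives accuracy {acc:.2f}")
```

If the baseline accuracy on the real labels is close to the accuracy of the trained model, the model may simply be predicting the majority class, and a metric such as F1 would be more informative than accuracy.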
Also, can we use the ClassificationInterpretation for this (text) problem? It works but docs say it should only be used for vision.
One thing I notice is that the language model isn't working well: the generated "questions" appear to be random words. Unfortunately I'm having the same problem (for a different task and when I run the imdb notebook) and don't know how to fix it. (My own post on the problem: https://forums.fast.ai/t/troubleshooting-word-salad-output-of-text-generator/33245)
@tinhb are you getting sensible predictions from the language model?
With the constraint on compute, the usual workflow of language model then text classifier might not work here. If we want to use language modeling then it will need to be high learning rate and short, imo.
I don't think you need to check for the presence of sincere samples in the validation set; the split step in the databunch should handle it.
Most probably we are overfitting our dataset, hence always predicting 1. Maybe we can increase the dropout to reduce overfitting.
Btw I was trying the same but somehow kept getting this issue and never managed to fit.
I've been working on this competition too. For me the bigger problem is that it's not necessarily possible to run the pipeline from the IMDB notebook with the time and compute allotted in the kernel. At best it's very slow. Additionally, the competition rules specify no outside data, so it's actually against the rules to use the wikitext-103 pretrained model. There are some whitelisted pretrained models, but I don't know how easily they plug into our model; I haven't gotten to that yet. I'm not sure if there are workarounds to these problems or if the rules for this competition just make using fastai impractical. I ran a kernel with 5% of the data and no pretrained model, just to see if it would work at all, and I was able to get through it, but with predictably terrible results. If someone has a fastai-based kernel that's working well I would love to see it.
Hey @nikhil_no_1
thanks for sharing your notebook. I just started looking into this competition, so it's helpful, but I can't provide help yet.
I was wondering if you have used the embeddings that come with the Kaggle challenge?
Thanks!
Not yet. I wanted to make it work with fastai first.
I got side-tracked with other things so wasn't able to spend any time on it for a month.
Hope to get back to it in a few days.
I've also tried to use fastai on this competition, but the time and no-external-data constraints don't allow you to get very far. Nevertheless I reached a public LB score of 0.607 with fastai, without the provided word embeddings, under the kernel time constraints. I can share the notebook if anyone is interested.
I was trying to clean up the code and add some comments, but it turns out the fastai version on Kaggle was updated in the meantime. I think the version at the time it was working was 1.0.36.post1.
Basically the solution consists of:
Training a language model (not pretrained due to competition constraints). I trained on all the training data (leaving 10% for validation) for only one epoch (due to time constraint).
Then for the classification task I loaded the encoder from the language model as usual, trained for one epoch, unfroze and trained for 2 more epochs. This step was done with only 30% of the train data (leaving 10% for validation, again due to time constraint).
Finally, find the best threshold based on validation set and create submission csv.
All this needs to run in 2h at most.
There is clearly a lot of room for improvement if we remove the constraints of time and external data!
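For anyone wondering what the threshold step might look like, here is a rough sketch rather than the code from my kernel. It assumes the score to maximise is F1 and that you already have two arrays from the validation set: the true labels and the predicted probability of the insincere class. The names `f1_score_binary` and `best_threshold`, the threshold grid, and the toy arrays are all made up for illustration.

```python
import numpy as np

def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def best_threshold(y_true, probs, thresholds=None):
    """Sweep candidate thresholds and return (threshold, f1) with the highest F1."""
    if thresholds is None:
        # Assumed grid: 0.05 to 0.95 in steps of 0.01.
        thresholds = np.arange(0.05, 0.96, 0.01)
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    scores = [f1_score_binary(y_true, (probs >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))  # first threshold reaching the best score
    return float(thresholds[best]), scores[best]

# Toy validation data (hypothetical): labels and P(insincere).
y_valid = [0, 0, 0, 1, 1]
p_valid = [0.10, 0.20, 0.30, 0.40, 0.90]

threshold, score = best_threshold(y_valid, p_valid)
print(f"best threshold {threshold:.2f}, F1 {score:.3f}")
```

The chosen threshold is then applied to the test-set probabilities to produce the 0/1 labels for the submission csv. Because it is tuned on the validation set, the F1 it reports there is likely to be a little optimistic.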
learn.predict('How to learn Chinese', n_words=30, temperature=0.75)
output: "How to learn Chinese from China ? xxbos How do you feel about your child having a friend who talks to you ? What can you tell him about him ?"