Hi, using the lesson3-IMDB notebook as a reference, I am working on the Quora insincere questions challenge on Kaggle. Right now I just want to solve the problem without really worrying about the competition rules.
As far as I can tell, I have done everything right, and I get a high accuracy of ~96%. However, when I use the model to predict a single example taken from the existing training data, it fails miserably.
In: learn.predict("Why are men selective?")
Out: (Category 0, tensor(0), tensor([0.6251, 0.3749]))
The complete notebook can be seen here
I haven’t done any feature engineering yet.
What could be the issue here? Please share your thoughts/ideas.
As I wrote this, it occurred to me that one thing I need to check is the presence of insincere questions in the validation set.
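A quick way to run that check is to count the labels in the validation set and compare the model's accuracy against what you would get by always predicting the most common class. The sketch below is only an illustration: the label list is made up (it is not the real class distribution of the competition data), and `label_summary` is a hypothetical helper, not part of fastai.

```python
from collections import Counter

def label_summary(labels):
    """Return per-class counts and the accuracy of always predicting the majority class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    majority_class, majority_count = counts.most_common(1)[0]
    return {
        "counts": dict(counts),
        "majority_class": majority_class,
        "majority_baseline_accuracy": majority_count / total,
    }

# Hypothetical validation labels: 0 = sincere, 1 = insincere (made-up proportions).
valid_labels = [0] * 94 + [1] * 6
summary = label_summary(valid_labels)
print(summary)
```

If the baseline accuracy is already close to the accuracy the model reports, the accuracy number alone says little about how well insincere questions are being detected.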
Also, can we use the ClassificationInterpretation for this (text) problem? It works but docs say it should only be used for vision.
With the constraint on compute, the usual workflow of language model then text classifier might not work here. If we want to use language modeling, the training will need to be short and use a high learning rate, imo.
I don’t think you need to check for the presence of sincere samples in the validation set; the split step in the databunch should handle it.
Most probably we are overfitting our dataset, hence always predicting 1. Maybe we can increase the dropout to reduce overfitting.
Btw, I was trying the same but somehow kept getting this issue and never managed to fit.
I’ve been working on this competition too. For me the bigger problem is that it’s not necessarily possible to run the pipeline from the IMDB notebook with the time and compute allotted in the kernel. At best it’s very slow. Additionally, the competition rules specify no outside data, so it’s actually against the rules to use the wikitext-103 pretrained model. There are some whitelisted pretrained models, but I don’t know how easily they plug into our model - haven’t gotten to that yet. I’m not sure if there are workarounds to these problems or if the rules for this competition just make using fastai impractical. I ran a kernel with 5% of the data and no pretrained model, just to see if it would work at all, and I was able to get through it, but with predictably terrible results. If someone has a fastai-based kernel that’s working well I would love to see it.
I tried to implement the IMDB notebook, but the restrictions of this competition make it really hard. You don’t have the time to train a language model and you are not allowed to download the pretrained Wiki model.
So we need to find another way and make use of the embeddings which are given.
thanks for sharing your notebook. I just started looking into this competition, so it’s helpful, but I can’t provide help yet.
I was wondering if you have used the embeddings that come with the Kaggle challenge?
I’ve also tried to use fastai on this competition, but the time and no-external-data constraints don’t let you get very far. Nevertheless I reached a public LB score of 0.607 with fastai without the provided word embeddings, under the kernel time constraints. I can share the notebook if anyone is interested.
I was trying to clean up the code and add some comments, but it turns out the fastai version on Kaggle was updated in the meantime. I think the version at the time it was working was 1.0.36.post1.
Basically the solution consists of:
Training a language model (not pretrained due to competition constraints). I trained on all the training data (leaving 10% for validation) for only one epoch (due to time constraint).
Then for the classification task I loaded the encoder from the language model as usual, trained for one epoch, unfroze the model and trained for 2 more epochs. This step was done with only 30% of the train data (leaving 10% for validation, again due to the time constraint).
Finally, find the best threshold based on the validation set and create the submission csv.
All this needs to run in 2h at most.
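The threshold step can be sketched roughly as follows. This is not the code from the notebook: it assumes the score being optimized is F1 on the positive (insincere) class, and the validation labels and probabilities are made-up numbers. `f1_score` and `best_threshold` are hypothetical helpers written in plain Python.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, probs, candidates=None):
    """Try each candidate threshold and keep the first one with the highest F1."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

# Hypothetical validation labels and predicted probabilities of class 1.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.20, 0.30, 0.37, 0.45, 0.60, 0.90]

threshold, score = best_threshold(y_true, probs)
default_score = f1_score(y_true, [1 if p >= 0.5 else 0 for p in probs])
print(threshold, score, default_score)
```

In this toy data the best threshold falls below 0.5, so a question scored at 0.37 for class 1 would be flagged even though a plain argmax would label it 0.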
There is clearly a lot of room for improvement if we remove the constraints of time and external data!