I’m trying the Quora competition on Kaggle. It’s very similar to IMDB lesson 3. I’ve trained a model using the IMDB notebook as a guideline and now I want to submit to see how I’m doing. The submission is supposed to be a csv with two columns: qid (the question id #) and prediction (0 or 1). I know I’m supposed to use my model to make predictions on the test data, but I haven’t been able to quite figure out how to do this.
Do I need learn.predict or learn.get_preds?
Can I pass something like ds_type='test' as a parameter to the above, or do I need to explicitly load in the test data?
If the latter, do I need to perform processing (tokenization etc.) on the test data exactly as I did on the training data?
So normally you would either create the test dataset as part of your training dataset, or create a new dataset out of the test set. Including it with the training data causes it to be tokenized the same way, using the same vocabulary.
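To illustrate why the shared vocabulary matters, here is a toy sketch in plain Python. It is not fastai's implementation: build_vocab and numericalize are hypothetical helpers, and the whitespace tokenizer is a stand-in. Only the "xxunk" name is borrowed from fastai's unknown-token convention. The point is that test text has to be mapped to ids with the mapping built from the training text, so any word the model never saw falls back to the unknown token.

```python
def build_vocab(texts, unk="xxunk"):
    """Build a token -> id mapping from training texts (toy whitespace tokenizer)."""
    stoi = {unk: 0}  # id 0 is reserved for unknown tokens
    for text in texts:
        for tok in text.lower().split():
            if tok not in stoi:
                stoi[tok] = len(stoi)
    return stoi


def numericalize(text, stoi, unk="xxunk"):
    """Map text to ids with an existing vocabulary; unseen tokens become unk."""
    return [stoi.get(tok, stoi[unk]) for tok in text.lower().split()]


# The vocabulary comes from the training data only.
train_texts = ["how do I learn python", "why is the sky blue"]
stoi = build_vocab(train_texts)

# Test text reuses that vocabulary; "rust" was never seen, so it maps to id 0.
test_ids = numericalize("how do I learn rust", stoi)
print(test_ids)
```

If the test set were given its own vocabulary instead, the same id could stand for a different word than it did during training, and the predictions would be meaningless.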
However for this particular competition it’s important to read the rules. This competition is kernels only, meaning you need to run your model training process and submission from a Kaggle kernel. The competition also prohibits using external data sources. So if you followed Lesson 3 and created a model starting with a pretrained Wikitext 103 model, your model is against the competition rules.
Thanks, I was able to get it to work by including the test dataset as part of my training set at the beginning, like you said, then rerunning the whole pipeline.
data_clas = TextClasDataBunch.from_csv(path, 'small.csv', valid_pct=0.1, text_cols=1, label_cols=2, vocab=data_lm.vocab, test='test.csv')
Then after the whole training process…
preds = learn.get_preds(ds_type=DatasetType.Test, ordered=True)
And finally doing some fairly standard python/pandas stuff to get it into the appropriate format for submission.
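For anyone who lands here later, the pandas step could look roughly like the sketch below. It rests on a few assumptions that are not spelled out above: that get_preds returns a pair whose first element holds per-class probabilities of shape (n, 2), that column 1 is the probability of label 1, and that the qids are read from test.csv in the same order as the predictions. make_submission is a hypothetical helper, the 0.5 threshold is an arbitrary starting point, and the arrays are stand-in data so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def make_submission(qids, probs, threshold=0.5):
    """Build a qid/prediction frame from per-class probabilities.

    probs is assumed to have shape (n_samples, 2), with column 1 holding
    the probability of the positive class (label 1).
    """
    probs = np.asarray(probs, dtype=float)
    if probs.ndim != 2 or probs.shape[1] != 2:
        raise ValueError("expected probs with shape (n_samples, 2)")
    if len(qids) != len(probs):
        raise ValueError("qids and probs must have the same length")
    # Turn the positive-class probability into a hard 0/1 label.
    prediction = (probs[:, 1] > threshold).astype(int)
    return pd.DataFrame({"qid": list(qids), "prediction": prediction})


# Stand-in data. In the real pipeline qids would come from the test csv
# and probs from the first element of the get_preds result (assumption).
qids = ["q1", "q2", "q3"]
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]])

submission = make_submission(qids, probs)
submission.to_csv("submission.csv", index=False)  # two columns: qid, prediction
```

The threshold is worth tuning on the validation set rather than leaving at 0.5, since the cutoff that gives the best score depends on the metric and the class balance.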
But the external data sources rule does throw a wrench in the works. I actually did read that, but I assumed it referred to something like dredging up additional Quora data. Are you sure it applies to generic pre-trained language models? I’m afraid you’re right, but I wanted to make sure.
I believe it does. Per the longer rules page:
EXTERNAL DATA USE: External data not permitted, but whitelisted pre-trained models are permitted.
Where the whitelisted pre-trained models are the word vectors they link to.
Also when a competition allows outside pre-trained models, there is always a discussion thread for disclosing models.
Here’s a kernel that runs the basic lesson 3 process without pre-training.
Thanks for sharing the Kaggle kernel here. I tried to understand your kernel step by step after going through lesson 3. Being a newbie here, I am unable to understand the part of the code where you use “language_model_learner” and “text_classifier_learner”. Unlike in the lesson 3 notebook, I could not find any pretrained model, so I am a bit confused there. Any help would be appreciated. Thanks!