How to submit to Kaggle?


#1

How does one submit to Kaggle for example in the House Prices competition?

Someone earlier answered by linking to the 3rd lesson of DL1, but it does not help much for the machine learning category.

Once I have my model which predicts ‘SalePrice’, what do I have to do to get the predictions for each house and save them to a CSV file along with the houses’ ids?


#2

The expected format is described here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques#evaluation

So once you have the predictions do something like this:

solution = pd.DataFrame({"Id": df_test.Id, 'SalePrice': predicted_prices})
solution.to_csv('house_preds.csv', index=False)
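A quick way to sanity-check the file before uploading — `df_test` and `predicted_prices` below are hypothetical stand-ins for the real test frame and model output:

```python
import pandas as pd

# Hypothetical stand-ins for the real test frame and model predictions
df_test = pd.DataFrame({'Id': [1461, 1462, 1463]})
predicted_prices = [208500.0, 181500.0, 223500.0]

solution = pd.DataFrame({'Id': df_test.Id, 'SalePrice': predicted_prices})
solution.to_csv('house_preds.csv', index=False)

# Read it back: Kaggle expects exactly these two columns, one row per test Id
check = pd.read_csv('house_preds.csv')
print(list(check.columns))   # ['Id', 'SalePrice']
print(len(check))            # 3
```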

The resulting house_preds.csv can then be uploaded to Kaggle.

Either manually via the browser or with the command line client:
kaggle competitions submit -c house-prices-advanced-regression-techniques -f house_preds.csv -m "Hello Kaggle"

Does this help?


#3

Yes, it does help!

The thing I can’t understand though is how you make predictions on a test set. My understanding was that you did it through the m.predict function, but to do that you first need to fit your model with X and y, right? But the test set does not have y, since that is what we need to predict. So how am I supposed to pre-process the test data if it doesn’t have y in it?

I tried the proc_df function on the test set without the y argument, but it doesn’t seem to work.

I guess my question is: how do you make predictions on a test set with your model?


#4

I think you have misunderstood how RF works. The “fit” method takes your X and y and creates your model, m, from the training data. All the scoring, prediction, etc. is then done with this model.

When you need to “predict” y for a new X, you simply run the “predict” method on model “m”. So m.predict(X_test_data) should give you what you want. You don’t need to fit another model for your test data.

The general idea is that your model should be good enough for data it has “seen”, i.e. the training data, as well as “unseen” data, i.e. the test data.
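A minimal sketch of that fit/predict split with scikit-learn’s RandomForestRegressor — the data here is randomly generated just to stand in for the processed train/test frames:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the processed train/test frames (hypothetical)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = X_train[:, 0] * 3 + rng.normal(scale=0.1, size=100)
X_test = rng.normal(size=(20, 5))   # note: no y for the test set

m = RandomForestRegressor(n_estimators=40, random_state=0)
m.fit(X_train, y_train)             # the model is built from train data only
preds = m.predict(X_test)           # reuse the same fitted model on unseen X
print(preds.shape)                  # one prediction per test row
```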


#5

proc_df works as expected; just omit the second parameter and ignore the returned y.

Other things to be aware of when processing the test data:

  • Do not use train_cats but apply_cats to make sure the same category ids are used.
  • You might end up with different columns for training and test. I think the sensible thing to do here is to add/drop columns from the test set to align with the training columns. At least this is what worked for me.
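train_cats and apply_cats come from the fastai structured-data library, but the idea behind them can be sketched in plain pandas: the test frame must reuse the *training* categories so the integer codes line up. The column name and values below are made up for illustration:

```python
import pandas as pd

# Hypothetical frames with one shared categorical column
df_train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
df_test = pd.DataFrame({'Street': ['Grvl', 'Grvl']})

# train_cats-style step: turn strings into categoricals on the training frame
df_train['Street'] = df_train['Street'].astype('category')

# apply_cats-style step: reuse the training categories on the test frame
# so the same string always maps to the same integer code in both frames
df_test['Street'] = pd.Categorical(
    df_test['Street'],
    categories=df_train['Street'].cat.categories)

print(df_train['Street'].cat.codes.tolist())
print(df_test['Street'].cat.codes.tolist())
```

If the test set contains a category the training set never saw, it becomes NaN (code -1) rather than silently getting a new id.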

#6

Great!

To drop the columns in the test data which are not in the training data, do you call a function that automatically finds which ones to remove, or should I just do that in Python?


#7

It depends. After all, it is just Python/Pandas and there are a lot of ways.

You could, e.g., explicitly drop columns.
Or you diff .columns of the respective data frames and then drop.
Or if df1.columns is a subset of df2.columns you could do something like df2[df1.columns] to just use the columns of df2 which appear in df1.

It helps to get familiar with Pandas and especially its data frames.
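The options above might look like this in pandas (tiny made-up frames for illustration):

```python
import pandas as pd

# Hypothetical frames: the test frame has one extra column, 'c'
df_train = pd.DataFrame({'a': [1], 'b': [2]})
df_test = pd.DataFrame({'a': [3], 'b': [4], 'c': [5]})

# Diff the .columns of the respective frames to see what differs
extra = df_test.columns.difference(df_train.columns)
print(list(extra))                       # ['c']

# Option 1: explicitly drop the extra test columns
aligned = df_test.drop(columns=extra)

# Option 2: if df_train.columns is a subset of df_test.columns,
# just index with it to keep only those columns, in the same order
aligned2 = df_test[df_train.columns]

print(list(aligned.columns), list(aligned2.columns))
```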


(Naveed Unjum) #8

I am having some problems with proc_df

It raises the error “too many values to unpack (expected 2)”.
Could you please provide a link to your kernel so I can see what I am missing?


(Stephen Gabriel) #9

For example try df_features, df_target, nas = proc_df(main_dataframe, 'target_feature')

This should work, as proc_df also handles NaN values and returns the fill information as a third value, so you need to unpack it into a third variable. Just include the ‘nas’.

cheers


#10

Even when the target feature is not passed, proc_df still returns three values.

You need something like
df_features, _, nas = proc_df(main_dataframe)

The underscore is just the conventional name for a value you don’t intend to use.
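A minimal sketch of that unpacking pattern — `fake_proc_df` is a hypothetical stand-in for proc_df, which always returns three values:

```python
# Hypothetical stand-in for proc_df: always returns (features, y, nas),
# with y set to None when no target column is requested
def fake_proc_df(df):
    return df, None, {}

df_features, _, nas = fake_proc_df({'LotArea': [8450]})
print(nas)   # the NA-fill info you pass along when processing the test set
```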