How to submit to Kaggle?


#1

How does one submit to Kaggle for example in the House Prices competition?

Someone earlier answered by linking to the 3rd lesson of DL1, but it does not help much for the machine learning category.

Once I have my model which predicts ‘SalePrice’, what do I have to do to get the predictions for each house and save them to a CSV file along with the houses’ ids?


#2

The expected format is described here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques#evaluation

So once you have the predictions do something like this:

solution = pd.DataFrame({"Id": df_test.Id, 'SalePrice': predicted_prices})
solution.to_csv('house_preds.csv', index=False)
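A quick way to sanity-check the file before uploading — `df_test` and `predicted_prices` below are hypothetical stand-ins for the real test frame and model output:

```python
import pandas as pd

# Hypothetical stand-ins for the real test frame and model predictions
df_test = pd.DataFrame({'Id': [1461, 1462, 1463]})
predicted_prices = [208500.0, 181500.0, 223500.0]

solution = pd.DataFrame({'Id': df_test.Id, 'SalePrice': predicted_prices})
solution.to_csv('house_preds.csv', index=False)

# Read it back: Kaggle expects exactly these two columns, one row per test Id
check = pd.read_csv('house_preds.csv')
print(list(check.columns))   # ['Id', 'SalePrice']
print(len(check))            # 3
```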

The resulting house_preds.csv can then be uploaded to Kaggle.

Either manually via the browser or with the command line client:
kaggle competitions submit -c house-prices-advanced-regression-techniques -f house_preds.csv -m "Hello Kaggle"

Does this help?


#3

Yes, it does help!

The thing I can’t understand though is how you make predictions on a test set. My understanding was that you did it through the m.predict function, but to do that you first need to fit your model with X and y, right? But the test set does not have y, since that is what we need to predict. So how am I supposed to pre-process the test data if it doesn’t have y in it?

I tried the proc_df function on the test set without the y argument, but it doesn’t seem to work.

I guess my question is: how do you make predictions on a test set with your model?


#4

I think you have misunderstood how RF works. The “fit” method takes your X and y and creates your model, m, from the training data. All the scoring, prediction, etc. is then done with this model.

When you need to “predict” y for a new X, you simply run the “predict” method on model “m”. So m.predict(X_test_data) should give you what you want. You don’t need to fit another model for your test data.

The general idea is that your model should be good enough for data it has “seen”, i.e. the training data, as well as “unseen” data, i.e. the test data.
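A minimal sketch of that fit/predict split with scikit-learn’s RandomForestRegressor — the data here is randomly generated just to stand in for the processed train/test frames:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the processed train/test frames (hypothetical)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = X_train[:, 0] * 3 + rng.normal(scale=0.1, size=100)
X_test = rng.normal(size=(20, 5))   # note: no y for the test set

m = RandomForestRegressor(n_estimators=40, random_state=0)
m.fit(X_train, y_train)             # the model is built from train data only
preds = m.predict(X_test)           # reuse the same fitted model on unseen X
print(preds.shape)                  # one prediction per test row
```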


#5

proc_df works as expected; just omit the second parameter and ignore the returned y.

Other things to be aware of when processing the test data:

  • Do not use train_cats but apply_cats to make sure the same category ids are used.
  • You might end up with different columns for training and test. I think the sensible thing to do here is to add/drop columns from the test set to align with the training columns. At least this is what worked for me.
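train_cats and apply_cats come from the fastai structured-data library, but the idea behind them can be sketched in plain pandas: the test frame must reuse the *training* categories so the integer codes line up. The column name and values below are made up for illustration:

```python
import pandas as pd

# Hypothetical frames with one shared categorical column
df_train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
df_test = pd.DataFrame({'Street': ['Grvl', 'Grvl']})

# train_cats-style step: turn strings into categoricals on the training frame
df_train['Street'] = df_train['Street'].astype('category')

# apply_cats-style step: reuse the training categories on the test frame
# so the same string always maps to the same integer code in both frames
df_test['Street'] = pd.Categorical(
    df_test['Street'],
    categories=df_train['Street'].cat.categories)

print(df_train['Street'].cat.codes.tolist())
print(df_test['Street'].cat.codes.tolist())
```

If the test set contains a category the training set never saw, it becomes NaN (code -1) rather than silently getting a new id.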

#6

Great!

To drop the columns in the test data which are not in the training data, do you call a function that automatically finds which ones to remove, or should I just do that in Python?


#7

It depends. After all, it is just Python/Pandas and there are a lot of ways.

You could, e.g., explicitly drop columns.
Or you diff .columns of the respective data frames and then drop.
Or if df1.columns is a subset of df2.columns you could do something like df2[df1.columns] to just use the columns of df2 which appear in df1.

It helps to get familiar with Pandas and especially its data frames.
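The options above might look like this in pandas (tiny made-up frames for illustration):

```python
import pandas as pd

# Hypothetical frames: the test frame has one extra column, 'c'
df_train = pd.DataFrame({'a': [1], 'b': [2]})
df_test = pd.DataFrame({'a': [3], 'b': [4], 'c': [5]})

# Diff the .columns of the respective frames to see what differs
extra = df_test.columns.difference(df_train.columns)
print(list(extra))                       # ['c']

# Option 1: explicitly drop the extra test columns
aligned = df_test.drop(columns=extra)

# Option 2: if df_train.columns is a subset of df_test.columns,
# just index with it to keep only those columns, in the same order
aligned2 = df_test[df_train.columns]

print(list(aligned.columns), list(aligned2.columns))
```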


(Naveed Unjum) #8

I am having some problems with proc_df

It raises the error “too many values to unpack (expected 2)”.
Could you please provide a link to your kernel so I can see what I am missing?


(Stephen Gabriel) #9

For example try df_features, df_target, nas = proc_df(main_dataframe, 'target_feature')

This should work, as proc_df also handles NaN values and returns the fill information as a third value, so you need to unpack it into a third variable. Just include the ‘nas’.

cheers


#10

Even when the target feature is not passed, proc_df still returns three values.

You need something like
df_features, _, nas = proc_df(main_dataframe)

The underscore is just the conventional name for a value you don’t intend to use.
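A minimal sketch of that unpacking pattern — `fake_proc_df` is a hypothetical stand-in for proc_df, which always returns three values:

```python
# Hypothetical stand-in for proc_df: always returns (features, y, nas),
# with y set to None when no target column is requested
def fake_proc_df(df):
    return df, None, {}

df_features, _, nas = fake_proc_df({'LotArea': [8450]})
print(nas)   # the NA-fill info you pass along when processing the test set
```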