Knowing how to clean and prepare your datasets for optimal models is one of the things we learn by doing multiple competitions. Jeremy covers some of this in lessons 2 and 3.
But if you really want to start on this competition, you might want to think over why Jeremy took the log of sale price in his example. My take is that it really simplified the explanation and the Python notebook.
In reality, the log price is required only when we need to calculate the RMSLE of our validation set. So instead of using the RMSE function:
def rmse(x,y): return math.sqrt(((x-y)**2).mean())
where x and y are log prices.
You can actually do:
def rmsle(x,y): return math.sqrt(((np.log1p(x)-np.log1p(y))**2).mean())
where x and y are normal (no log) prices.
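As a sanity check, the two give identical results; a quick sketch with made-up prices (the numbers are purely illustrative):

```python
import math
import numpy as np

def rmse(x, y): return math.sqrt(((x - y) ** 2).mean())
def rmsle(x, y): return math.sqrt(((np.log1p(x) - np.log1p(y)) ** 2).mean())

preds = np.array([100000.0, 200000.0])
actuals = np.array([110000.0, 190000.0])

# rmse on log1p-transformed prices is exactly rmsle on the raw prices
a = rmse(np.log1p(preds), np.log1p(actuals))
b = rmsle(preds, actuals)
```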
Similarly, in the Google competition you need the log of user transactions, but only during validation. So run your RF as-is, without logs. And when you need to check your predictions, compare the log of the sum of user transactions in your test and validation sets.
Has anybody else encountered this error??
You can check out this article by Rachel on validation sets.
You mean first place when you submitted your results to Kaggle? Then congrats, well done!
If not, and you're only comparing RMSE locally, then you need to be careful to have a good validation set; otherwise you may have a result that is overfitting the training set.
Most likely it’s overfitting to the training set. Try hyperparameter tuning to reduce overfitting.
I just completed Lesson 2 and I am curious about the subsampling section of the discussion. Given that bootstrapping already subsamples the data, why do we need a separate subsampling step? Especially given that Jeremy clarifies that bootstrapping doesn’t work if we use the set_rf_samples function.
My understanding is that subsampling is just another way of getting our model to run faster so we can experiment more quickly.
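For what it's worth, as I understand it fastai 0.7's set_rf_samples works by patching sklearn's forest internals; if you're on plain scikit-learn (0.22 or later), the max_samples parameter gives a similar effect. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=10_000)

# max_samples=2000: each tree fits on a bootstrap draw of 2,000 of the
# 10,000 rows, so training is faster, much like set_rf_samples(2000)
m = RandomForestRegressor(n_estimators=20, max_samples=2000, random_state=0)
m.fit(X, y)
```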
Is this the error that you are getting?
ValueError: Number of features of the model must match the input. Model n_features is 250 and input n_features is 80
If so, then you need to make sure you run one-hot encoding on your test dataset too. If you recall, you ran with max_n_cats=7, which created additional columns in your train dataset. Basically, it’s telling you that the number of columns doesn’t match between your train and test datasets. Check whether you dropped or added columns that are not in the test dataset.
Finally, if the above still doesn’t fix the columns, you can use .align() to match columns between two dataframes. Example below:

```
train_labels = train['TARGET']
train, test = train.align(test, join='inner', axis=1)
train['TARGET'] = train_labels
```
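To see what .align() does in the one-hot case, here is a tiny self-contained sketch (the color/TARGET names are made up for illustration):

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'blue', 'green'], 'TARGET': [1, 0, 1]})
test = pd.DataFrame({'color': ['red', 'blue']})  # 'green' never appears in test

train_labels = train['TARGET']
train = pd.get_dummies(train.drop(columns='TARGET'))
test = pd.get_dummies(test)

# train now has a color_green column that test lacks; an inner align
# keeps only the columns present in both frames
train, test = train.align(test, join='inner', axis=1)
train['TARGET'] = train_labels
```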
I was working on lesson2-rf_interpretation and found there were a few things I needed to do to get ggplot() to work.
First I needed to include…
from ggplot import *
After that I encountered the error when trying to run ggplot()…
ImportError: cannot import name 'Timestamp'
According to this SO post, it looks like this was caused by an outdated import statement in the ggplot source code. While I don’t know if they are the best possible fixes, following their recommended changes did help me run ggplot successfully, so I just wanted to share this for others trying to run ggplot.
Looks like we are working on the same stuff!
I read a tutorial for PDPbox and it seems like pdp.pdp_isolate() requires you to pass in the columns of the data frame. For whatever reason the notebook was missing this parameter. (Perhaps PDPbox got updated.) I passed in the additional parameter x.columns.values and it worked.
Specifically I did something like this…
p = pdp.pdp_isolate(m, x, x.columns.values, feat)
It worked! Thanks for the help!
I had this issue and didn’t find it in a search – here for reference.
While working on a separate dataset, a lot of the methods in fastai.structured resulted in SettingWithCopyWarnings, which were confusing to me since everything ran smoothly during Jeremy’s lecture.
There are a lot of ways to trigger this warning, in my case the issue was that the processed dataframe I was using was not initialized explicitly as a copy of the raw dataframe.
So something like:
df = df_raw.drop(blahblah)
versus the explicit copy:
df = df_raw.drop(blahblah).copy()
In the first case there’s no way for Pandas to know whether df is a ‘view’ or a copy of df_raw, and fastai operations like proc_df (and others) will trigger a potentially very confusing warning; the fix is completely straightforward.
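A minimal reproduction of the ambiguity and the fix, on a toy dataframe (names here are just for illustration):

```python
import pandas as pd

df_raw = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# this slice may be a view of df_raw, and writing into it would raise
# SettingWithCopyWarning because pandas can't tell what you intended:
# subset = df_raw[df_raw['a'] > 1]
# subset['b'] = 0   # warns

# an explicit .copy() removes the ambiguity; writes clearly stay local
subset = df_raw[df_raw['a'] > 1].copy()
subset['b'] = 0  # no warning, df_raw is untouched
```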
In the first lesson, Jeremy says to stop watching the lessons directly from YouTube.
Okay, but he didn’t provide any links to the lessons on fast.ai. Where can I find the lessons he’s referring to?
Could someone explain how to submit results to Kaggle?
It’s a pity that this wasn’t shown in the first lesson. How does one export the test-set results to a CSV file with the appropriate columns?
Is there a built-in function to achieve this?
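For reference, a minimal sketch of writing a submission with pandas; the Id/SalePrice column names and values here are just an example, so copy the real ones from the competition’s sample submission file:

```python
import pandas as pd

# hypothetical ids and predictions, purely for illustration
test_ids = [1461, 1462, 1463]
preds = [120000.0, 155000.0, 180000.0]

# Kaggle expects one row per test id, with the competition's exact headers
submission = pd.DataFrame({'Id': test_ids, 'SalePrice': preds})
submission.to_csv('submission.csv', index=False)
```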
My understanding is that Jeremy originally planned to make these videos official on the fast.ai website, but eventually dropped that plan, I guess. So I think the only way is to watch them from YouTube.
See this link to lesson 3 of the dl1 course: 00:32:30 Create a submission to Kaggle
@fastai1 These videos are supposed to be viewed as part of the machine learning MOOC. That hasn’t been launched yet, and I’m not sure when it will be out. However, early access to the videos is provided in the interest of time, I guess. Let’s hope it will be released shortly.
You can now discuss these videos in the #ml1 category. We’ll be launching the website for the online course early next week.
When you have several date columns in a dataframe, do you need to pass them all to the date_part function?
If so, is there a way to iterate through each of the columns to find which are of the date datatype and convert them to codes?
I have tried this on two columns of my df, but I get an error in return:
for col in sliced_list:
AttributeError: 'DataFrame' object has no attribute 'col'
Or else, if I try:
for col in sliced_list:
AttributeError: 'Index' object has no attribute 'col'
Is there an easy way to iterate through a dataframe’s columns?
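One way that seems to work is select_dtypes plus df[col] indexing; attribute access like df.col fails because pandas looks for a literal column named "col" rather than the variable’s value. A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'saledate': pd.to_datetime(['2020-01-01', '2021-06-15']),
    'shipdate': pd.to_datetime(['2020-02-01', '2021-07-01']),
    'price': [100, 200],
})

# select_dtypes finds every datetime column; use df[col], not df.col,
# when the column name is held in a variable
date_cols = df.select_dtypes(include='datetime').columns
for col in date_cols:
    df[col + '_year'] = df[col].dt.year
```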
@fastai1 please use the #ml1 category for discussing this course now.
Hey Jeremy, sorry, but I have already posted many questions on the forum you indicated, and I haven’t gotten any response for several days. This page seemed much more responsive to me.
Where should I post exactly to get an answer?