Knowing how to clean and prepare your datasets for optimal models is one of the things we learn by doing multiple competitions. Jeremy covers some of this in lessons 2 and 3.
But if you really want to start on this competition, you might want to think over why Jeremy took the log of sale price in his example. My take is that it really simplified the explanation and the Python notebook.
In reality, the log price is required only when we need to calculate the RMSLE of our validation set. So instead of using the RMSE function:
def rmse(x,y): return math.sqrt(((x-y)**2).mean())
where x and y are log prices.
You can actually do:
def rmsle(x,y): return math.sqrt(((np.log1p(x)-np.log1p(y))**2).mean())
where x and y are normal (no log) prices.
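As a sanity check, the two give identical results; a quick sketch with made-up prices (the numbers are purely illustrative):

```python
import math
import numpy as np

def rmse(x, y): return math.sqrt(((x - y) ** 2).mean())
def rmsle(x, y): return math.sqrt(((np.log1p(x) - np.log1p(y)) ** 2).mean())

preds = np.array([100000.0, 200000.0])
actuals = np.array([110000.0, 190000.0])

# rmse on log1p-transformed prices is exactly rmsle on the raw prices
a = rmse(np.log1p(preds), np.log1p(actuals))
b = rmsle(preds, actuals)
```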
Similarly, in the Google competition you need the log of user transactions, but only during validation. So run your RF as-is, without logs. And when you need to check your predictions, compare the log of the sum of user transactions in your test and validation sets.
Has anybody else encountered this error??
You can check out this article by Rachel on validation sets.
You mean first place when you submitted your results to Kaggle? Then congrats, well done!
If not, and you're only comparing RMSE locally, then you need to be careful to have a good validation set; otherwise you may have a result that is overfitting the training set.
Most likely it’s overfitting to the training set. Try hyperparameter tuning to reduce overfitting.
I just completed Lesson 2 and I am curious about the subsampling section of the discussion. Given that bootstrapping already subsamples the data, why do we need a separate subsampling step? Especially given that Jeremy clarifies that bootstrapping doesn’t work if we use the set_rf_samples function.
My understanding is that subsampling is just another way of getting our model to run faster so we can experiment more quickly.
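For what it's worth, as I understand it fastai 0.7's set_rf_samples works by patching sklearn's forest internals; if you're on plain scikit-learn (0.22 or later), the max_samples parameter gives a similar effect. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=10_000)

# max_samples=2000: each tree fits on a bootstrap draw of 2,000 of the
# 10,000 rows, so training is faster, much like set_rf_samples(2000)
m = RandomForestRegressor(n_estimators=20, max_samples=2000, random_state=0)
m.fit(X, y)
```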
Is this the error that you are getting?
ValueError: Number of features of the model must match the input. Model n_features is 250 and input n_features is 80
If so, then you need to make sure you run one-hot encoding on your test dataset too. If you recall, you ran with max_n_cats=7, which created additional columns in your train dataset. Basically, it’s telling you that the number of columns doesn’t match between your train and test datasets. Check whether you dropped or added columns that are not in the test dataset.
Finally, if the above still doesn’t fix the columns, you can use .align() to match columns between two dataframes. Example below:

```
train_labels = train['TARGET']
train, test = train.align(test, join='inner', axis=1)
train['TARGET'] = train_labels
```
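To see what .align() does in the one-hot case, here is a tiny self-contained sketch (the color/TARGET names are made up for illustration):

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'blue', 'green'], 'TARGET': [1, 0, 1]})
test = pd.DataFrame({'color': ['red', 'blue']})  # 'green' never appears in test

train_labels = train['TARGET']
train = pd.get_dummies(train.drop(columns='TARGET'))
test = pd.get_dummies(test)

# train now has a color_green column that test lacks; an inner align
# keeps only the columns present in both frames
train, test = train.align(test, join='inner', axis=1)
train['TARGET'] = train_labels
```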
I was working on lesson2-rf_interpretation and found there were a few things I needed to do to get ggplot() to work.
First I needed to include…
from ggplot import *
After that I encountered the error when trying to run ggplot()…
ImportError: cannot import name 'Timestamp'
According to this SO post, it looks like this was caused by an outdated import statement in the ggplot source code. While I don’t know if they are the best possible fixes, following their recommended changes did help me run ggplot successfully, so I just wanted to share this for others trying to run ggplot.
Looks like we are working on the same stuff!
I read a tutorial for PDPbox and it seems like pdp.pdp_isolate() requires you to pass in the columns of the data frame. For whatever reason the notebook was missing this parameter. (Perhaps PDPbox got updated.) I passed in the additional parameter x.columns.values and it worked.
Specifically I did something like this…
p = pdp.pdp_isolate(m, x, x.columns.values, feat)
It worked! Thanks for the help!
I had this issue and didn’t find it in a search – here for reference.
While working on a separate dataset, a lot of the methods in fastai.structured resulted in SettingWithCopyWarnings, which were confusing to me since everything ran smoothly during Jeremy’s lecture.
There are a lot of ways to trigger this warning, in my case the issue was that the processed dataframe I was using was not initialized explicitly as a copy of the raw dataframe.
So something like:
df = df_raw.drop(blahblah)
versus the explicit copy:
df = df_raw.drop(blahblah).copy()
In the first case there’s no way for Pandas to know whether df is a ‘view’ or a copy of df_raw, and fastai operations like proc_df (and others) will trigger a potentially very confusing warning; the fix is completely straightforward.
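A minimal reproduction of the ambiguity and the fix, on a toy dataframe (names here are just for illustration):

```python
import pandas as pd

df_raw = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# this slice may be a view of df_raw, and writing into it would raise
# SettingWithCopyWarning because pandas can't tell what you intended:
# subset = df_raw[df_raw['a'] > 1]
# subset['b'] = 0   # warns

# an explicit .copy() removes the ambiguity; writes clearly stay local
subset = df_raw[df_raw['a'] > 1].copy()
subset['b'] = 0  # no warning, df_raw is untouched
```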
In the first lesson, Jeremy says to stop watching the lessons directly from YouTube.
Okay, but he didn’t provide any links to the lessons on fast.ai. Where can I find the lessons he’s referring to?
Could someone explain how to submit results to Kaggle?
It’s a pity that this wasn’t shown in the first lesson. How does one export the test-set results to a CSV file with the appropriate columns?
Is there a built-in function to achieve this?
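For reference, a minimal sketch of writing a submission with pandas; the Id/SalePrice column names and values here are just an example, so copy the real ones from the competition’s sample submission file:

```python
import pandas as pd

# hypothetical ids and predictions, purely for illustration
test_ids = [1461, 1462, 1463]
preds = [120000.0, 155000.0, 180000.0]

# Kaggle expects one row per test id, with the competition's exact headers
submission = pd.DataFrame({'Id': test_ids, 'SalePrice': preds})
submission.to_csv('submission.csv', index=False)
```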
My understanding is that Jeremy originally planned to make these videos official on the fast.ai website, but eventually dropped that plan, I guess. So I think the only way is to watch them from YouTube.
See this link to lesson 3 of the dl1 course: 00:32:30 Create a submission to Kaggle
@fastai1 These videos are supposed to be viewed as part of the machine learning MOOC. That hasn’t been launched yet, and I’m not sure when it will be out. However, early access to the videos is provided in the interest of time, I guess. Let’s hope it will be released shortly.
You can now discuss these videos in the #ml1 category. We’ll be launching the website for the online course early next week.
When you have several date columns in a dataframe, do you need to pass them all to the date_part function?
If so, is there a way to iterate through each of the columns to find which are of the date datatype and convert them to codes?
I have tried this on two columns of my df, but I get an error in return:
for col in sliced_list:
AttributeError: 'DataFrame' object has no attribute 'col'
Or else, if I try:
for col in sliced_list:
AttributeError: 'Index' object has no attribute 'col'
Is there an easy way to iterate through a dataframe’s columns?
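One way that seems to work is select_dtypes plus df[col] indexing; attribute access like df.col fails because pandas looks for a literal column named "col" rather than the variable’s value. A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'saledate': pd.to_datetime(['2020-01-01', '2021-06-15']),
    'shipdate': pd.to_datetime(['2020-02-01', '2021-07-01']),
    'price': [100, 200],
})

# select_dtypes finds every datetime column; use df[col], not df.col,
# when the column name is held in a variable
date_cols = df.select_dtypes(include='datetime').columns
for col in date_cols:
    df[col + '_year'] = df[col].dt.year
```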
@fastai1 please use the #ml1 category for discussing this course now.
Hey Jeremy, sorry, but I have already posted many questions on the forum you indicated, and I haven’t gotten any response for several days. This page seemed much more responsive to me.
Where should I post exactly to get an answer?