Wiki/lesson thread: Lesson 2

(Making a good Kaggle test set is kind of a different beast, and I will ignore that here to focus on the general case.) Dealing with time series is always tricky, but the out-of-bag score should be fine, depending on how you create the training set for your model. Make sure to sort by date, then perhaps hold out the last 20% for validation and train your model so that the most recent dates get more weight than observations from earlier times; sample weights are an argument to the fit() method. If the most recent data is similar to the data beyond your 80% cutoff, the out-of-bag score should be reasonable.
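A minimal sketch of that setup, assuming scikit-learn's RandomForestRegressor and date-sorted X/y arrays already prepared (the names and the linear weight ramp are illustrative choices, not anything from the lesson):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# X, y: hypothetical date-sorted feature matrix and target.
n_rows = len(X)
split = int(n_rows * 0.8)            # last 20% held out for validation
X_train, y_train = X[:split], y[:split]
X_valid, y_valid = X[split:], y[split:]

# Weight recent rows more heavily via fit()'s sample_weight argument;
# a linear ramp from 0.1 up to 1.0 is just one illustrative scheme.
weights = np.linspace(0.1, 1.0, num=split)

m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1)
m.fit(X_train, y_train, sample_weight=weights)

print('OOB R^2:       ', m.oob_score_)
print('Validation R^2:', m.score(X_valid, y_valid))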

It’s useful to use the out-of-bag score because it’s much faster than doing cross-validation, and it comes for free with the fit.

All that said, you are right: extrapolation with random forests is not good. They are going to predict that the future looks exactly like the most recent data in the training set. If the validation set is much different, you would in fact see the out-of-bag score not matching the validation score.

One can consider adding a feature to a random forest model that gives it a time-sensitive hook, or switch to a generalized linear model, etc.


Not sure that’s entirely true. OOB is much less ideal than a time-based validation set. In the course, for the bulldozers dataset, we always print the score on the validation set, for this reason. OOB is really just useful when you have a real shortage of data, or when you are explicitly trying to figure out whether your model’s accuracy issues are due to extrapolation problems.


OK, thanks for the correction; that makes sense. OOB will underestimate the error found with a true “future time” validation set. I like that comparison idea: compare OOB error to validation error to highlight extrapolation weaknesses. Got it, thanks.

Isn’t OOB also useful though when you don’t have a time-based set (even with lots of data)?

For sure

When I download the bulldozer dataset from Kaggle, it isn’t date-sorted, and when we create the validation set, the data isn’t split by date. So does that make the OOB score and the validation set any good?


Yup you’ll need to sort it.
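For example, a minimal sketch (assuming the raw Kaggle file is Train.csv with its saledate column, as in the lesson notebook):

import pandas as pd

# Parse the sale dates and sort oldest-to-newest before any splitting.
df_raw = pd.read_csv('Train.csv', low_memory=False, parse_dates=['saledate'])
df_raw = df_raw.sort_values('saledate').reset_index(drop=True)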

Can someone please list Kaggle competitions with similar datasets that we can practice on and submit to the leaderboard? This bulldozer challenge does not have a submit-predictions option.


I’m using Windows 10. I had this issue even after installing Graphviz. The problem was that the PATH did not include the Graphviz folder where dot.exe resides.

I did a search in Windows Explorer to locate dot.exe and added that folder to the system PATH. I had to restart the machine for the Jupyter kernel to pick up the new PATH.
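A quick way to confirm the kernel can actually see it (a minimal check, nothing fastai-specific):

import shutil

# Prints the full path to the Graphviz 'dot' executable if it is on PATH,
# otherwise None -- in which case draw_tree will still fail.
print(shutil.which('dot'))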

HTH

Thank you all for your replies!

I run Linux, so the Windows 10 solution does not apply to me in this case.

It turns out I had not activated the fastai environment:

source activate fastai

This is my first message, first of all thanks for all this great content, @jeremy!!

I found your idea of building every tree of the RF on a subsample of the original training data very interesting. I tried that approach using set_rf_samples, and it made sense that it should take more or less the same time as training the RF on a subset of the data, as you said, but it didn’t. I submitted an issue on GitHub about this.

I saw that the same thing happens in your case too: it takes 539 ms when you train on a subsample of the data and 3.49 s when you use set_rf_samples. Why does this happen?
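For anyone who wants to reproduce the comparison, a minimal sketch (assuming the fastai 0.7 structured module and the X_train/y_train from the lesson notebook):

import time
from sklearn.ensemble import RandomForestRegressor
from fastai.structured import set_rf_samples, reset_rf_samples

set_rf_samples(20000)                # each tree draws 20k random rows
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
t0 = time.time()
m.fit(X_train, y_train)
print(f'fit with set_rf_samples: {time.time() - t0:.2f}s')
reset_rf_samples()                   # restore the default bootstrap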

Thanks!!

I am running on a Kaggle kernel and getting the error “No module named fastai.structured” after running from fastai.structured import *.

Regarding proc_df(): when I look at the source code of proc_df, it looks to me as if the subset rows are randomly selected rather than the first N rows being chosen. So this set will overlap with the validation set in the provided Jupyter notebook, right?
Second: I think I remember you saying that set_rf_samples cannot be used in combination with oob_score=True. But in the provided notebook it is used in exactly that way!?
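For reference, the random-subset behaviour being described looks roughly like this (my reading of the get_sample helper that proc_df’s subset option calls in fastai.structured):

import numpy as np

def get_sample(df, n):
    # Rows are drawn at random (then kept in positional order),
    # not taken as the first n rows of df.
    idxs = sorted(np.random.permutation(len(df))[:n])
    return df.iloc[idxs].copy()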

Works on Mac as well.


No, the df returned by proc_df actually has exactly N rows, and it doesn’t really overlap the validation set; you can check by simply printing the df DataFrame after the splits.

I have been having an issue on Paperspace with the P5000 GPU instance: the notebook kernel crashes as soon as I try to load the data in ML lesson 1 or 2.

It is dying on the line:

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')

Any suggestions?

The way the data is split here:

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

Does this guarantee that the validation set will not intersect with the training set, considering that the validation set has size 12,000 > 10,000, the number of rows we discard here?
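For context, split_vals in the lesson notebook is just a positional slice, so the question comes down to which 30,000 rows proc_df’s subset option drew:

def split_vals(a, n):
    # First n rows vs. the rest -- a purely positional split.
    return a[:n].copy(), a[n:].copy()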

So, to decide which variable to split on in a random forest tree, do we consider the variable with the highest correlation with the target (see the sketch below)?

  • at the first level?
  • at middle levels?

@jeremy
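For reference, a regression tree does not use correlation at any level: at every node it greedily tries each (variable, threshold) pair and keeps the one that most reduces the size-weighted variance of the target. A minimal sketch, with a hypothetical best_split helper:

import numpy as np

def best_split(X, y):
    # Greedy search over all (feature, threshold) candidates, scoring each
    # by the size-weighted variance of y on the two resulting sides.
    best_col, best_thresh, best_score = None, None, np.inf
    for col in range(X.shape[1]):
        for thresh in np.unique(X[:, col]):
            lhs = X[:, col] <= thresh
            rhs = ~lhs
            if lhs.sum() == 0 or rhs.sum() == 0:
                continue
            score = lhs.sum() * y[lhs].var() + rhs.sum() * y[rhs].var()
            if score < best_score:
                best_col, best_thresh, best_score = col, thresh, score
    return best_col, best_thresh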

Getting the following error when trying to draw the tree. Any help?

draw_tree(m.estimators_[0], df_trn, precision=3)


TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
----> 1 draw_tree(m.estimators_[0], df_trn, precision=3)

/var/groupon/homedirs/narjunan/fastai/courses/ml1/fastai/structured.py in draw_tree(t, df, size, ratio, precision)
     29                       special_characters=True, rotate=True, precision=precision)
     30     IPython.display.display(graphviz.Source(re.sub('Tree {',
---> 31        f'Tree {{ size={size}; ratio={ratio}', s)))
     32
     33 def combine_date(years, months=1, days=1, weeks=None, hours=None, minutes=None,

~/anaconda/envs/fastai/lib/python3.6/site-packages/graphviz/files.py in __init__(self, source, filename, directory, format, engine, encoding)
    273     def __init__(self, source, filename=None, directory=None,
    274                  format=None, engine=None, encoding=File._encoding):
--> 275         super(Source, self).__init__(filename, directory, format, engine, encoding)
    276         self.source = source  #: The verbatim DOT source code string.
    277

TypeError: super(type, obj): obj must be an instance or subtype of type

A bit late, but I am using the House Prices dataset. It’s good for practice if you don’t have much experience, though I have to say that with all these techniques I am still slightly below 50% on the leaderboard. I suppose they don’t always place you in the top 100.

Hello Jeremy,

When you were going through the lesson with the students in the class, I suppose the dataset was already date-sorted, and that is why you did not explicitly sort it. But those of us who download the dataset from Kaggle should sort it, right?
