Wiki/lesson thread: Lesson 2

(Luis Ortega) #25

I’m using Windows 10. I had this issue even after installing Graphviz. The problem was that the PATH did not include the Graphviz folder where dot.exe resides.

I did a search in Windows Explorer to locate dot.exe and added that folder to the system PATH. I had to restart the machine for the Jupyter kernel to pick up the new PATH.
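A quick way to check whether the notebook can actually see the Graphviz binary (a sketch; the folder below is just a typical default install location, adjust it to wherever dot.exe ended up on your machine):

import shutil, os

print(shutil.which('dot'))   # full path to dot.exe if it is on PATH, otherwise None

# If it prints None, add the folder for the current session and check again
os.environ['PATH'] += os.pathsep + r'C:\Program Files\Graphviz\bin'   # assumed default location
print(shutil.which('dot'))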




Thank you all for your replies!

I run Linux, so the Windows 10 solution does not apply to me in this case.

It turns out I had not activated the fastai environment:

source activate fastai

(Xoel López) #27

This is my first message. First of all, thanks for all this great content, @jeremy!

I found your idea of building every tree of the RF from a subsample of the original training data very interesting. I tried that approach using set_rf_samples; as you said, it should take more or less the same time as training the RF on a subset of the data, but it didn’t. I submitted an issue on GitHub about this.

I saw that the same thing happens in your case too: 539 ms when you train on a subset of the data versus 3.49 s when you use set_rf_samples. Why does this happen?
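For anyone comparing, the two timings come from snippets roughly like these (a sketch assuming df_raw, nas, X_train and y_train already exist as earlier in the Lesson 2 notebook; absolute times will vary by machine):

from sklearn.ensemble import RandomForestRegressor
from fastai.structured import proc_df, set_rf_samples, reset_rf_samples

def split_vals(a, n): return a[:n].copy(), a[n:].copy()   # the notebook's helper

# Approach 1: physically shrink the data with proc_df's subset, then fit on 20,000 rows
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
X_small, _ = split_vals(df_trn, 20000)
y_small, _ = split_vals(y_trn, 20000)
m = RandomForestRegressor(n_jobs=-1)
m.fit(X_small, y_small)          # the ~539 ms case mentioned above

# Approach 2: keep the full training data, but give each tree a 20,000-row sample
set_rf_samples(20000)
m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)          # the ~3.49 s case mentioned above
reset_rf_samples()               # restore sklearn's default bootstrap afterwards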



(Mayank) #28

I am running on a Kaggle kernel and getting the error “No module named fastai.structured” after running from fastai.structured import *
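If the kernel has a recent fastai installed, the structured module no longer exists there; one possible fix (an assumption on my part, not an official recommendation) is to pin the old library release the ML course notebooks were written against:

# In a Kaggle/Jupyter cell: install the old release that still ships fastai.structured,
# then restart the kernel before importing.
!pip install fastai==0.7.0

from fastai.imports import *
from fastai.structured import *   # note the leading "from"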


(Christian Baumberger) #29

Regarding proc_df(): when I look at the source code of proc_df, it looks to me like the subset rows are selected randomly rather than the first N rows being chosen. So this set will overlap with the validation set in the provided Jupyter notebook, right?
Second: I think I remember you said that set_rf_samples cannot be used in combination with oob_score=True. But in the provided notebook it is used in exactly that way!?


(Fadhli Ismail) #30

Works on Mac as well.


(vatsal bharti) #32

No, the df returned by proc_df actually contains exactly N rows, and it doesn’t really overlap the validation set; you can check by simply printing the df DataFrame after the splits.
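One way to do that check explicitly (a sketch assuming df_raw still has its default integer index, so the original row labels survive the subsetting, and using the notebook's split_vals helper):

# Rebuild the subsampled training data as in the notebook
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
X_train, _ = split_vals(df_trn, 20000)

# The validation set is the last n_valid rows of the full df_raw
n_valid = 12000
valid_idx = df_raw.index[-n_valid:]

# 0 means the subsample shares no rows with the validation set
print(len(X_train.index.intersection(valid_idx)))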


(David Carroll) #34

I have been having an issue on Paperspace with the P5000 GPU instance with the notebook kernel crashing as soon as I try to load the data in ML lesson 1 or 2.

It is dying on the line:

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')

Any suggestions?


(Gerges Dib) #35

The way the data is split here:

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

Does this guarantee that the validation set will not intersect with the training set, considering that the validation set has size 12,000 > 10,000, the number of rows we discard here?


(Spandan) #36

So, to decide which variable to split on in a random forest tree, do we consider the variable with the highest correlation with the target value?

  • at the first level
  • middle levels
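For context, here is a toy sketch (my own, not code from the lecture) of how a regression tree actually chooses a split at every level: it greedily tries every variable and every threshold and keeps the pair that gives the lowest weighted MSE of the two children, rather than looking at correlation with the target.

import numpy as np

def best_split(X, y):
    """Greedy split search for one regression-tree node: try every column and every
    threshold, keep the split with the lowest weighted MSE of the two children."""
    best = (None, None, np.inf)            # (column, threshold, score)
    for col in range(X.shape[1]):
        for thresh in np.unique(X[:, col]):
            left, right = y[X[:, col] <= thresh], y[X[:, col] > thresh]
            if len(left) == 0 or len(right) == 0:
                continue
            # Weighted MSE (variance) of the two children; lower is better
            score = (len(left) * left.var() + len(right) * right.var()) / len(y)
            if score < best[2]:
                best = (col, thresh, score)
    return best

# Tiny made-up example with two features and a continuous target
X = np.array([[1., 10.], [2., 9.], [3., 2.], [4., 1.]])
y = np.array([1., 1.2, 3.0, 3.1])
print(best_split(X, y))   # prints the (column, threshold, score) of the best split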

(Naveenan Arjunan) #37

Getting the following error when trying to draw the tree. Any help?

draw_tree(m.estimators_[0], df_trn, precision=3)

TypeError                                 Traceback (most recent call last)
----> 1 draw_tree(m.estimators_[0], df_trn, precision=3)

/var/groupon/homedirs/narjunan/fastai/courses/ml1/fastai/ in draw_tree(t, df, size, ratio, precision)
     29                       special_characters=True, rotate=True, precision=precision)
     30     IPython.display.display(graphviz.Source(re.sub('Tree {',
---> 31        f'Tree {{ size={size}; ratio={ratio}', s)))
     33 def combine_date(years, months=1, days=1, weeks=None, hours=None, minutes=None,

~/anaconda/envs/fastai/lib/python3.6/site-packages/graphviz/ in __init__(self, source, filename, directory, format, engine, encoding)
    273     def __init__(self, source, filename=None, directory=None,
    274                  format=None, engine=None, encoding=File._encoding):
--> 275         super(Source, self).__init__(filename, directory, format, engine, encoding)
    276         self.source = source  #: The verbatim DOT source code string.

TypeError: super(type, obj): obj must be an instance or subtype of type


(Jonas) #38

A bit late, but I am using the House Prices dataset. It’s good for practice if you don’t have much experience, though I have to say that with all these techniques I am slightly below 50%. I suppose it doesn’t always place you in the top 100.


(Nitin George Cherian) #39

Hello Jeremy,

When you were going through the lesson with the students in class, I suppose the dataset was already sorted by date, and that is why you did not explicitly sort it. But those of us who download the dataset from Kaggle should sort it, right?


(Nitin George Cherian) #40

I took a look at the df_raw.saleYear.head() output after add_datepart(df_raw, 'saledate') was executed and saw that the dates are not sorted. Ideally the dataset should be sorted by saledate, right?


(Ruben) #41

Yes, it should be, in order to reflect the distribution of the validation set.

Jeremy already said that above in this thread (Wiki/lesson thread: Lesson 2).

Dates are invented for illustration purposes:
Training: sales in years 1980 … 2010
Validation: sales in year 2011

As we don’t have a separate validation set, we create one by picking the last N rows from the training set.

Here is an example of how to sort (there may be a better one).
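A minimal sketch of the idea (my own), assuming df_raw still has the raw saledate column, so run it before add_datepart:

# Sort chronologically so that split_vals' "last n_valid rows" really are the most
# recent sales, then rebuild a clean integer index.
df_raw = df_raw.sort_values(by='saledate').reset_index(drop=True)

# Only after sorting, expand the date into its component columns
add_datepart(df_raw, 'saledate')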


(Robyn Leigh Smith) #42

Please would you post the source code for draw_tree.
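In case it helps, the version in the old fastai ml1 structured.py looks roughly like this (reconstructed from the traceback a few posts up, so treat it as approximate and check the repo for the exact code):

from sklearn.tree import export_graphviz
import IPython, graphviz, re

def draw_tree(t, df, size=10, ratio=0.6, precision=0):
    """ Draws a representation of a random forest tree in IPython. """
    s = export_graphviz(t, out_file=None, feature_names=df.columns, filled=True,
                        special_characters=True, rotate=True, precision=precision)
    IPython.display.display(graphviz.Source(re.sub('Tree {',
                            f'Tree {{ size={size}; ratio={ratio}', s)))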


(Andrei Stoica) #44

Hi guys,

First of all I would like to thank @jeremy for the great lectures.
While following the part of the Jupyter notebook that corresponds to Lesson 2, I encountered quite a significant difference between my results and those obtained by Jeremy in the video.
More specifically, when running the two snippets in the Speeding things up section, the output of the print_score() function in my case is about [0.11, 0.35, 0.97, 0.77] (Jeremy got ~ [0.11, 0.27, 0.97, 0.85]). I ran it multiple times in the last day, and I even redownloaded the repository to make sure I have a clean version of the notebook, but the results are the same. My problem is that the RMSE for the validation set is much bigger than in Jeremy’s case (~ 0.27), whereas the R2 for the same validation set is smaller (Jeremy got ~ 0.85). I am pretty sure I haven’t messed up with the validation set, but I am unable to understand why there are such significant differences. For the base model (so when using the whole training set with 389125 rows), my results are very similar to Jeremy’s (~ [0.09, 0.24, 0.97, 0.89]).
The only difference I could see between the version of the notebook I am running and the one that appears in the video is that in my case the first line in this section is df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas), while in the video it is df_trn, y_trn = proc_df(df_raw, 'SalePrice', subset=30000) (this latter version gives an error when I try to run it).
I should also mention that the numbers change only slightly when increasing the number of estimators, but they are still far away from 0.27 and 0.85, respectively.
When looking at several kernels (all from the last 4-5 months) on the Kaggle forum, I saw that some of them have values closer to mine, whereas others are more in line with Jeremy’s.
I would really appreciate any help or comments you can give on this matter. Thanks a lot.



(Peter Koman) #45

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

If subset takes the first 30k rows, is there really any need to do split_vals afterwards?