Wiki/lesson thread: Lesson 2

(Luis Ortega) #25

I’m using Windows 10. I had this issue even after installing Graphviz. The problem was that the PATH did not include the Graphviz folder where dot.exe resides.

I did a search in Windows Explorer to locate dot.exe and added its folder to the system PATH. I had to restart the machine to get the Jupyter kernel to use the new PATH.
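If you want to confirm from inside Python (e.g. from the same Jupyter kernel) that the new PATH took effect, the standard library can do the same lookup; this is just a generic check, nothing fastai-specific:

```python
import shutil

# shutil.which searches the directories on PATH, which is the same
# lookup used when Graphviz tooling tries to invoke the dot executable.
dot_path = shutil.which("dot")
if dot_path is None:
    print("dot is not on PATH - Graphviz plots will fail")
else:
    print("found dot at", dot_path)
```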



Thank you all for your replies!

I run Linux, so the Windows 10 solution does not apply to me in this case.

It turns out I had not activated the fastai environment:

source activate fastai

(Xoel López) #27

This is my first message. First of all, thanks for all this great content, @jeremy!!

I found your idea of building every tree of the RF from a subsample of the original training data very interesting. I tried that approach using set_rf_samples, and, as you said, it seemed it should take roughly the same time as training the RF on a subset of the data, but it didn’t. I submitted an issue on GitHub about this.

I saw that the same thing happens in your case too: 539 ms when you train on a subsample of the data versus 3.49 s when you use set_rf_samples. Why does this happen?
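For anyone trying to reproduce this comparison without fastai’s patch: set_rf_samples works by changing how each tree draws its bootstrap sample, and recent scikit-learn exposes roughly the same idea through the max_samples parameter. A rough sketch of the two approaches being compared, on synthetic data (the variable names are my own, not from the notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(5000, 10)
y = X.sum(axis=1) + rng.randn(5000) * 0.1

# Approach 1: fit on a fixed 1,000-row subset of the data.
subset_model = RandomForestRegressor(n_estimators=10, random_state=0)
subset_model.fit(X[:1000], y[:1000])

# Approach 2 (the set_rf_samples idea): fit on all rows, but let each
# tree bootstrap only 1,000 of them via max_samples.
sampled_model = RandomForestRegressor(n_estimators=10, random_state=0,
                                      max_samples=1000)
sampled_model.fit(X, y)
```

Timing each fit (e.g. with %time in Jupyter) lets you see whether the two really do take comparable time on your machine.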


(Mayank) #28

I am running on a Kaggle kernel and getting the error “No module named fastai.structured” after running from fastai.structured import *.

(Christian Baumberger) #29

Regarding proc_df(): when I look at the source code of proc_df, it looks to me like the data is randomly sampled rather than the first N rows being chosen. So this set will overlap with the validation set in the provided Jupyter notebook, right?
Second: I think I remember you said that set_rf_samples cannot be used in combination with oob_score=True. But in the provided notebook it is used in exactly that way!?
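If I remember the library code correctly, proc_df’s subset option goes through a helper that samples random row positions rather than taking the head of the frame. A minimal sketch of that behaviour (my own reconstruction, not the exact fastai source):

```python
import numpy as np
import pandas as pd

def sample_rows(df, n, seed=42):
    # Draw n distinct row positions at random, sorted so the sampled
    # frame preserves the original row order.
    rng = np.random.RandomState(seed)
    idxs = np.sort(rng.permutation(len(df))[:n])
    return df.iloc[idxs].copy()

df = pd.DataFrame({"a": range(100)})
sample = sample_rows(df, 10)
print(sample.index.tolist())  # random positions, not simply 0..9
```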

(Fadhli Ismail) #30

Works on Mac as well.

(vatsal bharti) #32

No, actually the df returned by proc_df has exactly N rows, and it doesn’t really overlap the validation set; you can check by simply printing the df DataFrame after the splits.
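One way to make that check concrete: if the subsetted frame keeps the original integer index, you can intersect it with the validation rows. A toy version with a made-up frame (the names here are illustrative, not the notebook’s variables):

```python
import pandas as pd

df_raw = pd.DataFrame({"SalePrice": range(50)})

# Pretend the last 12 rows are the validation set and a random sample is
# the training subset; iloc slicing keeps the original index labels.
valid = df_raw.iloc[-12:]
train_subset = df_raw.sample(n=20, random_state=0)

overlap = train_subset.index.intersection(valid.index)
print(f"{len(overlap)} overlapping rows")
```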

(David Carroll) #34

I have been having an issue on Paperspace with the P5000 GPU instance with the notebook kernel crashing as soon as I try to load the data in ML lesson 1 or 2.

It is dying on the line:

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')

Any suggestions?

(Gerges Dib) #35

The way the data is split here:

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

Does this guarantee that the validation set will not intersect with the training set, considering that the validation set has 12,000 rows, which is more than the 10,000 items we discard here?
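For reference, split_vals in the course notebooks is just a head/tail split, so with subset=30000 the first 20,000 sampled rows become the training set and the remaining 10,000 are dropped. A sketch (the helper is reproduced from memory):

```python
import numpy as np

def split_vals(a, n):
    # First n rows for one split, the rest for the other (or discard).
    return a[:n].copy(), a[n:].copy()

a = np.arange(30000)
X_train, rest = split_vals(a, 20000)
print(len(X_train), len(rest))  # 20000 10000
```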

(Spandan) #36

So, to decide which variable to split on in a random forest tree, do we consider the variable with the highest correlation with the target:

  • at the first level
  • at middle levels
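For what it’s worth, CART-style trees (what scikit-learn’s random forest uses) don’t rank variables by correlation at any level; at every node they greedily pick the split that most reduces impurity (variance, for regression). A toy single-feature split search showing that criterion (my own sketch, not library code):

```python
import numpy as np

def best_split(x, y):
    # Try each midpoint between sorted unique values of x and keep the
    # threshold that minimises the weighted variance of the two halves.
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1] + np.diff(np.unique(x)) / 2:
        left, right = y[x <= t], y[x > t]
        score = len(left) * left.var() + len(right) * right.var()
        if score < best_score:
            best_t, best_score = t, score
    return best_t

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
print(best_split(x, y))  # picks the threshold between 3 and 10
```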