Wiki/lesson thread: Lesson 2

(Luis Ortega) #25

I’m using Windows 10. I had this issue even after installing Graphviz: the problem was that the PATH did not include the Graphviz folder where dot.exe resides.

I searched in Windows Explorer to locate dot.exe and added its folder to the system PATH. I had to restart the machine for the Jupyter kernel to pick up the new PATH.

Thank you all for your replies!

I run Linux, so the Windows 10 solution does not apply to me in this case.

It turns out I had not activated the fastai environment:

source activate fastai

(Xoel López) #27

This is my first message; first of all, thanks for all this great content, @jeremy!

I found your idea of building every tree of the RF from a subsample of the original training data very interesting. I tried that approach using set_rf_samples, and while it made sense that it should take roughly the same time as training the RF on a subset of the data, as you said, it didn’t. I submitted an issue on GitHub about this.

I saw that the same thing happens in your case too: training takes 539 ms on a subset of the data but 3.49 s with set_rf_samples. Why does this happen?
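For reference, the two setups can be reproduced with plain scikit-learn: set_rf_samples patches the per-tree bootstrap size inside sklearn, and scikit-learn >= 0.22 exposes the same idea directly through the `max_samples` parameter. The data and sizes below are synthetic, just to make the comparison concrete:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=5000)

# Approach 1: train the whole forest on one fixed subset of the data
m1 = RandomForestRegressor(n_estimators=10, n_jobs=-1)
m1.fit(X[:1000], y[:1000])

# Approach 2: every tree bootstraps its own 1000-row sample from the FULL data
# (the same idea as fastai's set_rf_samples, built into sklearn >= 0.22)
m2 = RandomForestRegressor(n_estimators=10, max_samples=1000, n_jobs=-1)
m2.fit(X, y)
```

Note the trees in approach 2 see different rows from each other, while in approach 1 they all resample from the same 1000 rows, so the timings need not be identical even though each tree is built from the same number of rows.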



(Mayank) #28

I am running on a Kaggle kernel and getting an error “No module named fastai.structured” after running from fastai.structured import *


(Christian Baumberger) #29

Regarding proc_df(): when I look at the source code of proc_df, it looks to me like the subset is randomly selected rather than the first N rows being chosen. So this subset will overlap with the validation set in the provided Jupyter notebook, right?
Second: I think I remember you saying that set_rf_samples cannot be used in combination with oob_score=True, but in the provided notebook it is used exactly that way!?


(Fadhli Ismail) #30

Works on Mac as well.


(vatsal bharti) #32

No, the df returned by proc_df contains exactly N rows, and it doesn’t actually overlap the validation set; you can check by simply printing the df DataFrame after the splits.


(David Carroll) #34

I have been having an issue on Paperspace with the P5000 GPU instance: the notebook kernel crashes as soon as I try to load the data in ML lesson 1 or 2.

It is dying on this line:

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')

Any suggestions?


(Gerges Dib) #35

The way the data is split here:

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

Does this guarantee that the validation set will not intersect with the training set, considering that the validation set has 12,000 rows, which is more than the 10,000 rows we discard here?
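One way to check is to compare indices directly. The toy frame below stands in for df_raw (the 40,000-row size is invented), and `sample()` mimics the random draw that proc_df performs when subset is given:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_raw, keeping the original index so overlap is detectable
df_raw = pd.DataFrame({'x': np.arange(40_000)})

# Random 30,000-row subsample (analogous to proc_df(..., subset=30000)),
# then keep the first 20,000 of those (analogous to split_vals(df_trn, 20000))
subset = df_raw.sample(30_000, random_state=0)
train = subset.iloc[:20_000]
valid = df_raw.iloc[-12_000:]   # last 12,000 rows as the validation set

overlap = train.index.intersection(valid.index)
print(len(overlap) > 0)  # True: the subsample is random, so leakage is possible
```

So no, discarding 10,000 rows does not guarantee a clean split when the subsample is drawn at random from the whole frame.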


(Spandan) #36

So, to decide which variable to split on in a random forest tree, do we consider the variable with the highest correlation with the target?

  • at the first level
  • at the middle levels

(Naveenan Arjunan) #37

Getting the following error when trying to draw the tree. Any help?

draw_tree(m.estimators_[0], df_trn, precision=3)

TypeError Traceback (most recent call last)
----> 1 draw_tree(m.estimators_[0], df_trn, precision=3)

/var/groupon/homedirs/narjunan/fastai/courses/ml1/fastai/ in draw_tree(t, df, size, ratio, precision)
     29         special_characters=True, rotate=True, precision=precision)
     30     IPython.display.display(graphviz.Source(re.sub('Tree {',
---> 31         f'Tree {{ size={size}; ratio={ratio}', s)))
     33 def combine_date(years, months=1, days=1, weeks=None, hours=None, minutes=None,

~/anaconda/envs/fastai/lib/python3.6/site-packages/graphviz/ in __init__(self, source, filename, directory, format, engine, encoding)
    273     def __init__(self, source, filename=None, directory=None,
    274                  format=None, engine=None, encoding=File._encoding):
--> 275         super(Source, self).__init__(filename, directory, format, engine, encoding)
    276         self.source = source  #: The verbatim DOT source code string.

TypeError: super(type, obj): obj must be an instance or subtype of type


(Jonas) #38

A bit late, but I am using the House Prices dataset. It’s good for practice if you don’t have much experience, though I have to say that even with all these techniques I am slightly below the 50% mark. I suppose it doesn’t always place you in the top 100.


(Nitin George Cherian) #39

Hello Jeremy,

When you were going through the lesson in class, I suppose the dataset was already sorted by date, which is why you did not sort it explicitly. But those of us who download the dataset from Kaggle should sort it, right?


(Nitin George Cherian) #40

I took a look at the df_raw.saleYear.head() output after add_datepart(df_raw, 'saledate') was executed and saw that the dates are not sorted. Ideally the dataset should be sorted by saledate, right?


(Ruben) #41

Yes, it should be, in order to reflect the distribution of the validation set.

Jeremy already said that above in this thread (Wiki/lesson thread: Lesson 2).

Dates are invented for illustration purposes:
Training: sales on year 1980 … 2010
Validation: sales on year 2011

As we don’t have a separate validation set, we create one by picking the last N rows from the training set.

Here is an example of how to sort (there may be a better one).
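A minimal sketch (the toy data below is invented; with the Bulldozers frame you would sort before add_datepart drops the raw saledate column):

```python
import pandas as pd

# Toy stand-in for df_raw with an unsorted saledate column
df_raw = pd.DataFrame({
    'saledate': pd.to_datetime(['2011-03-01', '1989-06-15', '2004-11-20']),
    'SalePrice': [10_000, 25_000, 18_000],
})

# Sort chronologically so the last N rows form a time-based validation set
df_raw = df_raw.sort_values('saledate').reset_index(drop=True)
print(df_raw.saledate.is_monotonic_increasing)  # True
```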