Dear all, dear Jeremy,
I just finished lesson 4 of the course, and it’s not clear to me how the expected depth of a random forest is calculated, which was shown in the video (around the 4:00 mark) to be log2(n).
It would be great if somebody could provide any pointers or a short demonstration of how to derive that.
Thank you very much!
@gabrielfior, My understanding is this. If you have 256 items in your dataset, and you build a tree in such a way that one item ends up in each leaf node at the bottom, then you’ll need 256 leaf nodes at the bottom. So you have to have 8 levels of splits, each one doubling the level before (as 2^8=256). In general, 2^k = number of leaf nodes means k = log2(n), the number of levels. Does that help?
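A quick way to check the arithmetic (the function name here is just for illustration — it isn’t from the course code):

```python
import math

# Depth of a balanced binary tree that isolates each sample in its own
# leaf: with n leaf nodes you need log2(n) doublings from the root.
def expected_depth(n_samples):
    """Smallest k such that 2**k leaves can hold n_samples items."""
    return math.ceil(math.log2(n_samples))

print(expected_depth(256))   # 8, since 2**8 == 256
print(expected_depth(1000))  # 10, since 2**10 == 1024 >= 1000
```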
Hi Jeremy, thanks for posting these videos. I have a few questions:
- I’ve watched lessons 1-8. Should I continue watching lessons 9-12 if I intend to watch Practical Deep Learning For Coders anyway?
- In lesson 8, 04:40, you said that random forests would not work well for the Ecuadorian-based grocery competition because they don’t extrapolate.
It seems like you were implying that deep learning would work better. Yet, my impression was that deep learning is for unstructured data; random forest for structured data.
Am I misunderstanding something?
- If we have an unbalanced data set but not too unbalanced (maybe 66.66% yes, 33.33% no), is it still important during bootstrap sampling to sample with weights (i.e. sample no twice as often as yes)?
- Related to the previous question, is it possible to sample with weights using set_rf_samples()?
- You mentioned that you would talk more about extrapolation. Is it just removing time related independent variables as covered in lesson 5 or is there more to it?
- You mentioned in lesson 2, 1:18:40, that oob_scores are not compatible with set_rf_samples(). Why not? Does that mean we should not use oob_scores with set_rf_samples() at all?
Again, thanks for posting the videos. They’re really helpful.
Hi, I hope this clears things up a little bit.
- As Jeremy mentioned, the course builds things up from simple logistic regression, SGD, ridge, and lasso, all the way up to a fully connected NN. If you haven’t done this stuff before, I think watching lessons 9-12 is really helpful.
- I’m not sure how to answer the first part of the question, but we used LightGBM and it worked well in that competition. For the second part: deep learning still works for structured data, e.g. using embeddings to encode categorical features into dense vectors. If you watch DL1, lesson 3 (if I remember correctly), you’ll see what I mean. You can also apply DL to time-series problems. Conversely, random forests can still work on unstructured data, as long as we can represent the features for that kind of data well. After all, they are all machine learning tools. However, DL is better at unstructured data.
- We have to try. It really depends on the problem.
- I think it is possible.
- Still waiting for more on this
- I don’t know either.
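As a rough illustration of the embedding idea mentioned above — a minimal sketch, not fastai’s implementation: each categorical level indexes a row of a dense matrix, and in a real network those rows would be learned by backprop.

```python
import numpy as np

# Toy embedding lookup: each categorical level (say, a store id) maps
# to a dense vector. Here the matrix is random; in training it would
# be a learned parameter.
n_categories, emb_dim = 5, 3
rng = np.random.default_rng(0)
embedding = rng.normal(size=(n_categories, emb_dim))

store_ids = np.array([0, 2, 2, 4])   # a batch of categorical values
dense = embedding[store_ids]         # shape (4, 3): dense features
print(dense.shape)
```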
At the end of lesson 7, Jeremy suggested we try writing our own implementation of feature importance.
I tried to do it myself, and I’ve come up with this (naive) implementation: https://gist.github.com/Polegar22/f9019bf80803758a6b1323217d31a99a.
But the results I get from my code are quite different from the ones I get with sklearn, especially for Coupler_System.
What am I missing?
It turns out that sklearn uses gini importance, not permutation importance - which is the opposite of what I said in class. We just wrote an article on this: http://parrt.cs.usfca.edu/doc/rf-importance/index.html
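For anyone comparing the two approaches, here is a minimal sketch of permutation importance — assuming your model exposes a `score(X, y)` method as sklearn estimators do; the function and the toy model below are illustrative, not the fastai or sklearn code:

```python
import numpy as np

# Permutation importance sketch: for each column, shuffle it, re-score
# the model, and record how far the score drops from the baseline.
def permutation_importance(model, X, y, n_repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    importances = []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, col])  # destroy this column's link to y
            drops.append(baseline - model.score(X_perm, y))
        importances.append(float(np.mean(drops)))
    return importances

# Toy check: a "model" that predicts y as column 0, so only col 0 matters.
class IdentityModel:
    def score(self, X, y):
        return 1.0 - float(np.mean(np.abs(X[:, 0] - y)))

X = np.column_stack([np.arange(100.0), np.zeros(100)])
imp = permutation_importance(IdentityModel(), X, np.arange(100.0))
print(imp)  # column 0 importance is large; column 1 is ~0
```

Gini importance, by contrast, is accumulated from impurity decreases during training, which is why the two rankings can disagree.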
If it’s a really large dataset, then it’ll be a really large OOB sample, so you’ll be spending all your time waiting for that. Which makes the RF sampling rather pointless.
So the oob_score is actually still valid? It’s just slow?
Yes that’s right.
Funny - I was just writing a Medium post about feature importance (with real estate prices as the example, no less!) and, without looking at the fast.ai code, figured fast.ai used permutation importance as described in lesson 4 of the course. Are you planning to add your permutation importance function to fast.ai (structured.py)? I see you’ve got a function in the linked article.
In a somewhat related query:
What representation of categorical variables is used to calculate Spearman’s r (rank correlation) for the hierarchical clustering you do in lesson 4? It seems like correlations on categorical features that are not ordinal would behave badly. Is this not a concern?
I have gone through lectures 1 and 2 so far where Jeremy covers Random Forest with Kaggle competition example of predicting auction sales price of Bulldozers.
The lecture inspired me to try RandomForest to predict Burger sales for my friend’s store.
I have 3 years’ worth of daily sales data, effectively 1095 rows. Each row has 2 fields: Date and SalesAmount.
When I tried the default RandomForest notebook (similar to lesson1-rf) with a 1000-row training / 95-row validation split on this small dataset, the RMSE on the validation set is way too high and the R² is way too low:
[0.08404532792474446, 0.25379158877751373, 0.9395630247323464, 0.10681821420504334]
Here are my questions:
Is random forest a reasonable algorithm to try when the dataset is small (about 1000 rows) and you essentially have time-series data, as is the case here? If so, what else could I do to improve the numbers?
Are there other ML algorithms better suited to predicting the sales? I was reading about ARIMA models online and am going to try that. Would ARIMA be able to capture the seasonality aspect? What would be good values for ARIMA(p, d, q) if this is a reasonable approach?
Any pointers on how to proceed would be super helpful!
Two things I’ll suggest:
- Have you used the add_datepart function from fast.ai to break the date into a bunch of variables that can help you capture things like seasonality?
- In video 5, Jeremy talks about extrapolation, starting here. Have you followed along with that?
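If you don’t have fastai handy, the same kind of date expansion can be sketched with just the standard library — the field names below only mirror add_datepart’s output, and the function itself is illustrative:

```python
from datetime import date

# Break a date into model-ready features, similar in spirit to
# fastai's add_datepart (names are illustrative only).
def date_features(d):
    return {
        "Year": d.year,
        "Month": d.month,
        "Day": d.day,
        "Dayofweek": d.weekday(),        # 0 = Monday
        "Dayofyear": d.timetuple().tm_yday,
        "Is_month_start": d.day == 1,
        "Is_year_start": d.month == 1 and d.day == 1,
        "Elapsed": d.toordinal(),        # monotone "time passing" feature
    }

print(date_features(date(2018, 12, 25)))
```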
- Yes, I did use the add_datepart function, which added the following date-related fields:
saleYear saleMonth saleWeek saleDay saleDayofweek saleDayofyear saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start saleElapsed
- I haven’t tried the extrapolation, will give it a shot.
Thanks for pointers!
No problem. Another idea would be to add features. You can get feature ideas from here:
For example, temperature, holidays, google trends.
Hi all, I am a backend coder by profession. Looking at the advances in AI and ML, I am really intrigued. Someone suggested I look at the fast.ai lessons to get started. Looking at the courses, I found this one and “Deep Learning for Coders”. What would be the best way to start? Should I go for this set of lessons, or DL1 instead?
You can do one or both, in either order or simultaneously – but if you are brand new to machine learning in general and deep learning in particular, my personal recommendation is to take the ML course first, as it provides some basic grounding in training vs. validation and other universal concepts.
In lesson 3, for the grocery competition, Jeremy turns the sales into:
df_all.unit_sales = np.log1p(np.clip(df_all.unit_sales, 0, None))
The np.clip is supposed to remove negative sales and treat them as zero, as per the competition. Checking the competition’s data description, it says:
Negative values of unit_sales represent returns of that particular item.
But it doesn’t ask us to change the negative sale values to zeroes.
And wouldn’t doing this change our prediction too?
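For reference, here is what that line does to a few sample values (the input values are made up for illustration) — note the -3 return becomes indistinguishable from a zero-sales day:

```python
import numpy as np

# Negative unit_sales (returns) are clipped to 0 before the log1p
# transform, since the competition metric only scores non-negative
# predictions.
unit_sales = np.array([-3.0, 0.0, 1.0, 9.0])
transformed = np.log1p(np.clip(unit_sales, 0, None))
print(transformed)  # [0.       0.       0.693... 2.302...]
```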
When trying to draw the decision tree:
I got an error: CalledProcessError: Command '['dot', '-Tsvg']' returned non-zero exit status 1
I ran brew install graphviz to get the latest version, and my “dot” is:
dot - graphviz version 2.40.1 (20161225.0304)
Anyone know what I can do to solve this?
Has anyone seen a great article on this topic? I understand why we want to up-sample the minority class, but I haven’t seen it done on a problem where I can follow along and get more than just the theoretical ideas.
This problem comes up so often that becoming a practitioner on this subject would be worthwhile, I believe.
Thanks in advance