If you are using RandomForest, maybe you can try oversampling the minority class until the distribution is balanced? (I remember Jeremy has talked about this trick, but I haven’t used it before.)
I guess (not sure) it won’t help per se, because for a random forest it’s still the same information, whether it sees a single copy or multiple copies of each row?
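For what it’s worth, the oversampling trick can be sketched with sklearn.utils.resample; the toy data below is made up purely to show the mechanics:

```python
import numpy as np
from sklearn.utils import resample

# toy imbalanced dataset: 90 rows of class 0, 10 rows of class 1
X = np.arange(100).reshape(-1, 1).astype(float)
y = np.array([0] * 90 + [1] * 10)

# oversample the minority class (with replacement) until it matches
# the majority class size
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=90, random_state=42)

X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
```

Whether duplicated rows actually help a random forest is the open question above; mainly they change how often minority rows appear in each tree’s bootstrap sample.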
First of all, thanks for this series Jeremy. It’s excellent. I have a question about the second lecture that I hope someone can help me with.
In the fastai library, in the structured.py file, there is this line among the imports at the top:
from sklearn.ensemble import forest
I’ve looked through the sklearn API docs but I’m unable to find forest.
Could someone please help me with this? All I can gather is that set_rf_samples changes something inside sklearn itself, and I’d like to understand how this works and whether it’s something we can use with other algorithms, like SVM.
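If I remember correctly, set_rf_samples doesn’t change sklearn’s public API; it monkey-patches a private helper in that forest module (forest._generate_sample_indices) so each tree bootstraps n rows instead of the full training set. A sketch of the sampling function it swaps in (the patching itself depends on your sklearn version, since the module was later renamed to _forest):

```python
import numpy as np
from sklearn.utils import check_random_state

def sample_indices(random_state, n_samples, n):
    # draw n row indices with replacement from range(n_samples),
    # rather than sklearn's default of n_samples indices per tree
    return check_random_state(random_state).randint(0, n_samples, n)

# e.g. each tree would see 20,000 of 100,000 rows
idx = sample_indices(42, 100_000, 20_000)
```

Because it patches a module-level private function specific to sklearn’s forest implementation, the trick wouldn’t transfer to something like SVM, which doesn’t draw per-estimator bootstrap samples.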
Hi, also in the second lesson, while executing the cell:
draw_tree(m.estimators_[0], df_trn, precision = 3)
I get the following error:
failed to execute ['dot', '-Tsvg'],
make sure the Graphviz executables are on your systems' PATH
I thought maybe graphviz isn’t installed, but import graphviz
works in my fastai environment. I’m running this code on my Windows 10 laptop. Also, since the whole thing is installed in the fastai conda environment, I don’t know if I should be messing around with the PATH.
See envs\fastai\Lib\site-packages\sklearn\ensemble\forest.py for its documentation (quoted below) and what it imports.
"""Forest of trees-based ensemble methods

Those methods include random forests and extremely randomized trees.

The module structure is the following:

- The BaseForest base class implements a common fit method for all the
  estimators in the module. The fit method of the base Forest class calls
  the fit method of each sub-estimator on random samples (with replacement,
  a.k.a. bootstrap) of the training set. The init of the sub-estimator is
  further delegated to the BaseEnsemble constructor.

- The ForestClassifier and ForestRegressor base classes further implement
  the prediction logic by computing an average of the predicted outcomes
  of the sub-estimators.

- The RandomForestClassifier and RandomForestRegressor derived classes
  provide the user with concrete implementations of the forest ensemble
  method using classical, deterministic DecisionTreeClassifier and
  DecisionTreeRegressor as sub-estimator implementations.

- The ExtraTreesClassifier and ExtraTreesRegressor derived classes
  provide the user with concrete implementations of the forest ensemble
  method using the extremely randomized trees ExtraTreeClassifier and
  ExtraTreeRegressor as sub-estimator implementations.

Single and multi-output problems are both handled.
"""
I just came across TPOT: https://epistasislab.github.io/tpot/
It looks like it runs a bunch of machine learning techniques from scikit-learn and tries to identify the best approach for your problem, once you’ve prepared the data.
2 thoughts:
- if this included all the fastai tips and tricks: wow!
- even if it doesn’t include all the fastai tweaks, it sounds relatively simple to turn this into a validation-set tester: take 5 or so different models from TPOT, with 5 different scores, predict the test set with those 5, and show the valid_scores vs test_scores graph as shown in the ML lecture
I’m still too beginner to know where this sits between brain fart and a good idea. Let me know!
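It doesn’t sound like a brain fart to me. The validation-set-tester idea can be sketched with plain sklearn; synthetic data here, and any handful of models would stand in for TPOT’s candidates:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso

# synthetic regression data standing in for a real competition set
X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_trn, X_rest, y_trn, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_tst, y_val, y_tst = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# a few different models, standing in for TPOT's top pipelines
models = [RandomForestRegressor(n_estimators=20, random_state=0),
          Ridge(alpha=1.0),
          Lasso(alpha=0.1)]

# (valid_r2, test_r2) per model: if validation scores track test scores,
# the validation set is trustworthy
scores = [(m.fit(X_trn, y_trn).score(X_val, y_val), m.score(X_tst, y_tst))
          for m in models]
```

Plotting valid vs test scores from a list like this gives exactly the graph described in the lecture.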
Installing python-graphviz definitely worked.
conda install python-graphviz
to be specific. Thanks @Brad_S
Also, TPOT looks pretty cool though I’m also in no position to judge.
Here is one other thing I haven’t been able to figure out:
In the speeding things up section, where you pick out a subset of 30,000 samples from the original set and then use the first 20,000 of those for training, my r^2 score on the original validation set is surprisingly ~0.76, compared to the ~0.86 that Jeremy gets in his lecture. I can’t explain why this is happening at all. The only possible thing I can think of is that I’ve somehow downloaded the wrong data, but I can’t see how.
Since the data isn’t in the GitHub repo, I got it from the competition page here: https://www.kaggle.com/c/bluebook-for-bulldozers/data
I downloaded the Train.zip file which is ~7 MB in size.
And then once I get to the section on the OOB score, it is computed to be ~0.85, which is in line with what is shown in the lecture.
Finally, once we start subsampling, things return to normal and the calculated r^2 scores match what is seen in the lecture.
This doesn’t feel like it’s a major problem, since we will be using subsampling anyway, but I was wondering if this happened to anyone else and what is causing this.
Find my notebook here: https://github.com/keshav-c/bulldozers/blob/master/lesson1-rf.ipynb
Edit: I was a bit too impatient here. The problem is addressed in the second half of lecture 3.
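For anyone else comparing numbers, a minimal sketch of the subset-then-validate setup, with synthetic data standing in for the bulldozers set and the sizes scaled down:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in for the bulldozers data
X, y = make_regression(n_samples=5000, n_features=8, noise=10.0, random_state=0)

# mimic the notebook: pick a random pool of 3,000 rows, train on the
# first 2,000 of those, validate on rows outside the pool
perm = np.random.RandomState(0).permutation(len(X))
pool, held_out = perm[:3000], perm[3000:]
X_trn, y_trn = X[pool[:2000]], y[pool[:2000]]

m = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=0)
m.fit(X_trn, y_trn)

valid_r2 = m.score(X[held_out], y[held_out])  # r^2 on the held-out rows
oob_r2 = m.oob_score_                         # OOB r^2 on the 2,000 training rows
```

Comparing valid_r2 and oob_r2 like this is the same diagnostic the notebook runs; a large gap between them is the kind of discrepancy described above.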
Hi all, I came across this article on the “Towards Data Science” platform, titled “Machine Learning Zero-to-Hero: Everything you need in order to compete on Kaggle for the first time, step-by-step!”[1]. It’s definitely a good read (yes, Jeremy also covers this topic). I firmly believe you’ll find it extremely useful.
[1] https://towardsdatascience.com/machine-learning-zero-to-hero-everything-you-need-in-order-to-compete-on-kaggle-for-the-first-time-18644e701cf1
If by ‘Jeremy also covers this topic’ you mean ‘I learned most of what I know from Jeremy’, then yes, correct.
Other than that, thanks a lot for the positive feedback!!
@orendar it’s a great article! FYI I went to share it on twitter, but because your twitter handle isn’t in your medium profile, it didn’t credit you properly. You may want to add it.
You’re doing great! Here’s the thing to think about: regularization penalizes coeffs that are larger. By using NB features, we don’t have to use such large coeffs to get the same result, compared to using plain binary features.
Once you’ve understood that, you’ll soon realize that NB-SVM still isn’t ideal - since we’d really like a zero coeff to represent our prior expectation as to the behavior of that feature. At that point, we can start to talk about the extension I made to NB-SVM which is the current state of the art in linear models for sentiment analysis! (Which no-one has written up yet - so if you get to the point you understand this bit, you can be the first person to put it down in writing… You’re well on the way to being there.)
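To make the first point concrete, here is a minimal sketch of the standard NB-feature scaling (the log-count ratio from NB-SVM) on a toy binarized term matrix; this is not the extension Jeremy mentions, just the baseline it builds on:

```python
import numpy as np

# toy binarized document-term matrix (rows: docs, cols: terms) and labels
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
y = np.array([1, 1, 0, 0])

# naive-bayes log-count ratio r, with +1 smoothing
p = (X[y == 1].sum(0) + 1) / ((y == 1).sum() + 1)
q = (X[y == 0].sum(0) + 1) / ((y == 0).sum() + 1)
r = np.log(p / q)

# NB features: scale each binary feature by r, so a regularized linear
# model can reach the same decision with smaller coefficients than it
# would need on the plain binary features
X_nb = X * r
```

With features pre-scaled by r, a coefficient no longer has to be large just to express what naive Bayes already knew about that term, which is exactly why the regularization penalty bites less.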
is there a way to follow you on twitter without being on twitter? rss to email or somesuch?
you know … just while I’m spending all my time on here learning like crazy and already being distracted by so much to read
How did you manage to solve this?
It’s not an error…
It always happens no matter what I do…
You can try adding
C:\ProgramData\Anaconda3\Library\bin\graphviz in the user variables section of your environment variables settings.
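If you’d rather not touch system settings, the same thing can be done per-session from Python before calling draw_tree; the path below is the one quoted above, and yours may differ:

```python
import os

# prepend the graphviz bin directory to PATH for this process only,
# so the 'dot' executable can be found
graphviz_bin = r"C:\ProgramData\Anaconda3\Library\bin\graphviz"
os.environ["PATH"] = graphviz_bin + os.pathsep + os.environ.get("PATH", "")
```

This only affects the current Python process, so it avoids messing with the PATH of the whole machine.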
When trying to execute
fi = rf_feat_importance(m, df_trn); fi[:10]
in the Feature Importance section of Notebook 2, “lesson2-rf_interpretation”, I get:
ValueError: arrays must all be same length.
Your full notebook will help…
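For context, rf_feat_importance essentially builds a DataFrame pairing column names with the model’s importances, so that ValueError usually means the two arrays differ in length, e.g. because df_trn still contains the dependent variable. A sketch of the check, with hypothetical column names and importances standing in for your data and model:

```python
import numpy as np
import pandas as pd

# m.feature_importances_ has one entry per column the model was fit on;
# the DataFrame you pass must have exactly those columns and no extras
cols = ["YearMade", "MachineHours"]   # hypothetical feature names
importances = np.array([0.7, 0.3])    # stand-in for m.feature_importances_

assert len(cols) == len(importances), "drop extra columns (e.g. the target) first"

fi = (pd.DataFrame({"cols": cols, "imp": importances})
        .sort_values("imp", ascending=False))
```

If the assert fires in your case, comparing df_trn.columns against len(m.feature_importances_) should show which column is extra.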
Hi all,
I’m attempting the Kaggle competition[1] “House Prices: Advanced Regression Techniques”. I pretty much followed the exact same approach and managed to obtain a score of 0.94 with RandomForestRegressor. My next step was to use the test data set (this competition provides training and test data separately). I did pretty much the exact same things to the test data before using the predict() function to predict the values. When I try to apply predict to the test set I get the following error: “ValueError: could not convert string to float: ‘Normal’”
I appreciate your help in resolving this issue.
Thanks !
[1] https://www.kaggle.com/c/house-prices-advanced-regression-techniques
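That error usually means a string column in the test set wasn’t converted to numeric codes the way the training set was. A minimal sketch of reusing the training set’s categories on the test set, with a toy version of one House Prices column (fastai’s apply_cats does something like this across all columns):

```python
import pandas as pd

train = pd.DataFrame({"SaleCondition": ["Normal", "Abnorml", "Normal"]})
test = pd.DataFrame({"SaleCondition": ["Normal", "Partial"]})

# learn categories on train, then reuse them on test so the same string
# always maps to the same code; unseen test values become code -1
train["SaleCondition"] = train["SaleCondition"].astype("category")
test["SaleCondition"] = pd.Categorical(
    test["SaleCondition"], categories=train["SaleCondition"].cat.categories)

codes = test["SaleCondition"].cat.codes
```

If any string columns in your test frame were left unconverted (or converted with different categories than train), predict() will hit exactly the “could not convert string to float” error.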