Another treat! Early access to Intro To Machine Learning videos

If you are using RandomForest, maybe you can try oversampling the rare class until the distribution is balanced? (I remember Jeremy has talked about this trick, but I haven’t used it myself.)
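Something like this, perhaps (untested; sklearn.utils.resample does the sampling, and the df/label names are just made up for illustration):

import pandas as pd
from sklearn.utils import resample

# Hypothetical frame: 'label' == 1 is the rare class we want to oversample.
majority = df[df.label == 0]
minority = df[df.label == 1]

# Draw minority rows with replacement until both classes are the same size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_up])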


I guess (not sure) it won’t help per se, because for an RF it’s still the same information whether a row appears as a single copy or as multiple copies…?

@radek @alessa @ramesh @ericpb (sorry all)

First of all, thanks for this series Jeremy, it’s excellent. I have a question about the second lecture that I hope someone can help me with.

In the fastai library, in the structured.py file there is this line on the starting imports bit

from sklearn.ensemble import forest

I’ve looked through the sklearn API docs but I’m unable to find forest.

Could someone please help me with this? All I am able to understand is that set_rf_samples changes something inside sklearn itself, and I’d like to understand how this works and whether it’s something we can use with other algorithms like SVM.

Hi, also in the second lesson, while executing the cell:

draw_tree(m.estimators_[0], df_trn, precision = 3)

I get the following error:

failed to execute ['dot', '-Tsvg'], 
make sure the Graphviz executables are on your systems' PATH

I thought maybe graphviz wasn’t installed, but import graphviz works in my fastai environment. I’m running this code on my Windows 10 laptop. Also, since the whole thing is installed in the fastai conda environment, I don’t know if I should be messing around with the PATH.

See envs\fastai\Lib\site-packages\sklearn\ensemble\forest.py for the module’s documentation (quoted below) and what it imports.

“”"Forest of trees-based ensemble methods

Those methods include random forests and extremely randomized trees.

The module structure is the following:

  • The BaseForest base class implements a common fit method for all
    the estimators in the module. The fit method of the base Forest
    class calls the fit method of each sub-estimator on random samples
    (with replacement, a.k.a. bootstrap) of the training set.

    The init of the sub-estimator is further delegated to the
    BaseEnsemble constructor.

  • The ForestClassifier and ForestRegressor base classes further
    implement the prediction logic by computing an average of the predicted
    outcomes of the sub-estimators.

  • The RandomForestClassifier and RandomForestRegressor derived
    classes provide the user with concrete implementations of
    the forest ensemble method using classical, deterministic
    DecisionTreeClassifier and DecisionTreeRegressor as
    sub-estimator implementations.

  • The ExtraTreesClassifier and ExtraTreesRegressor derived
    classes provide the user with concrete implementations of the
    forest ensemble method using the extremely randomized trees
    ExtraTreeClassifier and ExtraTreeRegressor as
    sub-estimator implementations.

Single and multi-output problems are both handled.
"""
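To tie this back to the set_rf_samples question: it doesn’t change sklearn’s public API at all. It monkey-patches a private helper inside this forest module. From memory, the fastai implementation is roughly:

from sklearn.ensemble import forest

def set_rf_samples(n):
    """Patch sklearn so every tree is fit on n rows sampled with replacement,
    rather than a full-size bootstrap of the training set."""
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n))

def reset_rf_samples():
    """Restore sklearn's default full-size bootstrap."""
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n_samples))

Since the patch targets a random-forest internal, it can’t carry over to something like SVM; there you’d subsample the training data yourself before calling fit.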


I just came across TPOT: https://epistasislab.github.io/tpot/
It looks like it tries a bunch of machine learning techniques from scikit-learn and identifies the best approach for your problem, once you’ve prepared the data.
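Driving it looks roughly like this, if I’m reading the TPOT docs right (parameters are just illustrative, and X/y are assumed to be your already-prepared features and target):

from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25)

# Evolve scikit-learn pipelines for a few generations, keeping the best scorer.
tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_valid, y_valid))
tpot.export('best_pipeline.py')  # writes the winning pipeline out as Python code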

2 thoughts:

  1. If this included all the fastai tips and tricks: wow!
  2. Even if it doesn’t include all the fastai tweaks, it sounds relatively simple to turn this into a validation-set tester: take 5 or so different models from TPOT, with 5 different scores, predict the test set with those 5, and plot the valid_scores vs test_scores graph shown in one of the ML lectures.

I’m still too much of a beginner to know where this sits between a brain fart and a good idea. Let me know!


Installing python-graphviz definitely worked.

conda install python-graphviz

to be specific. Thanks @Brad_S

Also, TPOT looks pretty cool, though I’m in no position to judge it either.


Here is one other thing I haven’t been able to figure out:

In the speeding things up section, where you pick out a subset of 30,000 samples from the original set and then use the first 20,000 of those for training, my r^2 score on the original validation set is surprisingly ~0.76, compared to the ~0.86 that Jeremy gets in the lecture. I can’t explain why this is happening. The only thing I can think of is that I’ve somehow downloaded the wrong data, but I can’t see how.
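For reference, the cell I’m running is essentially the one from the lesson notebook (proc_df, split_vals and print_score come from the notebook / the old fastai structured module; depending on the fastai version, proc_df may also return an na_dict):

from sklearn.ensemble import RandomForestRegressor

# Take a random subset of 30,000 rows, then the first 20,000 for training.
df_trn, y_trn = proc_df(df_raw, 'SalePrice', subset=30000)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)

m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)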

Since the data isn’t in the github repo, I got it from the competition page here: https://www.kaggle.com/c/bluebook-for-bulldozers/data

I downloaded the Train.zip file which is ~7 MB in size.

And then once I get to the section on the OOB score, it comes out to ~0.85, which is in line with what is shown in the lecture.
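In case it matters, that OOB number is just sklearn’s built-in one, roughly:

from sklearn.ensemble import RandomForestRegressor

# oob_score=True makes each tree score the rows left out of its bootstrap.
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print(m.oob_score_)  # ~0.85 here, matching the lecture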

Finally once we start subsampling, things return to normal and the calculated r^2 scores are in line with what is seen in the lecture.

This doesn’t feel like a major problem, since we will be using subsampling anyway, but I was wondering whether this has happened to anyone else and what might be causing it.

Find my notebook here: https://github.com/keshav-c/bulldozers/blob/master/lesson1-rf.ipynb

Edit: I was a bit too impatient here. The problem is addressed in the second half of lecture 3.

Hi all, I came across an article on the “Towards Data Science” platform titled “Machine Learning Zero-to-Hero: Everything you need in order to compete on Kaggle for the first time, step-by-step” [1]. It is definitely a good read. Yes, Jeremy also covers this material, but I firmly believe you’ll find it extremely useful.
[1] https://towardsdatascience.com/machine-learning-zero-to-hero-everything-you-need-in-order-to-compete-on-kaggle-for-the-first-time-18644e701cf1


If by ‘Jeremy also covers this topic’ you mean ‘I learned most of what I know from Jeremy’ then yes, correct :wink:

Other than that, thanks a lot for the positive feedback!! :slight_smile:


@orendar it’s a great article! FYI I went to share it on twitter, but because your twitter handle isn’t in your medium profile, it didn’t credit you properly. You may want to add it.

You’re doing great! :slight_smile: Here’s the thing to think about: regularization penalizes coeffs that are larger. By using NB features, we don’t have to use such large coeffs to get the same result, compared to using plain binary features.

Once you’ve understood that, you’ll soon realize that NB-SVM still isn’t ideal - since we’d really like a zero coeff to represent our prior expectation as to the behavior of that feature. At that point, we can start to talk about the extension I made to NB-SVM which is the current state of the art in linear models for sentiment analysis! :slight_smile: (Which no-one has written up yet - so if you get to the point you understand this bit, you can be the first person to put it down in writing… You’re well on the way to being there.)
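To make the NB-features point concrete, here’s a minimal sketch of the idea as I understand it (my own illustration, not Jeremy’s exact code, and certainly not his unpublished extension; X_trn/X_val are assumed to be dense 0/1 document-term matrices with 0/1 labels): compute the naive-Bayes log-count ratio r, scale the features by it, and fit a regularized linear model.

import numpy as np
from sklearn.linear_model import LogisticRegression

def log_count_ratio(X, y, alpha=1.0):
    # Smoothed term counts per class, then the log ratio of normalized counts.
    p = alpha + X[y == 1].sum(axis=0)  # term counts in positive docs
    q = alpha + X[y == 0].sum(axis=0)  # term counts in negative docs
    return np.log((p / p.sum()) / (q / q.sum()))

r = log_count_ratio(X_trn, y_trn)
clf = LogisticRegression(C=4.0)   # C controls the regularization strength
clf.fit(X_trn * r, y_trn)         # NB-scaled features need smaller coeffs
preds = clf.predict(X_val * r)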

Is there a way to follow you on twitter without being on twitter? RSS to email or some such?
You know… just while I’m spending all my time on here learning like crazy and already being distracted by so much to read :slight_smile:

How did you manage to solve this?

It’s not an error…
It happens no matter what I do…

You can try adding
C:\ProgramData\Anaconda3\Library\bin\graphviz to the user variables section of your environment variables.
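Alternatively, you can prepend it for just the current session from inside the notebook, before calling draw_tree (the path below is a guess; point it at wherever your conda graphviz binaries actually live):

import os

# Hypothetical install location; adjust for your machine.
os.environ['PATH'] += os.pathsep + r'C:\ProgramData\Anaconda3\Library\bin\graphviz'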

When trying to execute

fi = rf_feat_importance(m, df_trn); fi[:10]

in the Feature Importance section of notebook 2, “lesson2-rf_interpretation”, I get:

ValueError: arrays must all be same length.
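For context, rf_feat_importance in fastai’s structured.py is (as far as I remember) just this, so that ValueError usually means df_trn has a different number of columns than the data the model was fitted on:

import pandas as pd

def rf_feat_importance(m, df):
    # One row per feature: the column name plus the forest's importance for it.
    # pd.DataFrame raises "arrays must all be same length" when len(df.columns)
    # differs from len(m.feature_importances_).
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                        ).sort_values('imp', ascending=False)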

Sharing your full notebook would help…

Hi all,
I’m attempting the Kaggle competition “House Prices: Advanced Regression Techniques” [1]. I followed pretty much the exact same approach and managed to obtain a score of 0.94 with RandomForestRegressor. My next step was to use the test data set (this competition provides training and test data separately). I did pretty much the same things to the test data before using the predict() function. When I try to apply predict to the test set I get the following error: “ValueError: could not convert string to float: ‘Normal’”.

I appreciate your help in resolving this issue.
Thanks!

[1] https://www.kaggle.com/c/house-prices-advanced-regression-techniques
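That error usually means a string column (quite possibly SaleCondition, whose values include ‘Normal’) was never numericalized in the test frame. A hedged sketch of one way to keep train and test preprocessing in sync with the old fastai helpers (variable names assumed; depending on the fastai version, proc_df may return two or three values):

from fastai.structured import train_cats, apply_cats, proc_df
from sklearn.ensemble import RandomForestRegressor

train_cats(df_raw)            # build category types on the training frame
apply_cats(df_test, df_raw)   # reuse the SAME category codes on the test frame

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')
df_tst, _, _ = proc_df(df_test, na_dict=nas)  # reuse train's NA fill values

m = RandomForestRegressor(n_jobs=-1)
m.fit(df_trn, y_trn)
preds = m.predict(df_tst)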