Another treat! Early access to Intro To Machine Learning videos

Very much so. You can see this in lesson 1. Lots of stuff being used that isn’t in sklearn, but is vital to getting good results.

6 Likes

Very cool @jeremy.

I was going to ask how much data prep was going to be covered in the DL course since I find myself going full OCD when it comes to the question, “Is my data in the best possible format with the best possible features for my ML problem?”.

I’m glad to see that these things will be discussed in the ML course you’ve graciously made available to the class. If there is a way to contribute I’d love to hear about that as well since I have a few helper methods I use with pandas to both understand and clean data that might be of use to others.

1 Like

Sounds interesting. If you could create a Kaggle kernel or Jupyter notebook showing some examples I’d love to steal any ideas that look helpful :slight_smile:

Jeremy, thank you for all this generous effort with all of us. A couple of thoughts I had while watching the video, in case they can be of use:

I guess it is deliberate, for clarity, that you don’t mention the reason why random forests are so resilient to overfitting (randomization of rows by bootstrapping and of columns by random sampling). Maybe there was no time for that, but it’s a pity… even if it isn’t necessary for using RF as a tool, it was an eye-opener for me when I understood how “randomness” led to accuracy.
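For anyone who wants to poke at those two sources of randomness directly, here is a minimal sketch (mine, not from the lecture) using scikit-learn on synthetic data; the parameter values are only illustrative:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=42)

rf = RandomForestRegressor(
    n_estimators=100,
    bootstrap=True,    # row randomization: each tree trains on a bootstrap sample of rows
    max_features=0.5,  # column randomization: each split considers a random half of the features
    oob_score=True,    # evaluate each tree on the rows it never saw
    random_state=42,
)
rf.fit(X, y)
print(rf.oob_score_)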

About date preprocessing, the only caveat is being able to detect when dates are completely unrelated to the ground truth; in that case, the deeper the feature extraction from them, the more over-fittable date-related noise will be represented in our dataset. (These are the cases in which the date should be removed from the dataset instead of processed.)
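To make that caveat concrete, here is a hypothetical pandas sketch (the column name is made up; this is not the fastai add_datepart code). The more of these columns you extract, the more date-related noise the trees can latch onto if the date really is unrelated to the target:

import pandas as pd

df = pd.DataFrame({'saledate': pd.to_datetime(['2017-01-15', '2017-06-30', '2017-11-01'])})

df['saleYear'] = df['saledate'].dt.year           # extracted date features
df['saleMonth'] = df['saledate'].dt.month
df['saleDayofweek'] = df['saledate'].dt.dayofweek
df = df.drop('saledate', axis=1)                  # or drop everything date-related if it is pure noise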

Again, thank you, I enjoyed and learned, and once again stole some hours from my sleep… but it was worth it!

4 Likes

We’ll be covering all those topics in detail. :slight_smile:

2 Likes

Thanks @ramesh, you are right, @jeremy didn’t convert it to dummy variables. I had extrapolated too much when he said the order doesn’t matter because the categorical variable will be interpreted as one-to-many comparisons; I just assumed it would be converted implicitly, because I read in the sklearn documentation that “such integer representation (of categoricals) can not be used directly with scikit-learn estimators, as these expect continuous input, and would interpret the categories as being ordered, which is often not desired”. But I guess the tree-based algorithms must have their own way of dealing with categoricals.

Anyway, this begs another question: if the categoricals are not augmented with orders that don’t exist, what happens to the continuous variables? To the computer there is no difference between them; they are all just numbers. It seems (I’m not entirely sure) that in tree-based algorithms the continuous variables are cut into discrete categories and then used like unordered categoricals in a similar fashion. If the categoricals are used with no order implied, as you said, then the continuous variables (and ordered categoricals) must be used with their order abandoned. That would seem like a great waste of information.

It seems to be an over-do/under-do conundrum: if the ordered categoricals are treated as continuous, that assumes a distance between them, which is too much; but if they are simply treated as unordered, then it’s too little.
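One way to see what a tree actually does with a continuous variable is to print its split rules. Here is a toy sketch I put together (recent scikit-learn, made-up data, not from the lecture); the splits it prints are threshold comparisons on the raw values, which is how the ordering gets used:

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

x = np.arange(10).reshape(-1, 1).astype(float)   # a single continuous feature
y = np.array([1, 1, 1, 2, 2, 2, 5, 5, 9, 9], dtype=float)

tree = DecisionTreeRegressor(max_depth=2).fit(x, y)
print(export_text(tree, feature_names=['x']))    # every split is of the form "x <= threshold"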

Great explanation of the randomness and its role in avoiding overfitting, Miguel. I just want to add that another reason random forests do not generally overfit is that they are forests: being a forest, the model makes predictions by averaging over the decision trees in it, and model averaging is very effective at preventing overfitting.
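If it helps, here is a quick sketch (mine, on synthetic data) showing that a scikit-learn forest’s prediction really is just the average of its trees’ predictions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

per_tree = np.stack([t.predict(X) for t in rf.estimators_])   # shape: (n_trees, n_samples)
print(np.allclose(per_tree.mean(axis=0), rf.predict(X)))      # True: forest = average of its trees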

This also might explain why support vector machines can underperform compared to random forests. I think SVMs and the decision trees in a forest are analogous in that they are both sparse methods, which prevents overfitting, and a random forest adds another layer of averaging, which makes it better than an SVM. Sparse kernel methods with model averaging (relevance vector machines, for example) might be a worthier foe for random forests.

I’m just speculating because I have no hands-on experience with these models. Glad to have more input!

1 Like

b. Size of Dimensions - If you are dealing with images or text, the number of dimensions or features might be too big for traditional models. This is not to say you can’t pass 5K dimensions to trees or use logistic regression on bag-of-words; they can act as good baseline models. But with so many features and lots of data, a NN can do better, because it has a larger number of parameters to learn and can try to approximate a function that fits well.
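For the bag-of-words baseline mentioned above, a minimal hypothetical sketch (made-up texts and labels, default parameters) could look like this:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie", "terrible movie", "loved it", "hated it"]
labels = [1, 0, 1, 0]

# bag-of-words features feeding a logistic regression baseline
baseline = make_pipeline(CountVectorizer(), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["great, loved it"]))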

Wouldn’t this contradict Jeremy’s argument that the curse of dimensionality is meaningless :grin:

Olivier,

I see Jeremy has already answered that all these topics will be covered, so for sure in future lessons he will shed light on all of this.

But as you are comparing SVMs with RFs, I must warn that they are quite different in many ways, also regarding overfitting. Assuming we are talking about linear SVMs, they will tend to underfit more than overfit, for structural reasons: the decision boundaries are linear, and it is hard to overfit with straight lines, just like with linear regression. (SVMs with kernels are another story and can overfit like hell.)

On the other hand, about your statement that “forests do not generally overfit”: well, they can overfit a lot. Let me explain my point. Random forests have a conceptual simplicity that almost looks like magic, but the key to avoiding overfitting is not just averaging. If you average correlated models/trees, you end up overfitting. The key is the randomness of this averaging, and there is a lot to think about here, with randomization along two axes. Breiman stated that column-wise sampling was an important “parameter” to tune, as it determines how correlated the created trees are. And if your column-wise sampling fraction moves far from the 0.5-0.6 range, you get strongly correlated trees that, again, can overfit like hell. :slight_smile:
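To make that knob concrete, here is a rough scikit-learn sketch (synthetic data, illustrative values only) comparing out-of-bag scores for different column-sampling fractions; in scikit-learn, max_features plays the role of Breiman’s column-wise sampling parameter:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=30, noise=20, random_state=1)

for frac in [0.2, 0.5, 0.8, 1.0]:   # 1.0 = no column sampling -> more strongly correlated trees
    rf = RandomForestRegressor(n_estimators=100, max_features=frac,
                               oob_score=True, random_state=1).fit(X, y)
    print(frac, round(rf.oob_score_, 3))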

3 Likes

wgpubs, as you say, from 50,000 feet,

As for ML vs. DL, in my opinion DL is by now a rather hyped term that refers to a subset of ML.

But the future might change this… If we were speaking of toys for a toddler, you have on one hand a doll house, made with every detail, that she just loves to play with… and on the other hand a bunch of Lego pieces with which you can also build a doll house, just not as easy to play with as the other one.

So they are both in the “toy” category… but there is a key difference, mainly flexibility, and flexibility means potential. Given enough time and a proper methodology, with Lego pieces you can create many things, including things that “transcend” the toy category and become something more than a toy.

So that metaphor is my 2 cents for this debate: DL is today a subset of ML with such built-in flexibility that it will be able to make a qualitative jump in a (possibly not that distant) future.

I come from a Bayesian background, so when I talked about averaging I had posterior probabilities and information criteria in mind; it might be quite different with random forests, and I really have very limited knowledge of random forests… I’m sure many of my questions will be answered in later lectures, so I’m looking forward to it! And it’s generally not a good idea to say “generally” anyway :joy:

Here you go @jeremy

Will be adding to this, but wanted to make available a few of my more used and interesting helper methods. Any feedback and/or recommendations from the community here would be very welcome.

2 Likes

Always inspired by how good @jeremy is at explaining difficult things clearly, interestingly and deeply at the same time.

3 Likes

Is there anyone else who had issues with Crestle while following along with the first video? I noticed that Crestle doesn’t have the up-to-date libraries that Jeremy was working with in the video. Here’s what mine looks like:

Am I missing something?

Thanks for the tip… sadly it’s the same situation…

Have you got your Crestle notebook working and all?

git clone https://github.com/fastai/fastai.git

1 Like

FYI, I am working through this lesson using Paperspace, and I had to install a few extra things in order to make everything happy. Sharing here in case anyone runs into the same issue.

You will need to install the following:

  • pip install graphviz
  • pip install sklearn_pandas
  • pip install feather-format
  • sudo apt install graphviz
8 Likes

Import failing at line:

from fastai.structured import *

which internally calls import IPython, graphviz and throws the error:

ModuleNotFoundError: No module named ‘graphviz’

Should I remove ‘import IPython, graphviz’? I’m experimenting on Crestle.

Why remove?
Open a terminal, or use the inbuilt commands in Jupyter to install them…
pip3 install ...

I am trying Crestle. The pip3 install ... works fine, but I cannot sudo apt install graphviz because it requires sudo privileges (which, it seems, are not granted to standard Crestle users).