Another treat! Early access to Intro To Machine Learning videos

That’s covered in our existing Computational Linear Algebra course.

2 Likes

@jeremy
Do you suggest studying that course as well for a better understanding?

Great questions!

That could be useful, yes, if you have that information. Better still would be to create a categorical column with that grouping information.

Distribution doesn’t really matter for a random forest - it makes no distribution assumptions. Since we’re adding a “{name}_na” column, the decision trees can split on that as required.
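For anyone who wants to see the mechanics, a minimal pandas sketch of the idea (simplified - the actual fastai implementation differs in its details):

import numpy as np
import pandas as pd

df = pd.DataFrame({'YearMade': [1998, np.nan, 2004, np.nan]})

# Record where values were missing, then fill with the median so the
# trees can still split on the original column.
df['YearMade_na'] = df['YearMade'].isnull()
df['YearMade'] = df['YearMade'].fillna(df['YearMade'].median())
print(df)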

7 Likes

For linear-based models, handling missing values is much harder and requires a lot of domain-specific detail. My advice in this class was specific to tree-based models like RFs.

7 Likes

If you have the time, it certainly has some useful foundations for deep learning - but it only comes up when you’re doing fairly advanced stuff.

2 Likes

No, that’s only for USF’s masters students.

Yeah, right. You did mention at the start of the class that RF makes no distribution assumptions. I didn’t connect the dots. It makes sense now.

Thank you @jeremy, this is awesome indeed. I appreciate it all.

That looks nicely done! I was planning to show kg download but your tool looks better :slight_smile:

2 Likes

Can we automate it to always download the most recently launched competition’s dataset (provided it’s within a predefined size)?

1 Like

You can download the data from any Kaggle competition; you just have to pass the name of your competition, as well as the dataset you want to download, as parameters to the lib. For example, for the test set of the https://www.kaggle.com/c/planet-understanding-the-amazon-from-space competition you would use the lib this way:

# Credentials and competition name
downloader = KaggleDataDownloader("Ekami", "somePassword", "planet-understanding-the-amazon-from-space")
# Archive to fetch, and the local folder to download it into
output_path = downloader.download_dataset("test-jpg-additional.tar.7z", destination_path)

Here is how I used it to check whether the data was already present, and to download it automatically otherwise. It’s not the most efficient way to check for a dataset’s presence, but it does the job :wink:
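In case it helps, a simplified sketch of that check (the helper name and path layout here are my own, not part of the lib):

import os

def get_dataset(downloader, file_name, destination_path):
    # Only hit Kaggle if the archive isn't already on disk.
    archive_path = os.path.join(destination_path, file_name)
    if os.path.exists(archive_path):
        return archive_path
    return downloader.download_dataset(file_name, destination_path)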

7 Likes

Thanks @jeremy for sharing these new lectures! I find your take on the curse of dimensionality and the no free lunch theorem very interesting. Could you give some pointers to more of this line of thinking? Most of my understanding of statistical modeling is actually built upon these ideas, so I’d really like to know more about your counterarguments.

I’m not very familiar with random forests, and as far as I understand it, the no free lunch theorem means that if you find a generic technique that works quite well, there is always some specific knowledge, about the data set you are working with and the domain you are in, that you can use to build better models. I’d like to be proven wrong (or simply shown to be naive).

Another thing about the curse of dimensionality: you are right that when the idea is taught in textbooks, they tend to use random data that doesn’t resemble real-world data, which has intrinsic structure. I read in Kevin Murphy’s book that “provided it’s given a good distance metric and has enough labeled training data, … KNN classifier can come within a factor of 2 of the best possible performance if N approaches infinity.” I think the key here is “a good distance metric”. Because real-world data has structure, it sits near an edge of the high-dimensional space, so a good distance metric should put more weight on the directions the data lie along and less on the directions it doesn’t. I’m surprised to hear that support vector machines don’t work well, because that seems to be exactly what an SVM strives for.
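To illustrate what I mean by weighting directions, a toy numpy sketch (the weights here are made up):

import numpy as np

def weighted_distance(a, b, w):
    # Euclidean distance with per-dimension weights.
    return np.sqrt(np.sum(w * (a - b) ** 2))

x = np.array([0.0, 0.0])
p = np.array([1.0, 0.0])   # differs along an informative direction
q = np.array([0.0, 3.0])   # differs along a noisy direction
w = np.array([1.0, 0.05])  # down-weight the noisy direction

print(weighted_distance(x, p, w))  # 1.0
print(weighted_distance(x, q, w))  # ~0.67: q becomes the nearer point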

3 Likes

Another question @jeremy: you converted the string variable (high, medium, low) to categoricals and then expanded them into dummy variables. Since they are ordered categoricals, what about simply treating them as continuous variables (high=3, medium=2, low=1, for example)? That way we still keep the differences between the categories, and their order is preserved too.

1 Like

A couple of questions from 50,000 feet:

1. What are the differences between machine learning and deep learning? Given a problem statement, how do we determine whether we have an ML problem or a DL problem?

2. How is the fastai framework going to differentiate itself from existing frameworks like scikit-learn? What are the pros/cons versus established frameworks like sklearn that are already pretty rich functionality-wise?

(and thanks for the making the preview available!)

-wg

2 Likes

Thanks @jeremy for sharing the ML video. So many practical tools - particularly the {}_na field, which is a great way to give the model a signal about which column values were imputed, in case it chooses to use that.

Hope you can continue to share these ML videos. Thanks again for everything you, Rachel, and fast.ai are doing to make ML and DL accessible.

1 Like

Hi @wgpubs - I will try to explain from my perspective -

1. I think of DL (more specifically, neural networks) as one tool in the ML toolbox. For some problems neural networks, a.k.a. DL, are great. But for others, tree-based models like random forests (which Jeremy introduced in the ML video) will do just as well.

A couple of things to consider when deciding between DL and other ML models -
a. Number of training examples - A lot depends on how much data you have. If you have limited data, traditional ML methods will do just as well as, or generalize better than, NNs. DL can overfit your data, so it’s important to have lots of examples.
b. Dimensionality - If you are dealing with images or text, the number of features might be too big for traditional models. That’s not to say you can’t pass 5K dimensions to trees or use logistic regression on bag-of-words; they can act as good baseline models. But with so many features and lots of data, NNs can do better, because they have a larger number of parameters to learn and can approximate a function that fits well.
2. The fastai framework - ML/DL libraries provide three things: pre-processing (a.k.a. feature engineering), model building, and evaluation. The fastai framework provides higher-level wrappers to do these things quickly, particularly the pre-processing tasks (see the sketch at the end of this post). Also, since each function is only a few lines, we can learn how it’s done and modify or combine the functions for our own problems.

I am sure there is a lot more to both of these topics, and Jeremy might touch on them in the ML or DL course. Just sharing the thoughts I had. The best way is for you to experiment and see what works.
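For example, the kind of one-liner I mean from the course library (going from memory - the exact import and return values may differ between versions):

from fastai.structured import proc_df

# One call: replace categories with their numeric codes, fill missing
# values (adding the *_na indicator columns), and split off the target.
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')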

18 Likes

Nice points @ramesh, may I add:

In ML, you have to come up with the best features in the data and an appropriate model to train on them.

In DL, you come up with an appropriate neural net architecture, and the net will try to learn the features from the data provided. (You have to pray, though, that it does a good job =)

-Anand

4 Likes

Hi @olivier

I don’t believe he converted them to dummy variables. df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True) can be thought of as “replace High with 0, Medium with 1, and Low with 2” - the codes follow the position of each level in the category list. That’s all; nothing magical about it. Random forests and tree-based algorithms don’t need dummy variables. They don’t assign any coefficient to the feature; a tree simply splits on a condition.

In linear or logistic regression, the model learns a weight (coefficient) for each parameter (feature), and I can see how that could be a problem if we convert High / Medium / Low to 2 / 1 / 0, because we are implying not only an order but also the distances between the values. But in tree-based models like RF, there’s no coefficient: the model learns rules to split on in each tree, and no distance is implied, so there’s no need to convert them to dummy columns or even to standardize the values. Those steps may be required for other models like linear regression.
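For concreteness, a minimal pandas sketch of that mapping, on toy data rather than the actual bulldozers dataset:

import pandas as pd

df_raw = pd.DataFrame({'UsageBand': ['High', 'Low', 'Medium', 'High']})
df_raw['UsageBand'] = df_raw['UsageBand'].astype('category')
# Codes follow the position of each level in the category list.
df_raw['UsageBand'] = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)
print(df_raw.UsageBand.cat.codes.tolist())  # [0, 2, 1, 0] -> High=0, Medium=1, Low=2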

10 Likes

Great answers, @ramesh.

DL is simply a particular class of algorithms for ML.

2 Likes

Thanks for the comment @ramesh. Everything you said makes sense.

It seems to me that deep learning = neural nets with lots of data and more complex features (e.g., images, words, sentences), whereas machine learning = a wider variety of algorithms with varying amounts of data and less complex features (e.g., an Excel file).

Re: #2, so you are saying that the fastai framework will have even higher-level feature engineering functionality than sklearn? Data preparation, cleaning, and feature engineering are such a huge part of ML/DL that I’ll definitely be paying attention to how this plays out.