Another treat! Early access to Intro To Machine Learning videos

From fast.ai github repository

CPU only environment
Use this if you do not have an NVIDIA GPU. Note that you are encouraged to use Paperspace to access a GPU in the cloud by following this guide.
conda env update -f environment-cpu.yml
Anytime the instructions say to activate the Python environment, run conda activate fastai-cpu or source activate fastai-cpu.


Thanks, @gerardo for all the help :smiley:


In ML lesson 4 [6:30], Jeremy indicated that when decreasing the set_rf_samples number, we are actually decreasing the power of the estimator and increasing the correlation. But I think the correlation should decrease, right? Because in this case, we are less likely to choose the same row for each individual tree.

Actually, he drew a decreasing arrow but said increasing, so I’m quite confused.

I’m sorry if the question is basic, but some of the information in the ML course is quite hard for me to understand.

Regarding the set_rf_samples function: if the dataset is not too big and we can quickly process the whole thing, should we still use set_rf_samples, or should we build each tree with all the data? Or should we try both and compare the results, since with subsampling we get less correlation but each individual tree is less accurate?


Very basic question. I ran proc_df() on both the train data and the test data, but I am getting different column counts and thus I cannot run my model against the test data.

The only difference I can find is that the test data has one fewer column, since that is the dependent variable/y value. I am using the data from a housing price Kaggle comp.

Let me know what could cause the mismatch in column numbers! Thank you :slight_smile:

For those who might be encountering a similar issue… I found the solution in another post here: Proc_df() for machine learning course

I successfully submitted results to Kaggle and ranked at 2882 out of 4379 entries

By using the concepts from week #2, I was able to move up about 350 ranks (from 2883 to 2533).

In RandomForestRegressor, if n_estimators > 1, then what could be the size of each individual tree? How many training samples are considered for constructing each tree?

The size of each tree depends on max_depth. If max_depth=None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
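To make that concrete, here is a small sketch (my own example, not from the course) showing how those parameters interact in scikit-learn, and how to inspect how large each tree actually grew. With bootstrap=True (the default), each tree is fit on a bootstrap sample drawn with replacement, which by default has the same number of rows as the training set.

```python
# Sketch: tree-size parameters in RandomForestRegressor, and how to inspect
# the resulting trees. Data here is random and purely illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(1000, 5)
y = np.random.rand(1000)

m = RandomForestRegressor(
    n_estimators=10,      # number of trees, each built independently
    max_depth=None,       # grow until leaves are pure or hit min_samples_split
    min_samples_split=2,  # keep splitting any node with at least this many samples
    bootstrap=True,       # each tree sees a bootstrap sample of the training rows
    n_jobs=-1,
)
m.fit(X, y)

for i, est in enumerate(m.estimators_):
    print(i, est.tree_.node_count, est.tree_.max_depth)
```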

In lecture 8, what does the function set_lrs do?
The notebook says set_lrs(opt, 1e-2).
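I don’t have the source handy, but my understanding is that set_lrs simply writes the given learning rate(s) into the PyTorch optimizer’s parameter groups. A minimal sketch under that assumption (not the actual fastai implementation):

```python
# Sketch of what a helper like set_lrs(opt, lrs) presumably does: set the
# learning rate of every param group in a PyTorch optimizer. This is an
# assumption based on how it is called in the notebook, not the fastai source.
def set_lrs(opt, lrs):
    if not isinstance(lrs, (list, tuple)):
        lrs = [lrs] * len(opt.param_groups)
    for group, lr in zip(opt.param_groups, lrs):
        group['lr'] = lr

# set_lrs(opt, 1e-2) would then set every group's learning rate to 0.01.
```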

I am getting the exact same issue using a Paperspace GPU machine, with the df, y, nas = proc_df(df_raw, 'SalePrice') command. I have shut down the Paperspace instance and restarted it a few times, and clearly restarted the kernel several times.

Anyone figure this out?

Shep

I don’t think this is the same issue I had; however, on a different dataset, when I ran proc_df on my training and validation sets, I ended up with a validation set that had 92 columns vs. a training set that had 80 columns.

After a LOT of head scratching, I discovered that my validation set had missing values in a number of columns that didn’t have any in my training data. This resulted in extra columns in my validation set with a “_na” suffix on the name (e.g. “column_na”).

As mentioned, not the same issue, but might give you some ideas on where / how to hunt.
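In case it saves anyone else the head scratching: one way to keep the two frames consistent (a sketch assuming the fastai 0.7 version of proc_df, which both returns and accepts an NA dictionary, as in the df, y, nas = proc_df(...) call above) is to pass the training set’s nas back in when processing the validation set, then align any leftover columns. Frame and field names below are placeholders:

```python
# Sketch: reuse the training set's NA dictionary so both frames get the same
# "_na" indicator columns. df_trn_raw / df_val_raw and 'SalePrice' are
# placeholder names, not necessarily the dataset discussed above.
df_trn, y_trn, nas = proc_df(df_trn_raw, 'SalePrice')
df_val, y_val, _ = proc_df(df_val_raw, 'SalePrice', na_dict=nas)

# Columns with missing values only in the validation set can still add extra
# "_na" columns there; reindexing against the training columns drops them.
df_val = df_val.reindex(columns=df_trn.columns, fill_value=0)
```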

Todd

Personal opinion, but if your dataset is small enough to run very quickly when developing your model, then I would not use set_rf_samples. Looked at from the opposite side, if your dataset is so large that every time you run it you have to stare at your screen waiting for it to process (i.e. you cannot easily interact with it), then creating your initial model with a sample is really useful.
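For anyone experimenting with this, the usual pattern looks roughly like the sketch below (assuming the fastai 0.7 structured helpers; df_trn, y_trn, df_val, y_val are placeholders for data prepared with proc_df):

```python
# Sketch: set_rf_samples(n) makes each subsequent tree train on a random subset
# of n rows; reset_rf_samples() restores the default full-size bootstrap samples.
from fastai.structured import set_rf_samples, reset_rf_samples
from sklearn.ensemble import RandomForestRegressor

set_rf_samples(20000)              # fast, less-correlated trees while iterating
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, n_jobs=-1)
m.fit(df_trn, y_trn)
print(m.score(df_val, y_val))

reset_rf_samples()                 # go back to using the full dataset per tree
```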

How big is your dataset?

Hi Everyone,

In Lesson 3, @jeremy discusses the concept of feature importance. Around 1:16:00, he shows us two plots. The first plot shows the feature importance with all the variables, and the second plot shows the feature importance with only the more important variables. I don’t understand why the feature importance value of the variable Coupler System is lower in the second plot than in the first.

Regarding random forests, why is it that uncorrelated errors, when averaged out, lead to a low overall error? Why can’t it be that averaging out uncorrelated errors leads to a high error? Can someone please explain?
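Not a full answer, but a quick way to convince yourself numerically (a toy simulation of my own, not from the course): if each tree’s error is roughly zero-mean noise that is uncorrelated across trees, the errors partially cancel when averaged, so the spread of the averaged error shrinks roughly like 1/sqrt(n). If the errors were perfectly correlated, averaging would not help at all.

```python
# Toy simulation: averaging uncorrelated zero-mean errors shrinks the error;
# averaging perfectly correlated errors does not.
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_points = 100, 10000

uncorrelated = rng.normal(0, 1, size=(n_trees, n_points))  # independent per tree
shared = rng.normal(0, 1, size=(1, n_points))
correlated = np.repeat(shared, n_trees, axis=0)             # identical per tree

print(uncorrelated.mean(axis=0).std())  # roughly 1/sqrt(100) = 0.1
print(correlated.mean(axis=0).std())    # roughly 1.0, no improvement
```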

For anyone who plays with Cython from lecture 7, here are a couple of tricks/tips which I learnt the slow way:

  1. You cannot have a comment preceding the %%cython declaration (a comment can come AFTER the %%cython line).

  2. You cannot run %timeit in the same cell as the %%cython code, as it will produce an error; it needs to go in a separate cell.

There are no doubt some other ‘quirks’ with %%cython, but these were the ones which tripped me up initially.
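To make both points concrete, here is roughly what a working pair of cells looks like (my own toy example, assuming the Cython Jupyter extension is already loaded; the fib function is just an illustration):

```python
%%cython
# Note the comment comes AFTER the %%cython line, never before it.
def fib_cy(int n):
    cdef int a = 0, b = 1, i
    for i in range(n):
        a, b = b, a + b
    return a
```

```python
# %timeit goes in a separate cell from the %%cython code.
%timeit fib_cy(30)
```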

Todd

P.S. For anyone wondering why I have n = 2000**2000: I was just playing with larger numbers to see the impact.


Hoping to get advice/guidance on how to handle large files so that I can run a random forest.

The data is 7GB and it’s from a Kaggle comp called TalkingData AdTracking Fraud Detection Challenge. I was able to load the data by specifying the data types in a dictionary and passing that to read_csv(), but as soon as I started trying to process the data, I started hitting memory errors. Specifically, I tried running add_datepart() and to_feather(). For additional context, I am using Gradient on Paperspace with a GPU machine that has 30GB RAM and 8 cores. Given this, I was wondering what the best way is to process large files and run random forests.

From what I found searching other forum threads, it seems like people are splitting the files, but I was hoping someone has a specific example they can share here. Thank you!
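For reference, here is a rough sketch of the dtype-dictionary approach described above, combined with chunked reading to keep peak memory down (column names, dtypes, and file names are illustrative placeholders, not the exact competition schema):

```python
# Sketch: pass explicit, smaller dtypes to read_csv and read the file in chunks,
# doing any per-chunk processing before concatenating.
import pandas as pd

dtypes = {            # illustrative column names/dtypes
    'ip': 'uint32',
    'app': 'uint16',
    'device': 'uint16',
    'os': 'uint16',
    'channel': 'uint16',
    'is_attributed': 'uint8',
}

chunks = []
for chunk in pd.read_csv('train.csv', dtype=dtypes,
                         parse_dates=['click_time'], chunksize=5_000_000):
    # cheap per-chunk processing (e.g. date features) would go here
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
df.to_feather('train_processed.feather')   # requires pyarrow
```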

Update!! - Found the following post which gave me the answers. Not sure why I didn’t find it earlier: Most effective ways to merge “big data” on a single machine

ML Lesson 1: I perform the same steps on the provided test data, including train_cats, but when predicting, the model still encounters some string data in the test set. How do I get around that?

How do I change the actual test set into categorical variables? I apply train_cats() on the test set, but when I run m.predict(test), it shows that the strings are unchanged.

I think he says somewhere that when you look at the dataframe it will still show the values as strings, but when you call m.predict it will actually use the numbers. I remember it being somewhere in the lectures; I will look tonight.

:slight_smile:
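For reference, a common pattern (a sketch assuming the fastai 0.7 structured helpers, not necessarily the course’s exact code) is to copy the training categories onto the test frame with apply_cats and then run proc_df on it, so the model only ever sees the numeric codes:

```python
# Sketch: apply_cats reuses the categories learned by train_cats(df_raw) on the
# test frame, and proc_df converts them to numeric codes before predicting.
from fastai.structured import apply_cats, proc_df

apply_cats(test, df_raw)           # same category-to-code mapping as the train set
test_proc, _, _ = proc_df(test)    # categories -> numeric codes, NAs handled
preds = m.predict(test_proc)       # no raw strings reach the fitted model
```

If the column counts still end up differing, passing the na_dict returned by the training proc_df call may help, as discussed earlier in the thread.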
