Another treat! Early access to Intro To Machine Learning videos

In lecture 8, what does the function set_lrs do?
The notebook says set_lrs(opt, 1e-2).
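In case it helps, here’s a minimal sketch of what set_lrs appears to do (this is my reading of the fastai 0.7 code, so treat it as an assumption): it simply overwrites the learning rate of every param group on a plain PyTorch optimizer.

```python
import torch
from torch import nn, optim

# Sketch: set the learning rate on every param group of the optimizer.
def set_lrs(opt, lr):
    for pg in opt.param_groups:
        pg['lr'] = lr

net = nn.Linear(10, 2)                      # stand-in model
opt = optim.SGD(net.parameters(), lr=1e-1)
set_lrs(opt, 1e-2)                          # like the notebook's set_lrs(opt, 1e-2)
print(opt.param_groups[0]['lr'])            # -> 0.01
```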

I am getting the exact same issue using a Paperspace GPU machine, with the df, y, nas = proc_df(df_raw, 'SalePrice') command. I have shut down the Paperspace instance and restarted it a few times, and of course restarted the kernel several times.

Anyone figure this out?

Shep

I don’t think this is the same issue as the one I had; however, on a different dataset, when I ran proc_df on my training and validation sets, I ended up with a validation set that had 92 columns vs. a training set with 80 columns.

After a LOT of head scratching, I discovered that my validation set had missing values in a number of columns which didn’t have any in my training data. This resulted in extra columns in my validation set, with a “_na” suffix on the name (e.g. “column_na”).
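For anyone hitting the same thing, here’s a minimal sketch of the pattern that fixed it for me, assuming fastai 0.7’s proc_df: capture the nas dict from the training run, pass it back via na_dict for the validation run, then reindex away any _na columns that only appear in validation (the toy frames are made up just to show the mechanics):

```python
import numpy as np
import pandas as pd
from fastai.structured import proc_df  # fastai 0.7 / ML course library

# Toy frames: validation has a missing value in a column ('b') that is
# complete in training, which is exactly what creates the extra _na column.
train_raw = pd.DataFrame({'a': [1.0, 2.0, np.nan], 'b': [1.0, 2.0, 3.0],
                          'SalePrice': [10, 20, 30]})
valid_raw = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 3.0],
                          'SalePrice': [15, 25, 35]})

df_trn, y_trn, nas = proc_df(train_raw, 'SalePrice')
# Reusing the training na_dict keeps the fills and a_na consistent;
# b_na (validation-only) still appears, so reindex to the training columns.
df_val, y_val, nas = proc_df(valid_raw, 'SalePrice', na_dict=nas)
df_val = df_val.reindex(columns=df_trn.columns, fill_value=0)
```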

As mentioned, not the same issue, but might give you some ideas on where / how to hunt.

Todd

Personal opinion, but if your dataset is small enough that it runs very quickly while you’re developing your model, then I would not use set_rf_samples. Looked at from the opposite side, if your dataset is so large that every time you run it you have to stare at your screen waiting for it to process (i.e. you cannot easily interact with it), then building your initial model on a sample is really useful.
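To make that concrete, here’s a sketch of the workflow with fastai 0.7’s set_rf_samples / reset_rf_samples (X_train and y_train are placeholders for your own processed data):

```python
from sklearn.ensemble import RandomForestRegressor
from fastai.structured import set_rf_samples, reset_rf_samples  # fastai 0.7

set_rf_samples(20000)     # each tree now bootstraps only 20k rows
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
m.fit(X_train, y_train)   # fast enough to iterate on interactively

reset_rf_samples()        # restore full bootstrapping for the final model
m.fit(X_train, y_train)
```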

How big is your dataset?

Hi Everyone,

In Lesson 3, @jeremy discusses the concept of feature importance. Around 1:16:00, he shows us two plots: the first shows the feature importance for all the variables, and the second shows it for only the more important variables. I don’t understand why the feature importance value of the variable Coupler System is lower in the second plot than in the first.

Regarding random forests, why is it that uncorrelated errors, when averaged out, lead to a low overall error? Why couldn’t averaging uncorrelated errors lead to a high error instead? Can someone please explain?
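Not a full answer, but a quick numerical check anyone can run: if the errors are zero-mean and uncorrelated, they don’t systematically reinforce each other, so their average has roughly 1/sqrt(n) of the spread of a single one (the numbers below are purely illustrative):

```python
import numpy as np

# 10,000 trials of averaging 100 uncorrelated, zero-mean "tree errors":
# the averaged error's spread shrinks by about a factor of 10.
rng = np.random.default_rng(0)
errors = rng.normal(0, 1, size=(10000, 100))
print(errors[:, 0].std())         # ~1.0: spread of a single tree's error
print(errors.mean(axis=1).std())  # ~0.1: spread of the 100-tree average
```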

For anyone who plays with Cython from lecture 7, here are a couple of tricks/tips which I learnt the slow way:

  1. You cannot have a comment preceding the %%cython declaration (a comment CAN come after the %%cython line).

  2. You cannot run %timeit in the same cell as the %%cython code, as it will produce an error; it needs to go in a separate cell.

There are no doubt some other ‘quirks’ with %%cython, but these were the ones which tripped me up initially; a minimal pair of cells illustrating both is below.
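Sketch of the cell layout that works for me (cy_fib is just a placeholder, not the lecture’s exact code):

```python
%%cython
# comments are fine here, AFTER the %%cython line
def cy_fib(int n):
    cdef int i
    cdef long long a = 0, b = 1
    for i in range(n):
        a, b = b, a + b
    return a
```

```python
# %timeit has to live in its own cell, separate from the %%cython one
%timeit cy_fib(30)
```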

Todd

P.S. For anyone wondering why I have n = 2000**2000: I was just playing with larger numbers to see the impact.


Hoping to get advice/guidance on how to handle large files so that I can run random forests.

The data is 7 GB and it’s from a Kaggle comp called TalkingData AdTracking Fraud Detection Challenge. I was able to load the data by specifying the data types in a dictionary and passing that to read_csv(), but as soon as I started trying to process the data, I started hitting memory errors, specifically when running add_datepart() and to_feather(). For additional context, I am using Gradient on Paperspace with a GPU machine which has 30 GB RAM and 8 cores. Given this, I was wondering what’s the best way to process large files and run random forests.

From what I’ve searched on other forum threads, it seems like people are splitting the files, but I was hoping someone has a specific example they can share here. Thank you!
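In case it helps anyone else, here’s a sketch of the splitting idea using pandas’ chunked reader (the column names and dtypes are from the competition’s train.csv; the per-chunk feature is just an example):

```python
import pandas as pd

dtypes = {'ip': 'uint32', 'app': 'uint16', 'device': 'uint16',
          'os': 'uint16', 'channel': 'uint16', 'is_attributed': 'uint8'}

chunks = []
for chunk in pd.read_csv('train.csv', dtype=dtypes,
                         parse_dates=['click_time'], chunksize=5_000_000):
    # do cheap per-chunk processing here instead of on the full frame
    chunk['click_hour'] = chunk.click_time.dt.hour
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
```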

Update!! - Found the following post which gave me the answers. Not sure why I didn’t find it earlier: Most effective ways to merge “big data” on a single machine

ML Lesson 1: I perform the same steps on the provided test data, including train_cats, but while predicting, the model still recognises some string data in the test set. How do I get past that?

How do I change the actual test set into categorical variables? I apply train_cats() on the test set, but when I run m.predict(test), it shows that the strings are unchanged.

I think he says somewhere that when you look at the DataFrame it will still show you the strings, but when you call m.predict it will actually use the numbers. I remember it being somewhere in the lectures; I will look tonight.

:slight_smile:


This is after applying train_cats, and setting the UsageBand column to codes.

If I’m not mistaken, you should use apply_cats() on the test set.

It still doesn’t work. proc_df changes the categorical values into numbers, but since we don’t have a y variable in the test set, that won’t work. So how do I change the categorical data into numeric data?
I would appreciate it if you shared a link to one of your kernels showing this.

I did a quick running example using the bulldozers dataset and RFs. I tried to show the before/after of each step.

I apply apply_cats to the test set. Then how do I change those categorical values to numbers, since proc_df is only for the training set with the target variable? Also, could you please share one of your kernels with this application?

Via .cat.codes, assigning the codes back to the DataFrame…
It’s in the notebook.
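A sketch of that step, assuming fastai 0.7’s apply_cats and a df_train/df_test pair that have already been through train_cats (the +1 mirrors what proc_df does, so missing values map to 0):

```python
import pandas as pd
from fastai.structured import apply_cats  # fastai 0.7

apply_cats(df_test, df_train)  # copy training category orderings onto test

# Replace each categorical column with its integer codes, +1 so NaN -> 0
for col in df_test.columns:
    if pd.api.types.is_categorical_dtype(df_test[col]):
        df_test[col] = df_test[col].cat.codes + 1
```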

I am getting the same issue on Paperspace today; strangely, it ran fine yesterday. Hmm… what is the issue, and what could be the fix?

my setup:
MACHINE TYPE: P4000 HOURLY
REGION: CA1
RAM: 30 GB
CPUS: 8
HD: 34.7 GB / 250 GB
GPU: 8 GB

Managed to get it to work; had to comment out the save to feather for some reason…
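For what it’s worth, the usual culprit I’ve seen for that feather save failing on a fresh machine is the missing tmp/ directory the notebook writes into; creating it first may let you keep the save (just a guess, since I can’t see your exact error):

```python
import os

os.makedirs('tmp', exist_ok=True)        # the notebook saves into tmp/
df_raw.to_feather('tmp/bulldozers-raw')  # df_raw as built earlier in lesson 1
```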