CPU-only environment
Use this if you do not have an NVIDIA GPU. Note that you are encouraged to use Paperspace to access a GPU in the cloud by following this guide.
conda env update -f environment-cpu.yml
Anytime the instructions say to activate the Python environment, run conda activate fastai-cpu or source activate fastai-cpu.
In ML Lesson 4 [6:30], Jeremy indicated that when decreasing the set_rf_samples number, we are actually decreasing the power of each estimator and increasing the correlation. But I think the correlation should decrease, right? Because in this case, we are less likely to choose the same rows for each individual tree.
Actually, he drew a decreasing arrow but said "increasing", so I'm quite confused.
I'm sorry if the question is very basic, but some of the information in the ML course is quite hard for me to understand.
Regarding the set_rf_samples function: if the dataset is not too big and we can process the whole thing quickly, should we still use set_rf_samples, or should we build each tree on all the data? Or should we try both and compare, since with subsampling we get less correlation between trees but each individual tree is less accurate?
Very basic question: I ran proc_df() on both the train data and the test data, but I am getting different numbers of columns, and thus I cannot run my model against the test data.
The only difference I can find is that the test data has one fewer column, since it lacks the dependent variable/y value. I am using the data from a housing price Kaggle comp.
Let me know what could cause the mismatch in column numbers! Thank you.
In RandomForestRegressor, if n_estimators > 1, what is the size of each individual tree? How many training samples are used to construct each tree?
The size of each tree depends on max_depth. If max_depth=None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. As for the number of samples: by default each tree is fit on a bootstrap sample the same size as the training set, drawn with replacement, so roughly 63% of the rows end up unique within any one tree.
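You can check this empirically on your own data. This is just a sketch; X and y are placeholders for whatever training frame and target you are using:

```python
from sklearn.ensemble import RandomForestRegressor

# With max_depth=None each tree keeps splitting until its leaves are pure
# (or smaller than min_samples_split)
m = RandomForestRegressor(n_estimators=10, max_depth=None,
                          min_samples_split=2, n_jobs=-1)
m.fit(X, y)

# Inspect how big each individual tree actually got
print([est.tree_.node_count for est in m.estimators_])
print([est.tree_.max_depth for est in m.estimators_])
```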
I am getting the exact same issue on a Paperspace GPU machine, with the df, y, nas = proc_df(df_raw, 'SalePrice') command. I have shut down and restarted the Paperspace instance a few times, and obviously restarted the kernel several times.
I don't think this is the same issue I had; however, on a different dataset, when I ran proc_df on my training and validation sets, I ended up with a validation set that had 92 columns vs. my training set's 80 columns.
After a LOT of head scratching, I discovered that my validation set had a number of columns with missing values which didn't exist in my training data. This resulted in extra columns in my validation set with a "_na" extension on the name (e.g. "column_na").
As mentioned, not the same issue, but might give you some ideas on where / how to hunt.
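One way to keep the columns aligned is to reuse the na_dict that proc_df returns for the training set when processing the other set. This is a sketch assuming the fastai 0.7 proc_df from fastai.structured; df_raw and df_valid_raw are placeholders for your own frames:

```python
from fastai.structured import proc_df

# Build the na_dict on the training set
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')

# Reuse it on the validation set so the same _na indicator columns are created
df_val, _, _ = proc_df(df_valid_raw, 'SalePrice', na_dict=nas)

# If the validation set still grew _na columns the training set lacks,
# align it to the training columns (missing ones filled with 0)
df_val = df_val.reindex(columns=df_trn.columns, fill_value=0)
```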
Personal opinion, but if your dataset is small enough to run very quickly while you are developing your model, then I would not use set_rf_samples. Looked at from the opposite side, if your dataset is so large that every time you run it you have to stare at your screen waiting for it to process (i.e. you cannot easily interact with it), then building your initial model on a sample is really useful.
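For reference, the typical workflow looks something like this. It's a sketch assuming fastai 0.7's set_rf_samples/reset_rf_samples from fastai.structured; X_train and y_train are placeholders, and 20,000 is an arbitrary sample size:

```python
from fastai.structured import set_rf_samples, reset_rf_samples
from sklearn.ensemble import RandomForestRegressor

# While experimenting: every tree is grown on a random 20,000-row subsample,
# which speeds up fitting and lets you iterate interactively
set_rf_samples(20_000)
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
m.fit(X_train, y_train)

# For the final model: go back to bootstrapping the full dataset
reset_rf_samples()
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
m.fit(X_train, y_train)
```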
In Lesson 3, @jeremy discusses the concept of feature importance. Around 1:16:00, he shows us two plots. The first plot shows the feature importance with all the variables, and the second plot shows the feature importance with only the more important variables kept. I don't understand why the feature importance value of the variable Coupler System is lower in the second plot than in the first.
Regarding random forests, why is it that uncorrelated errors, when averaged out, lead to a low overall error? Why couldn't averaging uncorrelated errors lead to a high error instead? Can someone please explain?
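The usual intuition is that uncorrelated errors point in different directions, so they partially cancel: the variance of the mean of n uncorrelated, zero-mean errors is the single-tree variance divided by n. A tiny simulation illustrates this; the numbers below are made up for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 predictions from an "ensemble" of 100 trees whose errors are
# independent, zero-mean, and have standard deviation 1
errs = rng.normal(0, 1, size=(10_000, 100))

print(errs[:, 0].std())         # ~1.0: error spread of a single tree
print(errs.mean(axis=1).std())  # ~0.1: error spread after averaging 100 trees
```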
Hoping to get advice/guidance on how to handle large files so that I can run a random forest.
The data is 7GB and it's from a Kaggle comp called TalkingData AdTracking Fraud Detection Challenge. I was able to load the data by specifying the data types in a dictionary and passing that to read_csv(), but as soon as I started trying to process the data, I hit memory errors. Specifically, I tried running add_datepart() and to_feather(). For additional context, I am using Gradient on Paperspace with a GPU machine that has 30GB RAM and 8 cores. Given this, I was wondering what the best way is to process large files and run random forests.
From what I found on other forum threads, it seems like people are splitting the files, but I was hoping someone has encountered a specific example they can share here. Thank you!
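One common pattern is to read only the columns you need, with compact dtypes, and develop on a row-limited slice before scaling up. This is a sketch; the file name, column names, and dtypes are assumptions based on the TalkingData competition and may not match your data exactly:

```python
import pandas as pd

# Compact dtypes keep the frame several times smaller than pandas' defaults
dtypes = {
    'ip': 'uint32', 'app': 'uint16', 'device': 'uint16',
    'os': 'uint16', 'channel': 'uint16', 'is_attributed': 'uint8',
}

df = pd.read_csv(
    'train.csv',
    dtype=dtypes,
    usecols=list(dtypes) + ['click_time'],
    parse_dates=['click_time'],
    nrows=10_000_000,   # develop on a slice first, scale up later
)
```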
ML Lesson 1: I perform the same steps on the provided test data, including train_cats, but while predicting, the model still sees some string data in the test set. How do I get around that?
How do I change the actual test set into categorical variables? I apply train_cats() on the test set, but when I run m.predict(test), it shows that the strings are unchanged.
I think he says somewhere that when you look at the dataframe it will still show the values in string format, but when you call m.predict it will actually use the numbers. I remember it being somewhere in the lectures; I will look tonight.
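For what it's worth, the usual pattern in the course notebooks is to map the test set with apply_cats and then numericalise it with proc_df before predicting. This is a sketch assuming fastai 0.7's structured module; df_raw, df_test, and the 'SalePrice' field are placeholders for your own data:

```python
from fastai.structured import train_cats, apply_cats, proc_df

train_cats(df_raw)              # build the category mappings on the training set
apply_cats(df_test, df_raw)     # reuse those exact mappings on the test set

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')   # categories -> integer codes
df_tst, _, _ = proc_df(df_test, na_dict=nas)        # same codes, same _na columns

preds = m.predict(df_tst)       # now every column is numeric
```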