Another treat! Early access to Intro To Machine Learning videos

Yes, I find it has many gems (i.e. tricks for doing things faster and/or better) that Jeremy has personally collected over 25 years of ML practice and that you won't find in textbooks or elsewhere.

At 30:35 of lesson 2, Jeremy gets a random sample of 30k rows. He then says the validation set should not change, and that the training set should not overlap with the dates (not sure which dates he is referring to).

The original validation set is made up of the last 12k rows. Since proc_df is run on a subset of 30k random rows, isn't it possible that some of the new, smaller training data consists of rows from the validation set? Furthermore, I would think that the smaller training set is no longer necessarily ordered by date, since rows were picked at random.

edit: I checked the source code, and get_sample returns the data in sorted order, so that addresses the ordering question. I still think it's possible that the training data could overlap with the original validation set.
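
A minimal sketch of one way to avoid the overlap, assuming df_raw is the full frame sorted by date and the last 12k rows are the validation set (the variable names here are placeholders, not the lesson's exact code):

import numpy as np

n_valid = 12000                      # size of the original (most recent) validation set
n_trn_total = len(df_raw) - n_valid  # rows that are safe to sample from

# sample only from the rows *before* the validation window, then sort the
# indices so the subset stays in date order (like get_sample does)
idxs = np.sort(np.random.permutation(n_trn_total)[:30000])
df_sample = df_raw.iloc[idxs].copy()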

@Callan99
Change in line 15 of text.py:
from
texts.append(open(fname, 'r').read())
to
texts.append(open(fname, 'r', encoding='utf-8').read())


@jeremy I am confused about whether to do the machine learning course or the deep learning course first. Which do you think would be better to do first?

They are different. If you don’t have any experience with dataset manipulation, cleaning and validation set creation, do the ML1 course first because that knowledge is assumed in the DL1 course. Personally I felt like it worked well for me to do ML1 followed by DL1. Also, fyi Jeremy has requested not to be personally tagged in posts unless he is the only person who can answer the question.


@jeremy

Thanks for loading the rest of the ML class videos - they are really great. Will the notebooks for ML lessons 6-12 be released on GitHub?

One silly question: I've just completed the first part of the ML course. I'm really compelled to ask when the second part will be available, even in a non-official way? Thanks so much for all these courses.

There’s no 2nd part of the MOOC - just a 2nd part for masters students at USF.

Oh, I was really looking forward to it :sweat_smile:. Thank you so much for all the courses. I'm done with ML and DL1, and they're among the best ML courses I've ever taken.

There is a small inconsistency/bug in both:

ml1/lesson2-rf_interpretation.ipynb
ml1/lesson3-rf_foundations.ipynb

this:

df_raw = pd.read_feather('tmp/raw')

should be replaced by:

df_raw = pd.read_feather('tmp/bulldozers-raw')

since this is the path ml1/lesson1-rf.ipynb used to save the data. Alternatively, the first notebook should save its data as ‘tmp/raw’.

update: two more notebooks have the same issue:

bulldozer_dl.ipynb
bulldozer_linreg.ipynb

so it's probably best to just fix the first notebook (ml1/lesson1-rf.ipynb) to save the data as ‘tmp/raw’ instead of changing four notebooks. On the other hand, ‘tmp/raw’ could collide with another lesson that may use tmp/raw.
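
For reference, a minimal sketch of the matching save/load pair, assuming the tmp/ layout from lesson1-rf.ipynb:

import os
import pandas as pd

# in ml1/lesson1-rf.ipynb: save the processed dataframe to feather
os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')

# in the follow-up notebooks: load it back using the same path
df_raw = pd.read_feather('tmp/bulldozers-raw')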

Thanks.

Hi, @yinterian. Did you publish the Jupyter notebooks for the other lectures given at USF? If so, could you please tell us where? Any learning materials are much appreciated!

Hey Jeremy

I am a novice at machine learning; will this course help me get the basics right?

A little bit of acquaintance with ML terminology will help…

Also, guys, don't use @ Jeremy…

@mayank.ai,
The short answer is yes.
This course will give you the basics of ML in all their glory. It's designed to be a good learning platform for understanding ML fundamentals / Random Forests / Decision Trees / Naive Bayes / Logistic Regression in great detail. Those practical details are quite hard to find anywhere else. The latter part of the course introduces you to PyTorch, which will help with Part 1 of the deep learning course.

You will surely be able to appreciate what's mentioned in the books suggested in the Lesson 1 notebook after going through this course.

Interesting: I changed the parallel_trees function to use multiple threads instead of processes. It works, though I'm not sure why.

def parallel_trees(m, fn, n_jobs=8):
    return list(ProcessPoolExecutor(n_jobs).map(fn, m.estimators_))

to

from concurrent.futures import ThreadPoolExecutor

def parallel_trees(m, fn, n_jobs=8):
    return list(ThreadPoolExecutor(n_jobs).map(fn, m.estimators_))
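
For context, the lesson notebooks call this helper roughly like so (X_valid below stands in for whatever validation frame you are using):

import numpy as np

def get_preds(t):
    # predict with a single tree of the forest on the validation set
    return t.predict(X_valid)

# stack the per-tree predictions: shape (n_trees, n_valid_rows)
preds = np.stack(parallel_trees(m, get_preds))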

I'm working on building a random forest and I am getting really terrible results. I tried to just follow the lesson 1 steps, but my data is giving me an r^2 of -0.14. My question is: what are the next steps at that point? Is it back to the drawing board, or is there something I can still do? One thing I'm thinking is that I might be able to do some sort of oversampling, since 3.5% of my data is equal to 1 and the rest is 0. Does that seem like a good next step with this terrible a starting r^2? Or should I be looking at my data quality and making sure I at least get a decent r^2 out of the gate before trying anything extra?

Any advice on this would be much appreciated. Here is what my numbers look like:

m = RandomForestRegressor(n_jobs=-1)
m.fit(X_trn, y_trn)
print_score(m)
[0.08436980063282429, 0.192801180918272, 0.7956208204063915, -0.14288877498635988]

After looking into this further, I think the problem is with how r^2 is calculated compared to what I actually care about. Only 3.5% of my values are 1 so when I look at the predictions, I see a lot of Actual = 1 Predicted = 0.1.

This leads me to believe I should try to oversample to bring the proportion of 1s up to 50%, and then adjust the threshold later when deciding what actually counts as a "1".

On the bright side, I am getting a great r^2 from the RandomForestClassifier, but unfortunately the reason it's good is that it predicts almost all of them as 0.
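
A minimal sketch of the thresholding idea, assuming a training/validation split X_trn, y_trn, X_valid, y_valid (class_weight='balanced' is just one alternative to oversampling, not something from the lesson):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# weight the rare positive class instead of (or in addition to) oversampling it
m = RandomForestClassifier(n_estimators=100, n_jobs=-1, class_weight='balanced')
m.fit(X_trn, y_trn)

# work with the predicted probability of class 1 and tune the decision threshold
# rather than relying on the default 0.5 cut-off
probs = m.predict_proba(X_valid)[:, 1]
for thresh in (0.1, 0.2, 0.3, 0.5):
    preds = (probs >= thresh).astype(int)
    print(f'threshold={thresh}: f1={f1_score(y_valid, preds):.3f}')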

I noticed that somewhere in Lesson 2 or 3 you mentioned that if we use set_rf_samples(), we should not be using oob_score. Should that rule still be followed, or has that issue been fixed?

While going through this discussion forum, I came across a few discussions about the bootstrap argument of the RandomForestRegressor() function and also about set_rf_samples().

I also misunderstood it in the beginning, and reading the conversations just got me more confused. So, I decided to dig a bit deeper into the fast.ai and sklearn source code and came up with the following conclusions:

n = no_of_rows_in_dataframe
if (bootstrap == false) {
  then all `n` rows are considered exactly once per tree for training
}
else if (set_rf_samples(k) is used) {
  then `k` rows are selected per tree for training & there might be some repetitions of rows
}
else {
  then `n` rows are selected per tree for training & there might be some repetitions of rows
}
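
A quick sketch of how the three cases are set up in practice, assuming the fastai 0.7 structured module used by the course notebooks (where set_rf_samples / reset_rf_samples are defined):

from sklearn.ensemble import RandomForestRegressor
from fastai.structured import set_rf_samples, reset_rf_samples  # fastai 0.7 helpers

# bootstrap=False: every tree sees all n rows exactly once
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, bootstrap=False)

# set_rf_samples(k): every tree sees k rows, sampled with replacement
set_rf_samples(20000)
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)

# default bootstrap: every tree sees n rows, sampled with replacement
reset_rf_samples()
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)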

Also, there was some ambiguity around the oob_score calculation. So, after exploring a bit, here's what I concluded:

/**************************************************************************************	
	for simplicity assuming output corresponding to each input is a single number.
	So, y.shape = (n, 1)
	y = actual outputs
	n = no_of_rows_in_data_frame
	
	For cases with an output vector, the oob_score can be calculated by simply taking the average of the oob_score of each column of the vector.
****************************************************************************************/

total_prediction = zero_matrix of dimension (n x 1) /* used to accumulate total predictions for each row (by different trees in the forest) which will later be averaged */
  
no_of_predictions = zero_matrix of dimension (n x 1) /* total number of predictions for each row (which also represents total number of trees in which each row is Out-Of-Bag), used for averaging later */
  
for (tree in forest) {
    out_of_bag_samples = all_rows - set(rows used by `tree` for training)
    total_prediction[out_of_bag_samples] += tree.predict(out_of_bag_samples)
    no_of_predictions[out_of_bag_samples] += 1
}

predictions = total_prediction / no_of_predictions
oob_score = r2_score(y, predictions)

For the exact code of the oob_score calculation, refer here
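
To sanity-check the pseudocode above against sklearn itself, a minimal sketch (oob_score=True only makes sense while bootstrapping):

from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_trn, y_trn)

print(m.oob_prediction_)  # averaged out-of-bag prediction per training row
print(m.oob_score_)       # r2_score(y_trn, m.oob_prediction_), as in the pseudocode above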


Great resources.
Thank you, @jeremy sir.
You rock :wink:

I've just published the Lecture 1 Notes (with Jeremy's permission).
I hope these are helpful; please do point out anything that could be corrected or improved.

I believe I'll be able to share all the notes before the end of this month.

Sanyam.
