Lesson 2 - Have a doubt about calculation of R2

kiranh · October 31, 2018, 6:02am

Hello,

In the second lesson on Random forests, the professor has the following piece of code:

import math

def rmse(x,y): return math.sqrt(((x - y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train),y_train), rmse(m.predict(X_valid),y_valid),
          m.score(X_train,y_train), m.score(X_valid,y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

The line m.score(X_train,y_train), m.score(X_valid,y_valid) is supposedly calculating R2 (the coefficient of determination). My understanding is that R2 measures how good our model is compared to a model that only predicts the mean of the target variable.
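
To make that understanding concrete, here is a minimal sketch in plain Python of the textbook R2 formula, 1 - SS_res / SS_tot. The function name r_squared and the toy numbers are my own illustration and are not taken from the lesson or from the library. In this sketch the mean comes from whatever true values are passed in.

```python
def r_squared(y_true, y_pred):
    """Textbook R2: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)  # mean of the true values passed in
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # error of a mean-only model
    return 1 - ss_res / ss_tot


# Hypothetical toy target values, chosen only for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)                  # predictions match exactly
mean_only = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
reversed_ = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])  # worse than the mean

print(perfect, mean_only, reversed_)
```

On this reading, a score of 1 means perfect predictions, 0 means no better than always predicting the mean, and a negative score means worse than the mean.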

My question is: Why does m.score take the training dataset as the first parameter? Is it because it needs to calculate the mean of the target variable from the training set? Also, when you pass the training dataset as the first parameter, does it compute the predictions for the target variable again, since it needs those predictions to calculate R2?

Kindly clarify.

Regards,
Kiran Hegde