In lesson 2, the coefficient of determination is explained. I am having a tough time building an intuition about this concept.
The doubts I have are as follows:
- Firstly, what is the point of squaring the differences in the calculation of SS(tot) and SS(res)? What does it actually say about the data?

In the second video, someone in the audience stated that R2 determines the percentage of variation that the model covers. Is this true?

Could someone explain this in detail or point me to somewhere I could read about it?

We square the differences in the calculation of SS(tot) and SS(res) for the same reason we square differences when calculating the standard deviation of any data. Squaring makes sure all our errors are positive and do not cancel each other out; without it, a set of entirely wrong predictions could have zero net error, which is undesirable. You might be tempted to take the absolute value of the errors instead, but squaring is actually better: it is differentiable everywhere and penalizes large errors more heavily. Also, if you’ve read about Euclidean distance, which is a special case of Minkowski distance, you’ll know that sums of squared differences have several desirable mathematical properties for later calculations.
You can read more about Euclidean distance here; it is a great source for more insight into a similar problem.
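A quick sketch of the cancellation problem, with made-up numbers (plain Python, nothing from the lesson):

```python
# A model that is wrong on every point, yet whose signed errors sum to zero.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [15.0, 15.0, 35.0, 35.0]

residuals = [t - p for t, p in zip(y_true, y_pred)]

print(sum(residuals))                  # 0.0  -- signed errors cancel, hiding the misfit
print(sum(r ** 2 for r in residuals))  # 100.0 -- squared errors cannot cancel
```

The raw residuals are [-5, 5, -5, 5]: their sum is zero even though every prediction is off by 5, while the sum of squares correctly reports a large total error.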

Quoting Jeremy himself,

R² is the ratio between how good your model is (RMSE) vs. how good the naïve mean model is (RMSE)

In that sense, the higher the R² score, the better your model is compared to the simplest model possible, which just predicts the mean of the data. I am unsure what you mean by “variation that the model covers”; if you mean the variance of the data, then (for a fixed residual error) the higher the variance of the data, the higher the R², because mathematically:
R² = 1 - SS(res)/SS(tot)
where SS(tot) is proportional to the variance of the data.
Read more about variance on Wikipedia.
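Here is that formula computed by hand on hypothetical data (the numbers below are made up for illustration):

```python
# R^2 = 1 - SS(res)/SS(tot), computed directly from the definitions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.5, 8.5]

mean_y = sum(y_true) / len(y_true)

# SS(tot): squared distance of the data from its own mean (n times the variance).
ss_tot = sum((y - mean_y) ** 2 for y in y_true)

# SS(res): squared distance of the data from the model's predictions.
ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))

r2 = 1 - ss_res / ss_tot
print(r2)  # 0.95
```

Here SS(tot) = 20 and SS(res) = 1, so the model leaves only 5% of the total squared deviation unexplained.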

I also struggled with this. So, after thinking it over for a long time, here’s how I think about the “coefficient of determination”. It is calculated as:

1 - (Error rate of your model)/(Error rate of Simplest model)

where the simplest model just outputs the mean/average value as its prediction,
and your model is the current model – using RF, SGD, XGBoost, etc.

The value can be anything less than or equal to 1. Some values have special significance:

If the value is 1, then your model is predicting every value perfectly. There is a high chance your model is overfitting; if you run it against your validation set, the score will likely drop sharply.

If the value is 0, then your model is no better than the mean model. So instead of running a Random Forest and spending CPU time, you could just predict the average everywhere.

If the value is less than 0, then the model is worse than just submitting averages. So, again, instead of spending CPU cycles, you are better off predicting the average value.

So, you will want the value to be between 0 and 1.

If you think about it, a value between 0 and 1 measures the fraction of the data’s variance that our model explains.
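The three regimes above can be checked with a toy example (made-up numbers; the helper below just applies the R² formula by hand):

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS(res)/SS(tot), computed from the definitions."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
mean_y = sum(y) / len(y)

print(r2(y, y))                     # 1.0  -- perfect predictions
print(r2(y, [mean_y] * len(y)))     # 0.0  -- the "predict the mean" baseline
print(r2(y, [4.0, 3.0, 2.0, 1.0]))  # -3.0 -- worse than predicting the mean
```

Note how the mean model scores exactly 0 by construction, since then SS(res) equals SS(tot); anything negative means the model adds error on top of that baseline.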