During lesson 5, Jeremy mentioned that when setting up dummy variables for tabular data, we include a constant if we use k-1 dummy variables, whereas with k dummy variables we don’t need one. Why do we need a constant with k-1 dummy variables?
Let’s consider an example where we have a table of values with dependent variable income (y) and independent variables age (x1) and marital status (x2). Age takes numerical values, and marital status can be single, married, or divorced.
Let us introduce k = 3 dummy variables for marital status: x21 = 1 if single, x22 = 1 if married, and x23 = 1 if divorced (each is 0 otherwise). Our regression function will look like:
y = a1*x1 + a21*x21 + a22*x22 + a23*x23
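Every row has exactly one marital status, so x21 + x22 + x23 = 1, which means x23 = 1 - x21 - x22. If we choose to use k - 1 = 2 dummy variables, we can substitute this to eliminate x23 from the above equation and rewrite it as: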
y = a1*x1 + (a21 - a23)*x21 + (a22 - a23)*x22 + a23
or, y = a1*x1 + a21'*x21 + a22'*x22 + a23, where a21' = a21 - a23 and a22' = a22 - a23
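Now the a23 coefficient becomes the bias (the constant), which we’ll have to estimate. It is the baseline for the dropped category (divorced), while a21' and a22' measure how single and married differ from that baseline. With all k dummies, each category already carries its own baseline in a21, a22, and a23, so no separate constant is needed.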
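To check this numerically, here is a minimal sketch using NumPy on synthetic data. The data, the "true" coefficients, and the variable names are all made up for illustration and are not from the lesson. It fits the k-dummy model without a constant and the (k-1)-dummy model with a constant, then compares them. It also builds the design matrix with k dummies plus a constant to show that its columns are linearly dependent.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; all data below is synthetic

n = 200
age = rng.uniform(20, 60, size=n)      # x1
status = rng.integers(0, 3, size=n)    # 0 = single, 1 = married, 2 = divorced
dummies = np.eye(3)[status]            # one-hot columns x21, x22, x23

# Hypothetical coefficients a1, a21, a22, a23, chosen only for this demo
true_coef = np.array([0.5, 10.0, 15.0, 12.0])
X_k = np.column_stack([age, dummies])  # k = 3 dummies, no constant
income = X_k @ true_coef + rng.normal(0.0, 1.0, size=n)

# Model A: k dummies, no constant
coef_k, *_ = np.linalg.lstsq(X_k, income, rcond=None)
a1, a21, a22, a23 = coef_k

# Model B: k - 1 dummies (x23 dropped) plus a constant column
X_km1 = np.column_stack([age, dummies[:, :2], np.ones(n)])
coef_km1, *_ = np.linalg.lstsq(X_km1, income, rcond=None)
b1, b21, b22, bias = coef_km1

# Both models should give the same fitted values ...
same_predictions = np.allclose(X_k @ coef_k, X_km1 @ coef_km1)
# ... and Model B's coefficients should be (a1, a21 - a23, a22 - a23, a23)
coefs_match = np.allclose([b1, b21, b22, bias],
                          [a1, a21 - a23, a22 - a23, a23])

# k dummies AND a constant: the dummy columns sum to the constant column,
# so the design matrix has fewer independent columns than columns
X_trap = np.column_stack([age, dummies, np.ones(n)])
rank_trap = np.linalg.matrix_rank(X_trap)

print("same predictions:", same_predictions)
print("coefficients match:", coefs_match)
print("columns:", X_trap.shape[1], "rank:", rank_trap)
```

Both checks should come out True, and the last line should report a rank one lower than the number of columns. That is why we use either k dummies without a constant or k-1 dummies with one, but not k dummies together with a constant.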