During lesson 5, Jeremy mentioned that when setting up dummy variables for tabular data, we include a constant for k-1 dummy variables. Meanwhile, if you have k dummy variables, you don’t need a constant. Why do we have to have a constant for k-1 dummy variables?

Hi,

Let’s consider an example where we have a table of values with dependent variable `income` (y) and independent variables `age` (x1) and `marital status` (x2). `age` takes numerical values and `marital status` can be single, married, or divorced.

Let us introduce `k = 3` dummy variables for `marital status`: x21 = 1 if single, x22 = 1 if married, and x23 = 1 if divorced. Our regression function will look like:

`y = a1*x1 + a21*x21 + a22*x22 + a23*x23`
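To make the encoding concrete, here is a minimal sketch of how the three dummy columns could be built with NumPy. The five-row `status` array is made-up toy data, and the names `status`, `categories`, and `dummies` are only for illustration.

```python
import numpy as np

# Hypothetical toy data: marital status for five people.
status = np.array(["single", "married", "divorced", "married", "single"])
categories = ["single", "married", "divorced"]  # k = 3 levels

# One 0/1 column per level: x21 (single), x22 (married), x23 (divorced).
dummies = np.stack([(status == c).astype(float) for c in categories], axis=1)

# Exactly one dummy is 1 in every row, so each row sums to 1.
row_sums = dummies.sum(axis=1)

print(dummies)
print(row_sums)
```

The fact that every row sums to 1 is what lets us trade one dummy for a constant.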

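If we choose to use k - 1 = 2 dummy variables, we can eliminate x23 from the above equation. Exactly one of the three dummies equals 1 in each row, so `x23 = 1 - x21 - x22`. Substituting this in, we can rewrite the equation as: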

`y = a1*x1 + (a21 - a23)*x21 + (a22 - a23)*x22 + a23`

or, writing `a21' = a21 - a23` and `a22' = a22 - a23`:

`y = a1*x1 + a21'*x21 + a22'*x22 + a23`

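Now the coefficient `a23` becomes the bias (the constant term), which we’ll have to estimate. It is the baseline for the dropped category (divorced), while `a21'` and `a22'` are the offsets for single and married relative to that baseline. So the constant is not an extra parameter: it takes the place of the dummy we dropped, and both forms have the same number of coefficients.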
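If you want to check the algebra numerically, here is a rough sketch using ordinary least squares via `np.linalg.lstsq`. The data is synthetic, the "true" coefficients are invented for the demo, and this is not necessarily how the lesson notebook fits the model. It also shows why you would not use k dummies and a constant together: the design matrix then has linearly dependent columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data (all values here are made up for illustration).
age = rng.uniform(20, 60, size=n)
status = rng.integers(0, 3, size=n)   # 0 = single, 1 = married, 2 = divorced
D = np.eye(3)[status]                 # dummy columns x21, x22, x23
income = 0.5 * age + D @ np.array([10.0, 20.0, 15.0]) + rng.normal(0, 1, size=n)

# Form 1: k = 3 dummies, no constant.
X_k = np.column_stack([age, D])
a1, a21, a22, a23 = np.linalg.lstsq(X_k, income, rcond=None)[0]

# Form 2: k - 1 = 2 dummies plus a constant column of ones.
X_k1 = np.column_stack([age, D[:, :2], np.ones(n)])
b1, b21, b22, bias = np.linalg.lstsq(X_k1, income, rcond=None)[0]

# The two fits agree: the bias plays the role of a23,
# and the remaining dummy coefficients are offsets from it.
print(np.allclose(b1, a1), np.allclose(bias, a23))
print(np.allclose(b21, a21 - a23), np.allclose(b22, a22 - a23))

# k dummies AND a constant: 5 columns, but the dummies sum to the
# constant column, so the matrix is rank deficient.
X_both = np.column_stack([age, D, np.ones(n)])
rank_both = np.linalg.matrix_rank(X_both)
print(X_both.shape[1], rank_both)
```

With k dummies plus a constant, the coefficients are not uniquely determined, since you can add any value to the constant and subtract it from every dummy coefficient without changing the predictions.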