Lesson 5 official topic

Indeed, if you have only numerical features, such as price, you most definitely want to include a constant.

On the other hand, if you have a mix, let’s say price and sex, then one-hot encoding the sex variable (and including all the categories, both male and female, in the equation) means you can drop the constant,

e.g.

y = b1 * sex_male + b2 * sex_female + b3 * price

Instead of

y = a + b1 * sex_male + b2 * sex_female + b3 * price

In fact, alternatively, you could also do:

y = a + b1 * sex_male + b3 * price

In the latter you include the constant but remove one of the one-hot encoded categories (in this case female). This works because sex_male + sex_female = 1 for every row, so the two dummy columns together already act as a constant; keeping both of them plus a separate constant would make the model redundant.

However, when you have multiple categorical features, it becomes a bit complicated to keep track of what to keep and what to drop, so I would always stick to the former way of encoding:

if you have at least one categorical variable that you want to one-hot encode: do that, include all the categories, and drop the constant.
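For illustration, a minimal sketch of the two options in pandas (the column names and data here are made up for the example; get_dummies with drop_first=True is what drops one category per variable):

import pandas as pd

df = pd.DataFrame({
    "sex":   ["male", "female", "female", "male"],
    "price": [10.0, 12.5, 9.0, 11.0],
})

# Option 1: keep all categories, drop the constant
#   y = b1*sex_male + b2*sex_female + b3*price
X1 = pd.get_dummies(df, columns=["sex"])

# Option 2: drop one category (female), keep the constant
#   y = a + b1*sex_male + b3*price
X2 = pd.get_dummies(df, columns=["sex"], drop_first=True)
X2["const"] = 1.0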


Now it’s clear. The mixing was the main confusing part that motivated the original question.

Thank you again


Or maybe one more follow-up question.
We can think about NNs as an extension of the linear model.

  1. Does the same rule of thumb apply here? If we have a linear layer, do we skip the bias according to that rule?
  2. Should we skip the bias in all linear layers or just in the first one?

My intuition tells me that we skip the bias just in the first layer, but I’m not sure. Do we have some mathematical proof of this, or can it just be determined empirically?

Yeah, good question, I’m not entirely sure. But I don’t think we skip the bias in that case, certainly not in the later layers.

Answer from ChatGPT:

When training a neural network, do we skip the bias in the first linear layer if we one-hot encode a categorical variable and include all the categories?

No, when training a neural network, we do not skip the bias term in the first linear layer, even when using one-hot encoding for categorical variables. The bias term is an important parameter in the linear transformation of the input features, which helps the model to fit the data better by shifting the activation function’s output.

In a neural network, the bias term is an additional learnable parameter that is added to the weighted sum of the inputs in each layer. The bias term allows the model to learn an offset from the origin, which can be critical for fitting the data. Without the bias term, the model may not fit the data accurately.

Therefore, in the case of one-hot encoding, we include all the categories, and each category gets its own weight and bias parameters. The bias parameter allows the model to adjust the output for each category independently, accounting for differences between the categories that cannot be captured by the weights alone.

In summary, the bias term is an essential parameter in a neural network that should not be skipped, even when using one-hot encoding for categorical variables.

:slight_smile:
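(For what it’s worth, in a framework like PyTorch this choice is just a per-layer flag; a minimal sketch with made-up sizes, showing how you could drop the bias in the first layer only:)

import torch.nn as nn

n_in, n_hidden = 12, 20

model = nn.Sequential(
    nn.Linear(n_in, n_hidden, bias=False),  # first layer: no additive constant
    nn.ReLU(),
    nn.Linear(n_hidden, 1),                 # later layers keep their bias (the default)
)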


Hey @lucasvw ,

Actually @jeremy answered my NN question in the recording: check this fragment out: Lesson 5: Practical Deep Learning for Coders 2022 - YouTube (the link points to the exact minute).
It seems he did exactly what my intuitive answer suggested (do not include the bias/constant just in the first layer).

But a few minutes later, in the deep learning example, he added the constant to the first layer as well (probably to keep the code simpler, plus a bias in the first layer might not make that much of a difference).
Tricky topic :smiley:


Hello Learners,

I have tried to run Jeremy’s notebook “Why you should use a framework”,
and I have also experimented and documented my learnings in this Kaggle notebook.
Hope you find it useful.

Let me know if you have any feedback.

Hi,

Is there a reason why Jeremy speaks of ‘coefficients’ or ‘coeffs’ instead of ‘weights’, and of a ‘constant’ instead of ‘bias’, in lesson 5?
I assume that’s what coeffs and constant are, eventually: weights and bias?

(Also, I think at some point he speaks of ‘hidden activations/activators’ and then of ‘hidden layers’; are these the same things?)


And one other question:
to turn the linear model into a neural network, Jeremy adds one additional layer2 (and one const):


def init_coeffs(n_hidden=20):
    layer1 = (torch.rand(n_coeff, n_hidden)-0.5)/n_hidden  # shape (n_coeff, n_hidden)
    layer2 = torch.rand(n_hidden, 1)-0.3                   # shape (n_hidden, 1)
    const = torch.rand(1)[0]                               # scalar constant for the output
    return layer1.requires_grad_(), layer2.requires_grad_(), const.requires_grad_()

Am I right that the reason to have this new layer2 is to be able to bring the result res back to a tensor with shape (n_coeffs,1) in the calc_preds function (res = res@l2 + const)?


I think Jeremy speaks of ‘coefficients’ and ‘constants’ instead of ‘weights’ and ‘bias’ because he is talking in a linear-model context, where each variable is multiplied by a coefficient and we add a constant. The terms weights and biases are the general ones, but we can use coefficients and constants too.


So calc_preds(coeffs, indeps) takes two arguments, coeffs and indeps:
coeffs is a tuple containing l1, l2 and const.
indeps is our input, and its shape here is (713,12).
The shape of l1 is (12,20).
The shape of l2 is (20,1).
First we do res = F.relu(indeps@l1); that’s a matrix multiplication, (713,12)x(12,20), so the shape of res is now (713,20).
Then we do res = res@l2 + const; here again we have another matrix multiplication, (713,20)x(20,1), so res’ shape is now (713,1).

So the final shape is (n_indeps,1) and not (n_coeffs,1).

Ah yes, my question was more about why that second layer was added.

This should be (713,1) instead of (713,20), though, I think:

then we do res = res@l2 + const here again we have another matrix multiplication (713,20)x(20,1) so res’ shape is now (713,20)

So the purpose of l2 is to ‘flatten’ the output of l1 down to (…, 1), I assume.


Yes, sorry, (713,20)x(20,1) gives (713,1).
Also, the purpose of additional layers in neural networks is not to flatten the result; if we wanted that, we could have used a single layer mapping the 12 inputs straight to one output (shape (12,1)).
The purpose of additional layers in neural networks is to give the model more “freedom” and more parameters to learn, because the more parameters a model has, the more “capable” it is of learning complex tasks.
The shapes of those layers need only respect one rule: each layer’s input size must equal the previous layer’s output size. And the final layer’s output must be coherent with what we want to predict.
If, for example, we want to predict a single value, then the last layer must have a shape of (…, 1). In the case of multiclass classification (like MNIST digit classification), the last layer has shape (…, 10).
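Putting the pieces of this thread together, here is a rough sketch of the calc_preds being discussed (reconstructed from the shapes above, so treat it as illustrative rather than the exact notebook code):

import torch
import torch.nn.functional as F

def calc_preds(coeffs, indeps):
    l1, l2, const = coeffs
    res = F.relu(indeps @ l1)   # (713,12) @ (12,20) -> (713,20), the hidden activations
    res = res @ l2 + const      # (713,20) @ (20,1)  -> (713,1), one value per row
    return torch.sigmoid(res)   # squash each value into (0,1)

Note how the inner sizes line up ((…,12)@(12,20), then (…,20)@(20,1)): that’s the one shape rule in action.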


Hi Team,

I just finished lesson 5 (training a neural net from scratch) in part 1 of the course. I was playing around with creating a neural network with 5 layers for the Titanic dataset, and I observed something very weird. Even though I am using the same learning rate, the model sometimes reaches convergence and sometimes it does not. For example, in the first run the model achieved an accuracy of 80% after 25 epochs, while in the second it did not and just predicted the 0 label for all rows. Does this imply that the random initialisation of weights has an impact on model convergence? How can this be addressed?


Hey guys, just watching the lesson 5 lecture, and I have a doubt.

While creating the linear model, after we had done one forward pass, we normalized the independent-variables tensor so that it is not dominated by the Age column. What is the difference between doing it then and just doing train_df.Age /= train_df.Age.max() while we are preprocessing our data, before our linear model has even begun being set up?
In practice, would these two steps lead to different results?

No, that’s not normal; the random weight initialization should give small changes in the results, but not to the degree of predicting all zeros. Can you share your code?
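(One suggestion while debugging, just my own habit rather than anything from the lesson: fix the random seeds so runs are repeatable, then change one thing at a time:)

import random
import numpy as np
import torch

def set_seed(seed=42):
    # Makes the weight initialization (and any shuffling) repeatable across runs
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(42)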

I guess that would be the same; actually, it’s better to do it while preprocessing the data.
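For instance, a minimal sketch of scaling during preprocessing (the column names here are assumptions; apply the same training-set maximums to the test set so both are scaled identically):

# Scale numeric columns to [0,1] using the training set's maximums
for col in ["Age", "Fare"]:
    m = train_df[col].max()
    train_df[col] = train_df[col] / m
    test_df[col] = test_df[col] / m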

Hey guys, I was re-implementing the Lesson 5 notebook when I had a doubt about something.

When using RandomSplitter(), Jeremy uses this notation:
RandomSplitter(seed=42)(df)

Can someone explain to me what this notation is, and why we do it like that? I don’t get why or how we are able to write ...)(df), because normally, programming something like this in Python would return an error.
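(In case it helps anyone who finds this later: RandomSplitter(seed=42) returns a function, and the second pair of parentheses immediately calls that returned function on df. A minimal sketch of the pattern, not fastai’s actual implementation:)

import random

def random_splitter(valid_pct=0.2, seed=None):
    # Calling random_splitter(...) doesn't split anything yet;
    # it just builds and returns the function that will do the split.
    def _inner(items):
        rng = random.Random(seed)
        idxs = list(range(len(items)))
        rng.shuffle(idxs)
        cut = int(len(items) * valid_pct)
        return idxs[cut:], idxs[:cut]   # (train indices, valid indices)
    return _inner

split = random_splitter(seed=42)                  # this is a function
train_idx, valid_idx = split(list(range(100)))
# equivalent one-liner: random_splitter(seed=42)(list(range(100)))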

Hey team, I am following along with Jeremy’s “Why you should use a framework” notebook, and I have one question.

In the “Submit to Kaggle” section, we use the test dataset to calculate predictions. When we use learner.get_preds, it returns a tuple of predictions and labels (we discard the labels, as we don’t have them in the test dataset and don’t need them).

I checked the shape of preds and a few values. And they look like below.

I see Jeremy uses all values in the second column as the prediction value.

I am not sure how to figure out which column to use. Maybe I missed something here, but I’m not quite sure how to think through choosing the right column of preds. I looked into the source code of the learner’s get_preds, but it is a bit over my head.

Can anyone help me here to understand this better? :slight_smile:

The preds variable contains two columns: the first column contains the probability of class 0, whereas the second column contains the probability of class 1.

preds[0:10] returns the first 10 rows and all columns, while preds[:, 1] returns all rows and the second column.
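A minimal sketch of how that second column typically gets used for the submission (the test_df name is an assumption; dls.test_dl and get_preds are the fastai calls from the notebook):

# Build a test DataLoader and get per-class probabilities
tst_dl = learn.dls.test_dl(test_df)
preds, _ = learn.get_preds(dl=tst_dl)

# Column 0 = P(class 0), column 1 = P(class 1), i.e. P(Survived);
# threshold at 0.5 to turn probabilities into labels
survived = (preds[:, 1] > 0.5).int()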
