Lesson 4 - Official Topic

Could you please clarify the difference between DataLoader and DataLoaders?

Is there a fastai function that works out, for a given dataset, the largest batch size that fits in GPU memory?

As the s implies, DataLoaders is an object that holds several DataLoader objects: one for training and one for validation.


Simply put, DataLoaders is a wrapper for multiple DataLoader objects (e.g. the train and validation DataLoader).
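
As a rough illustration of the wrapper idea, here is a library-free sketch. `ToyDataLoader` and `ToyDataLoaders` are hypothetical names, not fastai classes; only the `.train` and `.valid` attribute names mirror what fastai's DataLoaders exposes.

```python
class ToyDataLoader:
    """Hypothetical stand-in: yields fixed-size batches from a list of items."""
    def __init__(self, items, bs):
        self.items, self.bs = list(items), bs

    def __iter__(self):
        for i in range(0, len(self.items), self.bs):
            yield self.items[i:i + self.bs]


class ToyDataLoaders:
    """Hypothetical wrapper that simply holds several loaders together."""
    def __init__(self, *loaders):
        self.loaders = list(loaders)

    @property
    def train(self):
        return self.loaders[0]   # by convention, the first loader is for training

    @property
    def valid(self):
        return self.loaders[1]   # and the second is for validation


dls = ToyDataLoaders(ToyDataLoader(range(10), bs=4),
                     ToyDataLoader(range(10, 14), bs=4))
train_batches = list(dls.train)
valid_batches = list(dls.valid)
```

The wrapper adds no batching logic of its own here; it only groups the loaders so that training code can be handed a single object.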

Why do you stack the 3s and 7s datasets on top of each other again?

That is not entirely true, as we have rewritten the PyTorch DataLoader in fastai.

You want to train your model on 3s and 7s together, or it won’t learn to differentiate between the two of them.
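
A minimal sketch of that idea, using plain Python lists instead of tensors. The tiny `threes` and `sevens` lists are made-up placeholders for the real images; the point is only that one combined dataset carries both classes, each item paired with its label.

```python
# Placeholder "images": in the lesson these are 28x28 image tensors.
threes = ["three_a", "three_b", "three_c"]
sevens = ["seven_a", "seven_b"]

# Stack both classes into one training set ...
train_x = threes + sevens
# ... and build matching labels: 1 for a 3, 0 for a 7.
train_y = [1] * len(threes) + [0] * len(sevens)

# Each item is now paired with its label.
dset = list(zip(train_x, train_y))
```

If the model only ever saw `threes`, every label would be 1 and there would be nothing to tell apart.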

Not yet, no. There was one in fastai v1, and you can probably import it while waiting for it to be ported to v2.

I don’t believe so. It doesn’t take too much trial and error to figure it out. GPU usage depends on both your dataset items and the model you choose.
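
The trial and error can be sketched as a simple halving loop. This is not a fastai function: `largest_batch_size` and the `fits` callback are hypothetical, and `MemoryError` is an assumed stand-in for whatever out-of-memory error your framework raises.

```python
def largest_batch_size(fits, start=512, minimum=1):
    """Halve the batch size until `fits(bs)` succeeds.

    `fits` is a hypothetical callback: it should run one training step at
    batch size `bs` and raise MemoryError when the batch is too large.
    """
    bs = start
    while bs >= minimum:
        try:
            fits(bs)
            return bs
        except MemoryError:
            bs //= 2          # too big: try half the size
    raise ValueError("even the minimum batch size does not fit")


# Fake probe for demonstration: pretend anything above 96 items runs out of memory.
def fake_fits(bs):
    if bs > 96:
        raise MemoryError


best = largest_batch_size(fake_fits)
```

Starting from a power of two and halving keeps the search to a handful of attempts, which is usually all the manual version takes as well.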

I think this was the one used in fastai v1:

Broadcasting was used when adding the bias to the result of multiplying the inputs by the weight matrix.

Can you think of other parts of the training process where broadcasting is used?
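
For anyone who wants to see what broadcasting does mechanically, here is a library-free sketch of adding a bias to a batch of outputs. `add_bias` is a hypothetical helper; with PyTorch tensors the `+` operator performs the equivalent expansion for you, without an explicit loop.

```python
def add_bias(activations, bias):
    """Add `bias` to every row of `activations`, as broadcasting would.

    `activations` is a list of rows (one per item in the batch) and `bias`
    has one value per column. The same bias values are reused for each row.
    """
    return [[a + b for a, b in zip(row, bias)] for row in activations]


batch_out = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # shape (3, 2)
bias = [10.0, 20.0]                                # shape (2,)
result = add_bias(batch_out, bias)
```

The smaller operand is conceptually stretched along the missing dimension so that the shapes match before the element-wise addition.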

In the simplest model of SGD (the function called “train_epoch”), the for loop iterates over “dl”, but that is not passed into the function. How does the function get that variable?

It’s defined in the notebook.
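
In other words, it is an ordinary Python global. A toy sketch of the lookup follows; `train_epoch_sketch` and the batch values are made up for illustration and are not the notebook's code.

```python
# Defined at the top level of the notebook, like `dl` in the lesson.
dl = [([1, 2], 0), ([3, 4], 1)]   # toy batches of (inputs, label)


def train_epoch_sketch():
    """`dl` is not a parameter, so Python looks it up in the enclosing
    scope at the moment the function runs."""
    seen = []
    for xb, yb in dl:
        seen.append((sum(xb), yb))
    return seen


first = train_epoch_sketch()

# Rebinding the name changes what the function sees on its next call.
dl = [([5, 6], 1)]
second = train_epoch_sketch()
```

This is convenient in a notebook, but it also means that re-running a cell that redefines `dl` silently changes what the function trains on.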

My maths is a bit rusty… Why the name linear? Doesn’t the bias make the matrix multiplication non-linear?

Think of it as y = mx + b

Technically, the bias makes it affine, but people still often say linear.


y = mx + b is still just a linear function: m is the slope, and b just shifts the line up and down.

y = mx^2 + b would be nonlinear because of the ^2

Edit: I never realized that linear was incorrect and should be affine, as Sylvain notes. I feel betrayed by conventional vocabulary.


What’s an affine function?

Just a linear function with an intercept?

Yes, but don’t get too distracted by the names :wink: They are not super important.
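
For the curious, the distinction can be checked numerically. In the strict mathematical sense a linear function must satisfy f(u + v) = f(u) + f(v), and an intercept breaks that. The slope and intercept values below are arbitrary picks for the sketch.

```python
def linear(x, m=2.0):
    return m * x            # no intercept


def affine(x, m=2.0, b=3.0):
    return m * x + b        # same slope, plus an intercept


def is_additive(f, u, v):
    """Strict linearity requires f(u + v) == f(u) + f(v)."""
    return f(u + v) == f(u) + f(v)


linear_ok = is_additive(linear, 1.0, 4.0)   # 10.0 == 2.0 + 8.0
affine_ok = is_additive(affine, 1.0, 4.0)   # 13.0 vs 5.0 + 11.0
```

The affine version counts the intercept twice on the right-hand side, which is exactly why it fails the check.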


With this non-linearity, a function that sets all negative outputs to zero, won’t many of the gradients in the network become zero and stall the learning process?
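
To make the question concrete, this sketch shows which gradients such a function zeroes. The sample values are made up, and `relu_grad` is written by hand here rather than taken from any library.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


pre_activations = [-2.0, -0.5, 0.3, 1.7]
outputs = [relu(x) for x in pre_activations]
grads = [relu_grad(x) for x in pre_activations]
```

Only the units whose input is negative for a given example pass back a zero gradient; units with positive input pass the gradient through unchanged.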