Could you please clarify the difference between DataLoader and DataLoaders?
Is there a fastai function that figures this out from the dataset, i.e. maximizes the batch size given GPU memory?
As the s implies, DataLoaders is an object holding several DataLoader objects: one for training and one for validation.
Simply put, DataLoaders is a wrapper for multiple DataLoader objects (e.g. a train and a validation DataLoader).
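A minimal sketch of that wrapping, assuming fastai v2's `DataLoaders` from `fastai.data.core` (the tensors below are just dummy placeholders for real data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from fastai.data.core import DataLoaders

# Dummy datasets standing in for real training and validation data
train_ds = TensorDataset(torch.randn(100, 2), torch.randint(0, 2, (100,)))
valid_ds = TensorDataset(torch.randn(20, 2), torch.randint(0, 2, (20,)))

train_dl = DataLoader(train_ds, batch_size=16, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=16)

# DataLoaders just bundles the individual DataLoader objects together
dls = DataLoaders(train_dl, valid_dl)
xb, yb = next(iter(dls.train))   # dls.train is the first DataLoader
```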
Why do you stack the 3s and 7s datasets on top of each other again?
That is not entirely true, as we have rewritten the PyTorch DataLoader in fastai.
You want to train your model on 3s and 7s together, or it won't learn to differentiate between the two of them.
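For instance, a minimal sketch in the spirit of the lesson (the tensors here are hypothetical stand-ins for the stacked, flattened 3 and 7 image tensors):

```python
import torch

# Stand-ins for the stacked, flattened 3 and 7 images
threes = torch.randn(6131, 28 * 28)
sevens = torch.randn(6265, 28 * 28)

# Stack both classes into one training set: 1 = "is a 3", 0 = "is a 7"
train_x = torch.cat([threes, sevens])
train_y = torch.tensor([1] * len(threes) + [0] * len(sevens)).unsqueeze(1)
```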
Not yet, no. There was one in fastai v1, and you can probably import it while waiting for it to be ported to v2.
I don't believe so. It doesn't take too much trial and error to figure it out. GPU usage depends on both your dataset items and the model you choose.
I think this was the one used in fastai v1:
Broadcasting was used when adding the bias vector to the result of the matrix multiplication.
Can you think of other parts of the training process where broadcasting is used?
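As one concrete spot, here is a minimal sketch of that bias broadcast in a simple linear model (the shapes are assumptions based on the MNIST example):

```python
import torch

xb = torch.randn(4, 28 * 28)      # a hypothetical mini-batch of 4 images
weights = torch.randn(28 * 28, 1)
bias = torch.randn(1)

# bias has shape (1,) but is broadcast across all 4 rows of the activations
preds = xb @ weights + bias
print(preds.shape)                # torch.Size([4, 1])
```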
In the simplest SGD example (the function called `train_epoch`), the for loop iterates over `dl`, but `dl` is not passed into the function. How does the function get that variable?
It's defined in the notebook, so the function picks it up as a global variable.
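A minimal sketch of that pattern (the names here are hypothetical, not the actual notebook code): since `dl` is not a parameter, Python resolves it from the surrounding notebook scope when the function runs.

```python
# dl is defined at notebook (module) level...
dl = [(1, 2), (3, 4)]   # hypothetical stand-in for a DataLoader

def train_epoch():
    # ...so this loop finds dl as a global variable, not an argument
    for xb, yb in dl:
        print(xb, yb)

train_epoch()
```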
My maths are a bit rusty… Why the name linear? Doesn't the bias make the matrix multiplication non-linear?
Think of it as y = mx + b
Technically, the bias makes it affine, but people still often say linear.
y = mx + b is still just a linear function: m is the slope, and b just shifts the line up and down.
y = mx^2 + b would be nonlinear because of the ^2
Edit: I never realized that "linear" was incorrect and should be "affine", as Sylvain notes. I feel betrayed by conventional vocabulary.
What's an affine function?
Just a linear one with an intercept?
Yes, but don't get too distracted by the names. They are not super important.
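If it helps, a quick numeric check (with made-up values for m and b): a strictly linear map must satisfy f(0) = 0 and f(x + y) = f(x) + f(y), and y = mx + b fails both whenever b != 0, which is why "affine" is the precise term.

```python
m, b = 3.0, 2.0
f = lambda x: m * x + b

print(f(0))                    # 2.0, not 0: f is affine, not strictly linear
print(f(1 + 1), f(1) + f(1))   # 8.0 vs 10.0: additivity fails too
```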
Regarding the non-linearity: won't using a function that sets all negative outputs to zero make many of the gradients in the network zero and stop the learning process?
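To see the concern in the question concretely, a minimal sketch (not from the lesson) of how ReLU's gradient behaves:

```python
import torch

# ReLU's gradient is 0 wherever the input is negative, so those units
# contribute no gradient for this batch -- the situation the question raises
x = torch.tensor([-2.0, -0.5, 1.0, 3.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)   # tensor([0., 0., 1., 1.])
```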