Lesson 5 In-Class Discussion ✅

Yes, it does: it uses the settings from the DataBunch (which is why, if you change something in your DataBunch, you should rerun it), and one of those settings is the batch size. Since the LR finder runs through a number of batches and adjusts the learning rate after each batch, the batch size has a very direct influence on it!
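For example (a minimal sketch assuming fastai v1; `path` is a placeholder for an ImageNet-style image folder):

```python
from fastai.vision import ImageDataBunch, cnn_learner, models

data = ImageDataBunch.from_folder(path, bs=64)   # bs is the batch size the LR finder will use
learn = cnn_learner(data, models.resnet34)

learn.lr_find()          # mock training over a range of learning rates, one step per batch
learn.recorder.plot()    # plot loss vs. learning rate to pick a value

# If you rebuild the DataBunch with a different bs, rerun lr_find():
# learn.data = ImageDataBunch.from_folder(path, bs=256); learn.lr_find()
```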


Here’s an interesting paper that looks at the generalization gap caused by batch-size differences and proposes a few tweaks to avoid it. The paper finds that running more epochs with a higher batch size can achieve similar generalization, and argues that it’s not the batch size that matters but the number of gradient updates. The authors also suggest scaling the learning rate with batch size, although they use square-root scaling rather than linear scaling.
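For illustration, square-root scaling versus linear scaling of the learning rate with batch size would look something like this (the base values are made up):

```python
import math

base_lr, base_bs = 1e-3, 64   # hypothetical reference values
new_bs = 256

# Square-root scaling: lr grows with the square root of the batch-size ratio
sqrt_lr = base_lr * math.sqrt(new_bs / base_bs)   # 1e-3 * sqrt(4) = 2e-3

# Linear scaling, for comparison
linear_lr = base_lr * (new_bs / base_bs)          # 1e-3 * 4 = 4e-3
```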

Also some interesting points on overfitting. The authors find they can improve generalization by continuing to train the model after the validation loss plateaus.


Will @jeremy also cover how progressive resizing works with the different architectures, @sgugger? For example, for resnet34 a size of 224 was suggested as best for transfer learning because that was the size it was trained at, and for resnet50 it was 299. How is this reconciled with 64 → 128 → 256 style progressive sizing? Or is that best for CNNs trained from scratch rather than for transfer learning?
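For reference, progressive resizing with a pretrained model is typically done along these lines in fastai v1 (a sketch; `path`, the sizes, and the epoch counts are placeholders, and `get_data` is a hypothetical helper):

```python
from fastai.vision import ImageDataBunch, cnn_learner, models, get_transforms

def get_data(size, bs):
    # Hypothetical helper: build a DataBunch at the requested image size
    return ImageDataBunch.from_folder(path, ds_tfms=get_transforms(), size=size, bs=bs)

learn = cnn_learner(get_data(128, 64), models.resnet34)
learn.fit_one_cycle(4)

# Swap in larger images and keep training the same weights
learn.data = get_data(224, 32)
learn.fit_one_cycle(4)
```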


Sometimes it may be possible to transform a tabular data task into an image task. E.g. movie ratings data could be translated into an image where the x coordinate is users, the y coordinate is movies, and the ratings are translated into pixel colors. Since the actual position of pixels and their grouping (correlation) is what matters for images (similar pixels are typically close to each other in an image), this would only make sense if there is additional metadata on users and movies, so that both users and movies could be meaningfully ranked (grouped). Not sure if this has been tried before, but technically you could use a resnet model on that. Maybe this is worth a try :slight_smile:
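If anyone wants to try it, the encoding step could be sketched like this (purely illustrative; sorting users/movies by metadata is the part that would need real thought):

```python
import numpy as np

n_users, n_movies = 224, 224
ratings = np.random.randint(0, 6, size=(n_users, n_movies))  # 0 = unrated, 1-5 = rating

# Users on the y axis, movies on the x axis, rating mapped to pixel intensity.
# In practice you would first sort users/movies by metadata so similar ones sit adjacent.
img = (ratings / 5.0 * 255).astype(np.uint8)

# Stack to 3 channels so it can be fed to a resnet
rgb = np.stack([img, img, img], axis=-1)   # shape (224, 224, 3)
```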

I have used the tabular learner on a raw-numbers-only dataframe, without categorical variables and therefore without embeddings. So basically creating a plain neural net with a few layers using fastai.tabular. A few tweaks are necessary, but overall it works fine. Head over to our Time Series Learning Competition thread to check it out! (shameless advertising! :wink: )
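For anyone curious, a minimal fastai v1 sketch with continuous columns only (no cat_names, hence no embeddings); `df` and the column names are placeholders:

```python
from fastai.tabular import TabularList, tabular_learner, Normalize

cont_names = ['feat1', 'feat2', 'feat3']   # placeholder continuous columns
dep_var = 'target'

data = (TabularList.from_df(df, cat_names=[], cont_names=cont_names, procs=[Normalize])
        .split_by_rand_pct(0.2)
        .label_from_df(cols=dep_var)
        .databunch(bs=64))

learn = tabular_learner(data, layers=[200, 100])   # a plain fully connected net
learn.fit_one_cycle(5)
```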


More specifically: embedding() creates a weight matrix. The parameters to the function are the dimensions of the matrix. We normally say rows by columns when describing a matrix, so the above call creates a weight matrix of size (# users) × 1.
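In PyTorch terms, that is (a minimal sketch; the numbers are arbitrary):

```python
import torch.nn as nn

n_users = 1000                      # number of rows
emb = nn.Embedding(n_users, 1)      # weight matrix of shape (n_users, 1)
print(emb.weight.shape)             # torch.Size([1000, 1])
```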


That won’t generally help, since your OS will already cache for you.


There isn’t one table showing them; you have to grep the code to find them.

Easiest is just to create a learner, and then check learn.loss_func to see what was used.
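For example (a quick sketch assuming fastai v1 and an existing `data` DataBunch):

```python
from fastai.vision import cnn_learner, models

learn = cnn_learner(data, models.resnet34)
print(learn.loss_func)   # shows the loss picked for this data, e.g. a flattened cross-entropy for classification
```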


I get an error while trying to print out a TabularList: ‘TabularList’ object has no attribute ‘codes’.
Digging into the source code, I found this line: codes = [] if self.codes is None else self.codes[0].
But apparently TabularList has no attribute ‘codes’. Can somebody look into this?


All of this is great information and is helping me clear up my understanding of batch size.

@lesscomfortable also found a concise explanation on this matter while reading through this notebook.
I believe this is essentially what @jcatanza was saying as well.

What does the loss tell us?

The loss is very noisy! While decreasing the batch size, we increased the number of learning steps, hence our model learns faster. But… with a smaller batch size there are fewer samples to learn from, to compute gradients from! The gradients we obtain may be very specific to the images and class labels covered by the batch of the current learning step. There was a trade-off we made: we gained more learning speed but paid with reduced gradient quality. Before increasing the batch size again and waiting too long for predictions, we might improve by choosing another way:

  1. Weight regularization
  2. Gradient clipping

source link: https://www.kaggle.com/allunia/protein-atlas-exploration-and-baseline/notebook
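In plain PyTorch, those two knobs look roughly like this (a sketch; `model`, `loader`, `loss_func`, and the hyperparameter values are placeholders):

```python
import torch

# 1. Weight regularization (L2) via the optimizer's weight_decay
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

for xb, yb in loader:
    opt.zero_grad()
    loss = loss_func(model(xb), yb)
    loss.backward()
    # 2. Gradient clipping: rescale gradients whose overall norm exceeds 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
```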


I am a bit confused about the correspondence between the math and the lesson 2 SGD code.
The SGD formula says w_t = w_{t-1} - lr * dL/dw_{t-1}, which is crystal clear.
However, in the update() function loop we have:
a.sub_(lr * a.grad), where a is our weights tensor. Why a.grad, and where is the loss derivative here?

PyTorch does automatic differentiation, so a torch tensor stores the gradient from the computation done with it. a.grad is the derivative of the loss with respect to the weights (parameters) a, and we multiply it by the learning rate to get the update. So to answer where the loss derivative is: that is exactly a.grad; the gradients are calculated in the backward pass from the loss value.
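Here is a tiny end-to-end sketch of that update (in the spirit of the lesson 2 notebook; the data and learning rate are made up):

```python
import torch

x = torch.randn(100, 1)
y = 3 * x + 2 + 0.1 * torch.randn(100, 1)   # noisy line to fit

a = torch.randn(2, requires_grad=True)      # our parameters (slope, intercept)
lr = 0.1

def model(x): return a[0] * x + a[1]

for _ in range(100):
    loss = ((model(x) - y) ** 2).mean()     # MSE loss
    loss.backward()                         # autograd fills a.grad with dLoss/da
    with torch.no_grad():
        a.sub_(lr * a.grad)                 # the update: w_t = w_{t-1} - lr * grad
        a.grad.zero_()                      # reset gradients for the next step
```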


This makes sense! I didn’t know about the automatic differentiation feature in PyTorch. Thanks!


Jeremy says in lesson 4 that the sigmoid activation applied to the output of the model is the only non-linearity.
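If it helps, one common way that output sigmoid is used (as I understand the collab model from the lesson) is to squash the final score into the target range; a small sketch, with `y_range` as an assumed rating range:

```python
import torch

def scaled_sigmoid(score, y_range=(0, 5.5)):
    # Map an unbounded model output into the rating range via the sigmoid
    low, high = y_range
    return torch.sigmoid(score) * (high - low) + low

print(scaled_sigmoid(torch.tensor([-3.0, 0.0, 3.0])))  # values squashed into (0, 5.5)
```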


FYI

I found this Coursera series on Hyperparameter Tuning, Regularization and Optimization by Andrew Ng to be a great follow-up to Jeremy’s session (no registration required).

Topics covered:


I also found this paper on the same subject.

Can anyone suggest what structure I should follow to fine-tune the language model?
Any suggestions? @sgugger
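Not the original poster, but the usual structure from the course notebooks goes roughly like this (a sketch assuming fastai v1 and an existing language-model DataBunch `data_lm`; the hyperparameters are placeholders):

```python
from fastai.text import language_model_learner, AWD_LSTM

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

# 1. Fine-tune only the last layer group on the target corpus
learn.fit_one_cycle(1, 1e-2)

# 2. Unfreeze and fine-tune the whole model with lower, discriminative learning rates
learn.unfreeze()
learn.fit_one_cycle(3, slice(1e-4, 1e-3))

# 3. Save the encoder for reuse in a downstream classifier
learn.save_encoder('fine_tuned_enc')
```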

Since it’s unclear, and seemingly guesswork, to determine what features “actually” represent in the real world, how can we determine whether or not our models are prioritizing features that might be ethically problematic?

Did you solve it? I am encountering the same problem.