Lesson 6 - Official topic

Yup, makes sense. Sounds like the primary reason is mostly for the augmentation bits (e.g., to move the points to the right spot, or adjust a bounding box, etc.).

Exactly :slight_smile:

This is really weird.
Can you please print movies.head() and ratings.head()?

Something Jeremy said is that if you see overfitting, instead of taking the best model, you should re-train with n_epochs equal to the epoch at which the learner starts to overfit (e.g. retrain the cnn_learner with 8 epochs instead of 12).

Is there a reason for this? Jeremy said something like you want the learner to have a low learning rate in the final steps, but I don’t see how that impacts the performance of the final model. Has anyone done any experiments comparing the SaveModelCallback with re-training at the “ideal” number of epochs?

Practically speaking, if this is the case, that would mean (assuming no time/resource constraints) it would always be better to let the learner train for a large number of epochs, then do one final training at a reduced number of epochs to get the best model possible?
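For anyone who wants to run that comparison, here is a minimal sketch (assuming an existing fastai Learner called learn; the monitor and epoch counts are made up):

from fastai.callback.tracker import SaveModelCallback

# option A: train long and keep the checkpoint with the best validation loss
learn.fit_one_cycle(12, cbs=SaveModelCallback(monitor='valid_loss', fname='best'))
learn.load('best')

# option B: re-create the learner from scratch and train for the "ideal" number
# of epochs (e.g. 8, if that is where overfitting started in the first run)
# learn = cnn_learner(dls, resnet34, metrics=accuracy)
# learn.fit_one_cycle(8)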

Suppose I want to detect whether a person is wearing eyeglasses. How would I approach this problem? Is this binary classification? And how should I structure the dataset, e.g. should I get images of people wearing eyeglasses and people without eyeglasses? Thank you!

Yes. See the dog/cat classification models from the book on how to set this up (that would be a good approach in terms of structuring things).

Yes.

And make sure you split your data so that your validation set contains people not seen in the training set, and also that you have a good representation of folks with and without glasses.
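Here is one way that split could look in fastai, as a minimal sketch: the folder layout, the person_id() helper and the hold-out people are all hypothetical, assuming each person’s identity is encoded in the filename.

from fastai.vision.all import *

path = Path('data/eyeglasses')    # hypothetical layout: glasses/ and no_glasses/ subfolders

def person_id(fname):             # hypothetical helper: person identifier taken from the filename
    return fname.stem.split('_')[0]

valid_people = {'person042', 'person057'}   # made-up people held out for validation

dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,                                          # label = parent folder name
    splitter=FuncSplitter(lambda f: person_id(f) in valid_people),
    item_tfms=Resize(224))
dls = dblock.dataloaders(path, bs=32)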

Firstly, note that the graph is smoothed by means of a moving average, so it is “delayed” w.r.t. the true values. Plus, just after the minimum the 1st derivative changes its sign (by definition), so you don’t want to pick your LR at the minimum.
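As a minimal sketch (assuming an existing fastai Learner called learn; the final value is made up):

suggestion = learn.lr_find()   # plots the smoothed loss vs. learning rate
print(suggestion)              # fastai also returns suggested value(s)

# pick a rate on the steep downward slope, i.e. somewhat *before* the minimum
learn.fit_one_cycle(3, lr_max=3e-3)   # 3e-3 is a value you would read off the plot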

They cannot cover all the details in such a course, but the loss surfaces generated by NNs are very peculiar. A lot of local minima are good enough.

See, for instance, the paper by Choromanska and LeCun about spin glasses.

Afaik, every implementation around is inspired by Fastai.

You won’t find the “best” one anyhow. And maybe you don’t want to find it… You want to find a minimum which is good enough and generalizes well.

Would you link me the exact position in the video? Thanks.

Calling it a “gradient” is not advisable :slight_smile:

Strange. I have no knowledge of low-level operations, but I thought that when it comes to stuff like releasing VRAM, it would be pure CUDA implemented at the PyTorch core level.
I mean, Python should be just an interface here.
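For reference, the usual Python-level steps are just these (learn here stands for whatever model/learner you want to drop):

import gc
import torch

del learn                   # drop the Python references first
gc.collect()                # let Python free the objects
torch.cuda.empty_cache()    # then ask PyTorch's CUDA caching allocator to release its cached blocks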

Search for “mixed precision training” if you want more details about the tricks referenced by sgugger. Briefly, some parts get done in fp16 and some others in fp32.
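In fastai it is a one-liner; a minimal sketch (assuming an existing DataLoaders called dls):

from fastai.vision.all import *

# to_fp16() wraps the Learner so activations and gradients are computed in fp16,
# while the master weights (and the loss-scaling bookkeeping) stay in fp32
learn = cnn_learner(dls, resnet18, metrics=accuracy).to_fp16()
learn.fine_tune(1)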

I think you are misremembering, but I may be wrong. Please link the relevant lessons/notebooks/discussions if you manage to find them.

Mh, interesting, but strange. In the end, the only metric that should really matter is the valid loss. I always thought that an accuracy that swings a bit while the losses are still decreasing was a phenomenon of statistical origin. It would be very welcome if you could talk a bit more extensively about that.

I’m not sure I’m understanding your question the right way, could you elaborate a bit? In the meantime, please appreciate that:

  1. The network you posted is not just made of linear layers. They are linear layers followed by non-linear activations.
  2. Resnet18 is a conv net with residual blocks, something very different from the NN you posted (see the sketch below).
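To make both points concrete, here is a rough PyTorch sketch (not the exact torchvision code; the sizes are made up):

import torch.nn as nn

# the "linear layers" in the net you posted are really Linear + non-linear activation
mlp = nn.Sequential(
    nn.Linear(784, 50), nn.ReLU(),
    nn.Linear(50, 10),
)

# a ResNet-style basic block adds a skip connection around two convolutions
class BasicBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)   # the residual / skip connection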

How is that?

There was an old discussion between the two of us :slight_smile:
As you duly reported in your article, Pascal is perfectly capable of supporting MPT without any convergence penalty, but the speedup will be only marginal compared to what you get on a Turing card.
So, if you want to “almost double” your VRAM, you should use it on Pascal.

Very good answer with very good links.

If we see a substantial drop in loss when running multiple epochs with the model frozen, then why is the default in fine_tune() to fit for only 1 frozen epoch?

It seems to me that we would want to run for some number of epochs until the benefit drops off (say, to 10% of the first epoch), and then open up the rest of the weights to training. Why is that intuition wrong?
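For context, this is roughly what fine_tune(epochs, freeze_epochs=1) does under the hood, as a simplified sketch of the fastai source (the one-cycle details and exact learning-rate handling are omitted):

# simplified sketch, not the actual implementation
def fine_tune(learn, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100):
    learn.freeze()                                   # only the new head is trainable
    learn.fit_one_cycle(freeze_epochs, slice(base_lr))
    base_lr /= 2
    learn.unfreeze()                                 # now train the whole network
    learn.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr))

You can pass the frozen phase length yourself, e.g. learn.fine_tune(6, freeze_epochs=3).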

I would recommend trying out the scenarios and seeing what happens.

From my experience with freeze_epochs, you can train something faster if you use more freeze_epochs, but the final accuracy will not be as high as if you used freeze_epochs=1. Note that if you use freeze_epochs=1, you generally need more fine-tuning cycles in order to see the accuracy get higher.

I don’t have a mathematical proof, but I’ve run a few experiments. My gut feeling is that with higher freeze_epochs, certain neurons in the model.head layer become more and more important. Taken to the limit, only certain neurons will be used, and the rest become “dead”. I think when this happens, your model.head is overfitting already. For the model to generalize, I think you want all the neurons in use, with each neuron contributing a little bit to the output.

At least for vision, my experience is you want the coarse training stage to get you to maybe ~50-70% accuracy, and then let the fine-tuning get you the rest of the way there. This has given me the highest accuracy of various trials.

~18:50

Was there any information given regarding this? I’m asking since I watched the edited video only and could not watch the live stream.

Hello, I have been playing around with MovieLens 25M, with the following model:

bs = 1024
learn = collab_learner(dls,
                       use_nn=True,
                       y_range=(0, 5.5),
                       layers=[100, 100, 100, 100, 100, 50],
                       config={"embed_p": 0, "ps": [0.5]*7})

I get a 0.67 valid loss.

I found some benchmarks, but nothing comprehensive, or for 25M.

Does someone know a page with benchmarks?

Note btw that the binary_cross_entropy() function in the video has a small error, but this is fixed in the fastbook repo (inputs and 1-inputs were switched):

import torch

def binary_cross_entropy(inputs, targets):
    # inputs are raw activations, so squash them into (0, 1) first
    inputs = inputs.sigmoid()
    # take the predicted probability of the correct class, then -log and average
    return -torch.where(targets==1, inputs, 1-inputs).log().mean()

nn.BCEWithLogitsLoss differs from the binary_cross_entropy function above in that

  • it is a module and not a function (its functional form is F.binary_cross_entropy_with_logits), and
  • it applies the sigmoid (not softmax) before doing the binary cross-entropy, as the sketch below shows.
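A quick check (with made-up toy activations and 0/1 targets) that the fixed function matches PyTorch’s built-ins:

import torch
import torch.nn.functional as F

def binary_cross_entropy(inputs, targets):
    inputs = inputs.sigmoid()
    return -torch.where(targets==1, inputs, 1-inputs).log().mean()

acts = torch.randn(4, 3)                     # toy activations (logits)
targs = torch.randint(0, 2, (4, 3)).float()  # toy 0/1 targets

print(binary_cross_entropy(acts, targs))
print(F.binary_cross_entropy_with_logits(acts, targs))
print(torch.nn.BCEWithLogitsLoss()(acts, targs))
# all three should print (approximately) the same value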

From the book:

F.binary_cross_entropy, and its module equivalent nn.BCELoss, calculate cross entropy on a one-hot encoded target, but do not include the initial sigmoid. Normally for one-hot encoded targets you’ll want F.binary_cross_entropy_with_logits (or nn.BCEWithLogitsLoss), which do both sigmoid and binary cross entropy in a single function, as in our example above.

I think the second and third arguments of torch.where should be interchanged. Also, if I’m not mistaken, there should be a negative sign before it. That’s just my guess, after looking at your post. I haven’t tried it out yet. Hopefully that helps. I can check it out later today, should it not be the reason for the discrepancy.

EDIT: Just realised that the same answer has already been proposed by @hallvagi in https://forums.fast.ai/t/lesson-6-official-topic/69306/355?u=gautam_e
