What foundational topics would you like to hear more about?

Use of *args and **kwargs

I have some intuition around it, but can’t fully grasp when to use what.
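A minimal sketch of the split (function names are made up for illustration): `*args` collects extra positional arguments into a tuple, `**kwargs` collects extra keyword arguments into a dict, and the same stars unpack sequences/dicts on the calling side:

```python
def log_call(func_name, *args, **kwargs):
    # *args is a tuple of positionals, **kwargs a dict of keywords.
    parts = [repr(a) for a in args]
    parts += [f"{k}={v!r}" for k, v in sorted(kwargs.items())]
    return f"{func_name}({', '.join(parts)})"

def add(*args):
    # Accept any number of positional arguments.
    return sum(args)

def make_point(**kwargs):
    # Accept arbitrary keyword arguments.
    return dict(kwargs)

print(log_call("fit", 5, lr=0.01))  # fit(5, lr=0.01)
print(add(1, 2, 3))                 # 6
print(make_point(x=1, y=2))         # {'x': 1, 'y': 2}

# The same stars unpack on the calling side:
coords = (1, 2)
opts = {"x": 1, "y": 2}
print(add(*coords))                 # 3
print(make_point(**opts))           # {'x': 1, 'y': 2}
```

Rough rule of thumb: use `*args`/`**kwargs` when you genuinely don't know (or don't care) what extra arguments will arrive, typically to pass them through to another function.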

2 Likes

Oh good one.

1 Like

What can the average and variance of the output of each layer reflect? Is the model training good or bad?

Batch Norm, Instance Norm, Group Norm, Layer Norm vs. Weight Normalization?

4 Likes

You also want the function to give good gradient juice in the right places.
For example, in Focal Loss for Dense Object Detection [0] they come up with a function that focuses on classes that are underrepresented.

[0] https://arxiv.org/abs/1708.02002

…and maybe weight standardization?

1 Like

One more vote for a more detailed explanation of how to create a custom callback. What are the pieces we have to build, step by step, as though we don’t know much? What are the special things we have to do if we want to build a callback that descends from LearnerCallback, where we have to pass the “learn” object when we instantiate it? Once we understand callbacks, there are no limits to what we can do with the fastai library.
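This isn't fastai's actual API, but a stripped-down sketch of the general pattern a callback handler follows may help build intuition: the handler owns a list of callback objects and, at each stage of training, invokes the matching hook on any callback that defines it.

```python
# Sketch of the callback pattern (NOT fastai's real classes or signatures).

class CallbackHandlerSketch:
    """Dispatches named hooks to every registered callback."""
    def __init__(self, callbacks):
        self.callbacks = callbacks

    def __call__(self, hook_name, **state):
        for cb in self.callbacks:
            hook = getattr(cb, hook_name, None)
            if hook is not None:
                hook(**state)

class RecordLossCallback:
    """Only implements the hooks it cares about."""
    def __init__(self):
        self.losses = []

    def on_batch_end(self, loss, **state):
        self.losses.append(loss)

def train_loop(handler, losses_per_batch):
    # A training loop just fires hooks at the right moments.
    handler("on_train_begin")
    for i, loss in enumerate(losses_per_batch):
        handler("on_batch_end", batch=i, loss=loss)
    handler("on_train_end")

cb = RecordLossCallback()
handler = CallbackHandlerSketch([cb])
train_loop(handler, [0.9, 0.5, 0.3])
print(cb.losses)  # [0.9, 0.5, 0.3]
```

The real library adds more machinery (hooks can mutate or skip training state, callbacks are ordered, etc.), but the core dispatch loop looks like this.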

2 Likes

The majority of the time we just concentrate on the datasets that are already there.
I would like to see the best way to tackle a problem from scratch: how to handle the data engineering and then the creation of the model.

1 Like

Cross entropy, by the way, has a really fun intuition :nerd_face:

Given an event that has probability p, information theory says (not sure I can come up with a better justification) that it’s interesting to look at the quantity

\log 1/p = - \log p

I think the \log 1/p version is easier to think about, but you’ll see -\log p more often. There are various names for this thing, but the one I like the most is to call it the “surprise” of the event. Here’s a nice post that covers the idea in more depth, but the name makes some sense: if an event has probability 1, then its surprise is \log 1/1 = 0, which seems about right; and as the event gets less and less likely (that is, as p goes to zero), the surprise gets bigger and bigger. And finally, it has the nice property that if you have two independent events with probabilities p_1 and p_2, then their joint surprise works out to be just the sum of their individual surprises: \log 1/p_1 + \log 1/p_2. This works because the joint probability of two independent events is the product of their probabilities, p_1 p_2, and \log turns products into sums.

At any rate, with this interpretation, the cross entropy loss measures how surprised your model is by the training set, on average :slight_smile: As you train your model, you’re tweaking it so that it finds the training data less and less surprising.

More generally, anywhere you see a “negative log likelihood”, you can think of it as a surprise if you like.

This surprise concept is useful in other places too. For example, the KL divergence from one probability distribution p to another probability distribution q measures the “excess” surprise you would feel, on average, if you thought q was the right distribution for whatever you’re studying—when whoops, actually p is.
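The whole chain above fits in a few lines of code. This is just a sketch of the definitions as stated (the example probabilities are made up):

```python
import math

def surprise(p):
    # "Surprise" of an event with probability p: log(1/p) = -log(p).
    return -math.log(p)

# Certain events carry zero surprise...
assert surprise(1.0) == 0.0
# ...and independent surprises add, because log turns products into sums.
p1, p2 = 0.5, 0.25
assert math.isclose(surprise(p1 * p2), surprise(p1) + surprise(p2))

def cross_entropy(true_labels, predicted_probs):
    # Average surprise the model feels at the true labels.
    return sum(surprise(probs[label])
               for label, probs in zip(true_labels, predicted_probs)) / len(true_labels)

def kl_divergence(p, q):
    # Excess average surprise from believing q when p is actually right:
    # sum_i p_i * (surprise under q - surprise under p).
    return sum(pi * (surprise(qi) - surprise(pi)) for pi, qi in zip(p, q))

preds = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(cross_entropy([0, 1], preds))            # ≈ 0.29: not very surprised
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # ≈ 0.51: positive unless p == q
```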

11 Likes

How to maintain backward compatibility for your existing models in production when you’re forced to upgrade fastai in order to get new features or to fix bugs. For me, it’s been a nightmare having to retrain and redeploy production models: they suffer various errors, I check the forums, and the answer is to upgrade fastai. However, when I upgrade, the old learner I saved fails to load correctly and sometimes I have to change the syntax of my code. Then a few days later I encounter a new error, check the forums, find that again I need to update fastai, and end up having to retrain my models yet again, ad infinitum.

I’d like to learn a more effective way to go about this.

I like that analogy :slight_smile:
When I hear “cross entropy”, my brain goes back to entropy vs. enthalpy in physics classes, and that doesn’t make a whole lot of sense. Thank you for making it easier to remember!

1 Like

That does sound very annoying! You shouldn’t need to retrain your models - since we’re not doing things that need your weights to change. They’re still just plain pytorch models. Please do let me know if this happens again so we can better understand the issue and figure out how to deal with it.

3 Likes

There is a YouTube video that explains entropy. Maybe it’s helpful.

1 Like

I’m really trying to understand the source code by using the debugger and stepping through things, but I often struggle to follow completely.

For instance, I’ve spent the last few nights trying to figure out how exactly the pct_start argument in OneCycleScheduler affects the learning rate shape. I know that it controls the number of iterations where the learning rate is increasing, but I want to know how exactly it does that. As another example, I’m trying to figure out what transforms are applied to the response when tfm_y = True. I’m having trouble hooking into the right point in the library to even start the debugging exploration.

So in summary, I would be interested in knowing some tricks to better trace the fastai source code to get answers myself that aren’t necessarily in the documentation.
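On the pct_start question specifically: this isn't fastai's actual code, but here is a sketch of what such a schedule can look like, assuming cosine interpolation in both phases and made-up parameter names (`div`, `final_div`) for the start/end learning rates.

```python
import math

def one_cycle_lr(it, total_its, lr_max, pct_start=0.3, div=25.0, final_div=1e4):
    # Sketch of a one-cycle schedule (not fastai's real implementation).
    # pct_start is the fraction of iterations spent *increasing* the LR;
    # the remaining iterations anneal it back down.
    warmup_its = int(total_its * pct_start)

    def cos_anneal(start, end, frac):
        # Cosine interpolation from start to end as frac goes 0 -> 1.
        return end + (start - end) / 2 * (math.cos(math.pi * frac) + 1)

    if it < warmup_its:
        return cos_anneal(lr_max / div, lr_max, it / warmup_its)
    frac = (it - warmup_its) / (total_its - warmup_its)
    return cos_anneal(lr_max, lr_max / final_div, frac)

total = 100
lrs = [one_cycle_lr(i, total, lr_max=0.1, pct_start=0.3) for i in range(total)]
# The LR peaks right where the warmup phase ends: at pct_start * total.
print(max(range(total), key=lambda i: lrs[i]))  # 30
```

Plotting `lrs` shows the familiar one-cycle shape: a rise over the first 30% of iterations, then a longer descent toward a tiny final value.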

I also echo the requests for understanding callbacks better. I know they’re very powerful for enabling customizations to the fitting process, but I have trouble understanding how they are managed by the CallbackHandler and when they actually get executed.

1 Like

When I go to Kaggle, many of the competitions are scored on the ROC curve.
I’ve seen videos and demos, but I still don’t quite grasp how to do it.

Is there any way to explain the best way to generate the ROC curve?
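In practice you would usually call `sklearn.metrics.roc_curve`, but a from-scratch sketch (with made-up data, and assuming all scores are distinct; ties need extra grouping) shows what's going on: sweep the decision threshold from high to low and record the false-positive and true-positive rates at each step.

```python
def roc_points(y_true, scores):
    # Sweep the threshold from highest score to lowest; each sample we
    # "accept" bumps either the true-positive or false-positive count.
    pairs = sorted(zip(scores, y_true), reverse=True)
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for score, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

def auc(points):
    # Trapezoidal area under the ROC curve.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
pts = roc_points(y_true, scores)
print(auc(pts))  # 0.75
```

To actually draw the curve, plot the FPR values against the TPR values; an AUC of 0.5 is the diagonal (random guessing) and 1.0 is a perfect ranking.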

Not a great solution, but I keep two Anaconda environments for fastai. One is a “bleeding edge” install that I keep updated for the Part 2 course. The other is for projects I’m working on and stays a few versions behind.

This video also has an excellent way of explaining it and how it’s related to Shannon’s information theory: https://www.youtube.com/watch?v=9r7FIXEAGvs

@hiromi: one problem with the loss functions you are considering is that they don’t have the first property that you mentioned – i.e. that the derivative w.r.t y vanishes for y = y_hat.

It doesn’t have to be zero there. It just needs to be a smaller number when they’re closer.

I really appreciated the *args and **kwargs explanation in lesson 10 :slight_smile:

I just found this fast.ai Wiki: Deep Learning Glossary - super helpful!

2 Likes