Lesson 6 - Non-beginner discussion

This is the topic for any non-beginner discussion around lesson 6. It won’t be actively monitored by Jeremy or me tonight, but we will answer outstanding questions in here tomorrow (if needed).

With small batches, lr_find has very little to work with, so it doesn’t give great suggestions. Could we just add the GradientAccumulation callback in the lr_find code?
Or should num_it (which affects the number of epochs) be changed instead?
What is the best way to deal with small batches?
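
For anyone unfamiliar with the idea, here is a minimal plain-PyTorch sketch of what gradient accumulation does. It is not fastai’s callback and not the lr_find code; the toy model, the random data, and the name n_acc are assumptions made up for illustration. It only checks that summing scaled gradients over several small batches reproduces the gradient of one larger batch.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)

# Reference: gradient computed from one batch of 8 samples.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulate over 4 micro-batches of 2 samples each (n_acc is illustrative).
n_acc = 4
model.zero_grad()
for xb, yb in zip(x.chunk(n_acc), y.chunk(n_acc)):
    # backward() adds into .grad; dividing by n_acc turns the sum of
    # micro-batch means into the mean over all 8 samples.
    (loss_fn(model(xb), yb) / n_acc).backward()
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-5))
```

The optimizer step would then be taken once per n_acc micro-batches, so the update behaves like it came from the larger batch.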


Can the DataBlock API pull batches from the DataSet using slices rather than one at a time when not shuffled? I found this situation when generating the fixed features for ROCKET time series classification.

@Pomo I’m going to show an example of this, combining fastai2 tabular and NumPy, tomorrow (writing the blog now).


Something Jeremy said is that if you see overfitting, instead of taking the best model, you should re-train with n_epochs equal to the epoch at which the learner starts to overfit (e.g. retrain the cnn_learner for 8 epochs instead of 12).

Is there a reason for this? Jeremy said something like you want the learner to have a low learning rate at the final steps, but I don’t see how that impacts the performance of the final model. Has anyone done any experiments with using the SaveModelCallback vs re-training at the “ideal” number of epochs?

Practically speaking, if this is the case, would that mean (assuming no time/resource constraints) it is always better to let the learner train for a large number of epochs, then do one final training run with a reduced number of epochs to get the best model possible?
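
To make the “low learning rate at the final steps” point concrete, here is a rough sketch of a cosine warmup/anneal schedule in the spirit of one-cycle training. The function name and the constants (lr_max, pct_start, div, div_final) are assumptions for illustration, not values read from fastai’s source. It compares the learning rate at a checkpoint saved after epoch 8 of a 12-epoch run with the learning rate at the end of an 8-epoch run.

```python
import math

def one_cycle_lr(pct, lr_max=1e-3, pct_start=0.25, div=25.0, div_final=1e5):
    """Learning rate at training progress pct in [0, 1] (simplified sketch)."""
    def cos_interp(start, end, p):
        # Cosine interpolation from start (p=0) to end (p=1).
        return end + (start - end) * (1 + math.cos(math.pi * p)) / 2

    if pct < pct_start:
        # Warmup phase: rise from lr_max / div to lr_max.
        return cos_interp(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: decay from lr_max to lr_max / div_final.
    return cos_interp(lr_max, lr_max / div_final,
                      (pct - pct_start) / (1 - pct_start))

# Checkpoint after epoch 8 of 12 vs. the end of a full 8-epoch schedule.
lr_at_8_of_12 = one_cycle_lr(8 / 12)
lr_at_8_of_8 = one_cycle_lr(8 / 8)
print(f"epoch 8 of 12: lr = {lr_at_8_of_12:.2e}")
print(f"epoch 8 of 8:  lr = {lr_at_8_of_8:.2e}")
```

Under these assumptions, the saved checkpoint was still being trained at a sizeable fraction of the peak learning rate, while the 8-epoch run has annealed all the way down by the same point.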


I’m reading the Cyclical learning rates paper and I want to know how they are used with adaptive optimizers like Adam. As I understand it, adaptive optimizers adapt the learning rate after every iteration. So how does that work with cyclical learning rates, which also change the learning rate after each iteration?

@kushaj can you help me with this?

Say you are using Adam. You create an Adam optimizer as follows:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

As you can see from the above example, we pass a learning rate (lr) to the Adam optimizer. When we say Adam is an adaptive optimizer, i.e. it adjusts the learning rate for each parameter, that adjustment is relative to this learning rate.

In this example, we chose lr=0.001. In Adam, this lr=0.001 is multiplied by a per-parameter value that is calculated from running averages of the gradient and of the squared gradient (see the algorithm for details).
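
As a rough from-scratch sketch (not PyTorch’s actual implementation), here is one Adam update for a single scalar parameter. The function name and its arguments are made up for illustration; the point is only that lr multiplies the per-parameter adaptive term, so doubling lr doubles the step.

```python
import math

def adam_step(grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns (delta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    # lr scales the adaptive, per-parameter term m_hat / sqrt(v_hat).
    delta = -lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta, m, v

# Same gradient and state, two different base learning rates.
delta_small, _, _ = adam_step(grad=0.5, m=0.0, v=0.0, t=1, lr=0.001)
delta_large, _, _ = adam_step(grad=0.5, m=0.0, v=0.0, t=1, lr=0.002)
print(delta_small, delta_large)
```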

When we use cyclical learning rates, we are changing the learning rate that is passed to Adam in the first place. Conceptually, it is like doing this:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# After the first iteration, get new_lr from the cycle
optimizer = torch.optim.Adam(model.parameters(), lr=new_lr)

# After the second iteration, get the next new_lr from the cycle
optimizer = torch.optim.Adam(model.parameters(), lr=new_lr)
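
In real code you would not rebuild the optimizer each iteration, since that would throw away Adam’s running averages. A minimal sketch of the usual pattern is to overwrite lr in the optimizer’s param_groups instead. The toy model, random data, and the cycle_lrs list below are hypothetical values for illustration only.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Hypothetical learning rates a cyclical schedule might produce, one per iteration.
cycle_lrs = [0.001, 0.002, 0.003, 0.002, 0.001]

x, y = torch.randn(8, 4), torch.randn(8, 1)
for new_lr in cycle_lrs:
    # Overwrite the base lr in place; Adam's running statistics are untouched.
    for group in optimizer.param_groups:
        group["lr"] = new_lr
    optimizer.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    optimizer.step()

# The per-parameter state has been accumulating across all iterations.
state = optimizer.state[model.weight]
print(optimizer.param_groups[0]["lr"], int(state["step"]))
```

PyTorch also ships schedulers such as torch.optim.lr_scheduler.CyclicLR and OneCycleLR that apply this kind of per-iteration update for you.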

Ask for further clarification if needed.
