Questions on unfreeze and learning rate (see graph)

In my practice for lessons 1 and 2 I have been tweaking the inputs to the training functions, all with the aim of improving accuracy for image classification.

The practice dataset (histopathology) contains about 200,000 images. The validation set holds 25-30% of them and the remainder is used for training. It didn’t seem to make a big difference whether I used 75% or 100% of the data (about 0.003% on AUC).

Besides tweaking the training-sample size, I tried to change the following (a rough code sketch of this workflow is included after the list):

  • the number of epochs
  • the number of epochs before calling unfreeze()
  • the learning rates, with max_lr and lr_range(slice())
  • using a very small learning rate AFTER unfreezing, e.g. learn.unfreeze() followed by slice(1e-07, 1e-05)
  • data augmentation (playing with brightness(), contrast(), jitter(), flip_vert, rotate())
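
For reference, here is a minimal sketch of the workflow these tweaks plug into, assuming the fastai v1 API from the lessons; the path, image size, epoch counts and learning rates below are placeholders, not recommendations:

    from fastai.vision import *          # fastai v1 API, as used in the lessons

    path = Path('data/histopathology')   # hypothetical data location
    tfms = get_transforms(flip_vert=True, max_rotate=10., max_lighting=0.1)
    data = (ImageDataBunch.from_folder(path, valid_pct=0.25, ds_tfms=tfms, size=96)
            .normalize(imagenet_stats))

    learn = create_cnn(data, models.resnet34, metrics=accuracy)
    learn.fit_one_cycle(3)                             # epochs before unfreezing
    learn.unfreeze()                                   # unfreeze the body
    learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))   # small, discriminative lrs afterwards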

The difficulty is that the decrease in loss comes to a standstill, or fluctuates around the same level, after about 3 epochs anyway. The ‘tweaks’ feel like nuances that barely affect the outcome, so I’m not sure how much influence I really have on the validation loss by ‘playing’ with the inputs.

The experiments left me with a couple of ‘thoughts’:

  • is augmentation of little use because the training set is already large enough to offer sufficiently diverse input?
  • what is the best moment to unfreeze the more basic layers?
  • how do sample size and number of epochs relate?
  • should I consider learning rate smaller than 1e-06?
  • why is the learning-rate graph ‘flattening out’ towards the left (1e-06) every time after 2 epochs? Does it suggest that only small improvements are to be expected?
  • is there some sort of pipeline procedure to try different augmentations on a subset of the data? I’ve noticed that extreme augmentations don’t work, but I’m not sure about the rest.

The metrics over the first two epochs:

The learning rate after two epochs:

Last question: how do I interpret the following picture of the third epoch? It seems like the learning process takes a detour into a useless area, wasting many batches of input.

Anyway, I’m trying to optimize the use of these parameters and to understand them.

Currently I’m in the top 7% of the Kaggle competition, but I feel like I’m not improving despite using more data, different learning rates and data augmentation. I’ve run into the pics above a couple of times now. The approach behind my best score felt kind of generic (or maybe it’s just the result of a simple, efficient and underappreciated fastai library :slight_smile: )


It seems you can train for longer; the accuracy is still improving. Try one long cycle instead of multiple small cycles. In the last picture the loss increases at the beginning because the learning rate is increasing (since you are using one_cycle), so maybe you can use a smaller learning rate. After unfreezing, try slice(1e-06, 1e-05) instead of slice(1e-07, 1e-05), since the dataset is not at all close to ImageNet. It may or may not improve; it’s always good to try several options and see what works best.
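
In code, assuming a learner already set up as in the opening post, that would look roughly like this (the epoch counts and max_lr are just illustrations):

    learn.fit_one_cycle(8, max_lr=1e-3)                 # one longer frozen cycle instead of several short ones
    learn.unfreeze()
    learn.fit_one_cycle(4, max_lr=slice(1e-6, 1e-5))    # raise the lower end from 1e-7 to 1e-6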

Since the dataset is large, you can run experiments on a smaller sample to find out which transformations work best and so on, then move to the full dataset.
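
One way to do that, assuming the labels come in a CSV as on Kaggle (the file name, folder and suffix below are hypothetical), is to sample the DataFrame with pandas before building the DataBunch:

    import pandas as pd
    from fastai.vision import *

    path = Path('data/histopathology')                 # hypothetical data location
    df = pd.read_csv(path/'train_labels.csv')          # hypothetical labels file: filename + label
    small_df = df.sample(frac=0.1, random_state=42)    # 10% subset for quick augmentation experiments

    data_small = ImageDataBunch.from_df(path, small_df, folder='train', suffix='.tif',
                                        ds_tfms=get_transforms(flip_vert=True),
                                        valid_pct=0.2, size=96)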

I hope this helps :slight_smile:


Thanks. How do I increase the cycle length? The ‘length’ of an epoch itself seems completely determined by the size of my dataset. And the max learning rate influences within which margins the weights are updated, i.e. how bouncy the learning trajectory is?
Would learn.fit() instead of learn.fit_one_cycle() allow me to influence the cycle length better?

If a training set is dissimilar to ImageNet, shouldn’t I use larger learning rates? Because there is a bigger chance of larger differences from ImageNet, which would benefit from a different neural-net ‘body’?

Another question: does it make sense to freeze() again after unfreezing and training the deeper layers?

Maybe these questions are beside the point. I only know the terms body and head, so that’s a start :slight_smile:

EDIT: I just noticed that Jeremy addresses this topic in Lesson 3 as well.

I mean increasing the number of epochs, like learn.fit_one_cycle(10). It’s always just one cycle as long as you are using fit_one_cycle. The learning rate can be set based on lr_find. For example, if based on lr_find I choose a learning rate of 1e-3, I would use (after unfreeze) slice(1e-3/10, 1e-3) if the dataset is not close to ImageNet and slice(1e-3/100, 1e-3) otherwise (or try something in between).
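
A sketch of that recipe in fastai v1, where 1e-3 is just the value read off a hypothetical lr_find plot:

    learn.lr_find()
    learn.recorder.plot()                  # pick a rate on the downward slope, e.g. 1e-3

    learn.fit_one_cycle(10, max_lr=1e-3)   # one longer cycle while frozen
    learn.unfreeze()
    learn.fit_one_cycle(4, max_lr=slice(1e-3/10, 1e-3))    # dataset far from ImageNet
    # learn.fit_one_cycle(4, max_lr=slice(1e-3/100, 1e-3)) # dataset close to ImageNet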

The idea of freezing the body is that the head is initially randomly initialized. I don’t think freezing and unfreezing multiple times would help, unless you are changing the data, e.g. increasing the image size. For example, you could start with a 64x64 image size, train, unfreeze, train, then change to 128x128, freeze, train, unfreeze, train, change to 256x256, and so on.
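
Here is a rough sketch of that progressive-resizing loop (fastai v1, data-block style as in lesson 3); `src` stands for whatever ItemList/label pipeline the data was built with, and all sizes and epoch counts are illustrative:

    from fastai.vision import *

    def bunch(size):
        # hypothetical helper: rebuild the DataBunch at the given image size
        return (src.transform(get_transforms(flip_vert=True), size=size)
                   .databunch().normalize(imagenet_stats))

    learn = create_cnn(bunch(64), models.resnet34, metrics=accuracy)
    learn.fit_one_cycle(4)
    learn.unfreeze()
    learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))

    for size in (128, 256):
        learn.data = bunch(size)    # swap in larger images, keep the trained weights
        learn.freeze()
        learn.fit_one_cycle(2)
        learn.unfreeze()
        learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))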


I experimented a bit with your examples and Jeremy’s advice from lesson 3.

With the bigger dataset (100%) I find it difficult to hit this ‘sweet spot’ where the loss initially goes up with the increasing learning rate but then slowly comes back down, like a parabola.

After 2 epochs on 100% data it stays (almost) flat:

Should I use a higher max_lr to jump out of this spot?
Or should I just pass a single value instead of a slice?

What does the lr_find graph look like after that?

I don’t have a screenshot, but it usually looks similar to the one in the opening post. Sometimes it’s a bit more spiky, with a small slope from 1e-06/1e-05 to 1e-04. I’ll run it again this afternoon. Mostly it seems flat. I usually compare the lr_find() results before and after unfreezing as well.

You can try to reduce regularization, i.e. reduce weight decay and dropout. To remove the weight decay you can call fit_one_cycle(1, slice(1e-8, 1e-5/2), wd=0). I don’t remember the default value; I believe it was 1e-2.
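
Assuming the same create_cnn setup as before, that would look something like this; `ps` is the dropout probability in the head (default 0.5) and `wd` the weight decay (default 1e-2), so the values below are just lowered examples:

    # lower the dropout in the head when creating the learner
    learn = create_cnn(data, models.resnet34, metrics=accuracy, ps=0.25)

    # train with weight decay switched off
    learn.fit_one_cycle(1, max_lr=slice(1e-8, 1e-5/2), wd=0.)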

Okay, interesting. I’ll try it and see whether it ‘works’.
Thanks for the ideas.

Hi all, I think I am running into the same issue (on the same dataset, I presume? PCam from Kaggle?). I run lr_find after unfreezing. I too was getting a nearly flat learning-rate graph if I trained for only 4 epochs before unfreezing. But if I train for 8+ epochs, unfreeze, and then run lr_find, my graph comes out looking like the one below, which is what I was expecting:


So here I can pick an lr slice fairly easily I think.
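
For reference, a sketch of that procedure in fastai v1 (8 epochs as mentioned above; `data` and everything else are placeholder assumptions):

    learn = create_cnn(data, models.resnet34, metrics=accuracy)
    learn.fit_one_cycle(8)      # train the head longer while the body is frozen

    learn.unfreeze()
    learn.lr_find()
    learn.recorder.plot()       # with the extra frozen epochs the curve shows a clearer slope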

From my experiments I’m coming away with a lot of questions though:

  1. I still don’t understand how the number of epochs affects this.
  2. I also think I read somewhere that using a larger target dataset in a transfer-learning setup means that the effect of unfreezing is less influential? Is this a possible reason why I get better results running more epochs while the network is frozen rather than fewer? Do I need to unfreeze at all?
  3. The dataset is composed of pathological scans, which are of a different nature from ImageNet images. I assume this means that the earlier layers of the ImageNet-pretrained model are going to be of more use than the later layers. That makes sense to me, but I am still not totally clear on which aspects of my training I would need to optimise, and how.

Hey,

I don’t have answers to all of your questions.

Regarding the first point: the learning rate increases and then decreases as the schedule runs through all the batches of the cycle. Those batches can span one epoch or multiple epochs (which means the same data is evaluated multiple times). The learning rate rises to the max learning rate and then slowly decreases again, no matter the number of epochs; with more epochs the same schedule is basically spread out over more passes over the data. You can see this with the following callback:

# ShowGraph plots the training/validation losses live while the model trains
learn = create_cnn(data, models.resnet34, metrics=error_rate, callback_fns=ShowGraph)

Jeremy suggested in the course to look for a little bump in the loss, because it helps to get out of local minima / reach more stable solutions. With trial and error you can see how the different parameters influence the loss while looking at the graphs!
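
Besides ShowGraph, the recorder can also plot the schedule and the losses directly after a fit_one_cycle run:

    learn.fit_one_cycle(2)
    learn.recorder.plot_lr()        # learning rate rising to max_lr and annealing back down
    learn.recorder.plot_losses()    # training/validation loss over the same batches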

Here is an example: