Running fit_one_cycle(..) more than once

ulat · June 14, 2019, 7:32am

Is there a difference between running fit_one_cycle(5, lr) twice and running fit_one_cycle(10, lr) once?

NathanHub · June 14, 2019, 8:11am

As the name suggests, fit_one_cycle runs one cycle: a phase in which the learning rate increases, followed by a phase in which it decreases. Running the function twice therefore gives you two cycles. That resembles Stochastic Gradient Descent with Warm Restarts (SGDR), the technique used in fastai in previous years. A single cycle appears to give better results, though, and is now the go-to approach for training a network.

There is a page about SGDR in the docs (https://docs.fast.ai/callbacks.general_sched.html). You could try it and see whether you get any difference.
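To make the "one cycle versus two cycles" point concrete, here is a minimal sketch of a one-cycle-style schedule in plain Python. It is not fastai's implementation: the helper names `cos_interp` and `one_cycle_lrs`, the cosine interpolation, and the values `pct_start=0.3`, `div=25` and `final_div=1e4` are assumptions chosen for illustration only.

```python
import math

def cos_interp(start, end, pct):
    """Cosine interpolation: returns start at pct=0 and end at pct=1."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

def one_cycle_lrs(n_epochs, steps_per_epoch, lr_max,
                  pct_start=0.3, div=25.0, final_div=1e4):
    """Per-step learning rates for one warm-up/anneal cycle (illustrative)."""
    total = n_epochs * steps_per_epoch
    warm = max(1, round(total * pct_start))  # steps spent warming up
    lrs = []
    for step in range(total):
        if step < warm:
            lrs.append(cos_interp(lr_max / div, lr_max, step / warm))
        else:
            pct = (step - warm) / (total - warm)
            lrs.append(cos_interp(lr_max, lr_max / final_div, pct))
    return lrs

LR_MAX, STEPS = 1e-3, 100
single = one_cycle_lrs(10, STEPS, LR_MAX)         # like fit_one_cycle(10, lr)
two_calls = one_cycle_lrs(5, STEPS, LR_MAX) * 2   # like fit_one_cycle(5, lr) twice

# At the start of epoch 6 the second call has dropped back to lr_max / div,
# while the single cycle is still annealing down from its peak.
print(f"single:    {single[5 * STEPS]:.2e}")
print(f"two calls: {two_calls[5 * STEPS]:.2e}")
```

Both schedules cover the same number of steps, but the two-call version warms up and anneals twice, so the model sees a second jump to a high learning rate halfway through training.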

jbo · June 21, 2019, 9:03am

Hi Nathan,
Suppose I train for three epochs: learn.fit_one_cycle(3, max_lr=slice(1e-5, 1e-4)).

After training I see that train loss > valid loss, and I want to train further after those three epochs. Do I have to save the earlier model and train again, or will calling learn.fit_one_cycle again pick up where the previous call left off?

Thanks

NathanHub · June 21, 2019, 9:54am

Hi,

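If you call learn.fit_one_cycle several times, training does not start over on each call: the model keeps its weights and continues from where the previous call stopped. Only the learning-rate schedule begins a new cycle.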

Hope it helps
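The behaviour can be pictured with a toy example in plain Python. `ToyLearner` below is a hypothetical stand-in, not a fastai class: it has a single weight, a quadratic loss, and a simple triangular schedule instead of fastai's real one. It only illustrates that the weight carries over between calls while the schedule starts again.

```python
class ToyLearner:
    """Hypothetical stand-in for a learner: one weight, loss = (w - 3)^2."""

    def __init__(self):
        self.w = 0.0
        self.lr_history = []

    def fit_one_cycle(self, n_steps, lr_max):
        for step in range(n_steps):
            # Triangular schedule: ramps up to lr_max, then back down to 0.
            pct = step / max(1, n_steps - 1)
            lr = lr_max * (1 - abs(2 * pct - 1))
            grad = 2 * (self.w - 3.0)   # derivative of (w - 3)^2
            self.w -= lr * grad         # plain gradient step
            self.lr_history.append(lr)

learner = ToyLearner()
learner.fit_one_cycle(20, lr_max=0.1)
w_after_first = learner.w       # weight reached by the first call
learner.fit_one_cycle(20, lr_max=0.1)
w_after_second = learner.w      # second call started from w_after_first

print(w_after_first, w_after_second)
```

The second call moves the weight closer to the optimum at 3 rather than resetting it to 0, and `lr_history` shows the learning rate ramping up a second time.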

jbo · June 26, 2019, 8:46am

Hi,
If I call unfreeze before learn.fit_one_cycle, will it still continue learning from where it left off? Or, since unfreeze makes the model train the initial layers as well, does it discard the previous learning and start fresh?

NathanHub · June 26, 2019, 8:54am

Yes, it continues from where it left off. The only thing that changes when you unfreeze a model is that the earlier layers are now updated too, which lets the model fit your dataset more closely and often leads to better performance.

Pak · June 27, 2019, 10:28am

There IS a difference between running fit_one_cycle(5, lr) twice and running fit_one_cycle(10, lr) once.
In my case, running fit_one_cycle(10, lr) once helped more. There is also a way to continue the same one-cycle schedule (for example, when you have to reboot your PC or want to train over several nights in a row), described here (you set cyc_len to the number of epochs in the current session, start_epoch to the epoch from which you restart, and tot_epochs to the total number of epochs you want to train across all sessions).
I’ve tried it and it worked very well for me (except for some visual bugs).
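The idea of resuming one schedule across sessions can be sketched as slicing a single long schedule. The code below is an illustration under stated assumptions, not fastai's code: `full_schedule` and `session_lrs` are hypothetical helpers, the linear warm-up and decay are a simplification, and how cyc_len, start_epoch and tot_epochs interact in fastai itself should be checked against the installed version.

```python
def full_schedule(tot_epochs, steps_per_epoch, lr_max, pct_start=0.3, div=25.0):
    """Linear warm-up then linear decay over the whole run (illustrative)."""
    total = tot_epochs * steps_per_epoch
    warm = max(1, round(total * pct_start))
    lr_min = lr_max / div
    lrs = []
    for step in range(total):
        if step < warm:
            lrs.append(lr_min + (lr_max - lr_min) * step / warm)
        else:
            frac = (step - warm) / (total - warm)
            lrs.append(lr_max - (lr_max - lr_min) * frac)
    return lrs

def session_lrs(n_epochs, start_epoch, tot_epochs, steps_per_epoch, lr_max):
    """Learning rates for one session that resumes at start_epoch."""
    lrs = full_schedule(tot_epochs, steps_per_epoch, lr_max)
    first = start_epoch * steps_per_epoch
    return lrs[first:first + n_epochs * steps_per_epoch]

# Three one-epoch sessions stitched together...
resumed = []
for start in range(3):
    resumed += session_lrs(1, start, 3, steps_per_epoch=10, lr_max=1e-2)

# ...follow the same learning rates as one uninterrupted three-epoch run.
uninterrupted = full_schedule(3, 10, 1e-2)
print(resumed == uninterrupted)
```

In a real run the model weights, and ideally the optimizer state, would also have to be saved at the end of each session and loaded at the start of the next, for example with learn.save and learn.load.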

davidpfahler · October 14, 2019, 6:33am

Restarting does not seem to work for me. To test this, I first ran learn.fit_one_cycle(3, max_lr=1e-2) and then, for comparison:

learn.fit_one_cycle(1, max_lr=1e-2, tot_epochs=3)
learn.fit_one_cycle(1, max_lr=1e-2, tot_epochs=3, start_epoch=1)
learn.fit_one_cycle(1, max_lr=1e-2, tot_epochs=3, start_epoch=2)

which does not train but produces the following warning instead:

/usr/local/lib/python3.6/dist-packages/fastprogress/fastprogress.py:102: UserWarning: Your generator is empty.
  warn("Your generator is empty.")

Any help would be much appreciated.

asoellinger · December 28, 2020, 6:33pm

I have an elaboration of this same question here:

I am using fit_one_cycle on my Inception learner. That has the following architecture:

InceptionTime(
  (inceptionblock): InceptionBlock(
    (inception): ModuleList(
      (0): InceptionModule(
        (convs): ModuleList(
          (0): Conv1d(1, 32, kernel_size=(39,), stride=(1,), padding=(19,), bias=False)
          (1): Conv1d(1, 32, kernel_size=(19,), stride=(1,), padding=(9,), bias=False)
          (2): Conv1d(1, 32, kernel_size=(9,), stride=(1,), padding=(4,), bias=False)
        )
        (maxconvpool): Sequential(
          (0): MaxPool1d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
          (1): Conv1d(1, 32, kernel_size=(1,), stride=(1,), bias=False)
        )
        (concat): Concat(dim=1)
        (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act): ReLU()
      )
      (1): InceptionModule(
        (bottleneck): Conv1d(128, 32, kernel_size=(1,), stride=(1,), bias=False)
        (convs): ModuleList(
          (0): Conv1d(32, 32, kernel_size=(39,), stride=(1,), padding=(19,), bias=False)
          (1): Conv1d(32, 32, kernel_size=(19,), stride=(1,), padding=(9,), bias=False)
          (2): Conv1d(32, 32, kernel_size=(9,), stride=(1,), padding=(4,), bias=False)
        )
        (maxconvpool): Sequential(
          (0): MaxPool1d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
          (1): Conv1d(128, 32, kernel_size=(1,), stride=(1,), bias=False)
        )
        (concat): Concat(dim=1)
        (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act): ReLU()
      )
      (2): InceptionModule(
        (bottleneck): Conv1d(128, 32, kernel_size=(1,), stride=(1,), bias=False)
        (convs): ModuleList(
          (0): Conv1d(32, 32, kernel_size=(39,), stride=(1,), padding=(19,), bias=False)
          (1): Conv1d(32, 32, kernel_size=(19,), stride=(1,), padding=(9,), bias=False)
          (2): Conv1d(32, 32, kernel_size=(9,), stride=(1,), padding=(4,), bias=False)
        )
        (maxconvpool): Sequential(
          (0): MaxPool1d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
          (1): Conv1d(128, 32, kernel_size=(1,), stride=(1,), bias=False)
        )
        (concat): Concat(dim=1)
        (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act): ReLU()
      )
      (3): InceptionModule(
        (bottleneck): Conv1d(128, 32, kernel_size=(1,), stride=(1,), bias=False)
        (convs): ModuleList(
          (0): Conv1d(32, 32, kernel_size=(39,), stride=(1,), padding=(19,), bias=False)
          (1): Conv1d(32, 32, kernel_size=(19,), stride=(1,), padding=(9,), bias=False)
          (2): Conv1d(32, 32, kernel_size=(9,), stride=(1,), padding=(4,), bias=False)
        )
        (maxconvpool): Sequential(
          (0): MaxPool1d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
          (1): Conv1d(128, 32, kernel_size=(1,), stride=(1,), bias=False)
        )
        (concat): Concat(dim=1)
        (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act): ReLU()
      )
      (4): InceptionModule(
        (bottleneck): Conv1d(128, 32, kernel_size=(1,), stride=(1,), bias=False)
        (convs): ModuleList(
          (0): Conv1d(32, 32, kernel_size=(39,), stride=(1,), padding=(19,), bias=False)
          (1): Conv1d(32, 32, kernel_size=(19,), stride=(1,), padding=(9,), bias=False)
          (2): Conv1d(32, 32, kernel_size=(9,), stride=(1,), padding=(4,), bias=False)
        )
        (maxconvpool): Sequential(
          (0): MaxPool1d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
          (1): Conv1d(128, 32, kernel_size=(1,), stride=(1,), bias=False)
        )
        (concat): Concat(dim=1)
        (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act): ReLU()
      )
      (5): InceptionModule(
        (bottleneck): Conv1d(128, 32, kernel_size=(1,), stride=(1,), bias=False)
        (convs): ModuleList(
          (0): Conv1d(32, 32, kernel_size=(39,), stride=(1,), padding=(19,), bias=False)
          (1): Conv1d(32, 32, kernel_size=(19,), stride=(1,), padding=(9,), bias=False)
          (2): Conv1d(32, 32, kernel_size=(9,), stride=(1,), padding=(4,), bias=False)
        )
        (maxconvpool): Sequential(
          (0): MaxPool1d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
          (1): Conv1d(128, 32, kernel_size=(1,), stride=(1,), bias=False)
        )
        (concat): Concat(dim=1)
        (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act): ReLU()
      )
    )
    (shortcut): ModuleList(
      (0): ConvBlock(
        (0): Conv1d(1, 128, kernel_size=(1,), stride=(1,), bias=False)
        (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (add): Add
    (act): ReLU()
  )
  (gap): GAP1d(
    (gap): AdaptiveAvgPool1d(output_size=1)
    (flatten): Flatten(full=False)
  )
  (fc): Linear(in_features=128, out_features=1, bias=True)
)

I have been doing a manual exploration of different training strategies. First, I create the learner and save the initial weights like this:

model = InceptionTime(dls.vars, dls.c)
learn = Learner(dls, model, metrics=[mae, mse], cbs=WandbCallback())
learn.save('0epochs')

Then I run:

# 50 epochs
learn = learn.load('0epochs')
learn.fit_one_cycle(50, lr_max=1e-4, div=50.,)
learn.lr_find(stop_div=False)

In the 50th epoch, I get a training error of 15855.080078 and an MSE of 16197.121094.

Then I do the exact same thing again.

# 50 epochs
learn = learn.load('0epochs')
learn.fit_one_cycle(50, lr_max=1e-4, div=50.,)
learn.lr_find(stop_div=False)

In the 50th epoch I get a training error of 15845.706055 and an MSE of 16162.488281.

These metrics are different. Additionally, I am finding that there is also a bit of difference in the lr_find charts. Not much, but it’s there.

I am also trying to run fit_one_cycle multiple times with the same parameters to get 50 epochs, for example:

learn.fit_one_cycle(25, lr_max=1e-4, div=50.,)
learn.fit_one_cycle(25, lr_max=1e-4, div=50.,)

In the 50th epoch I get a training error of 15855.791992 and an MSE of 16139.482422.

My question is:

What is causing the differences here, given that I am using the same initial weights for each approach? And, as more of an implementation question: is there a difference between calling fit_one_cycle 5 times with 10 epochs and calling it once with 50 epochs?

Pomo · January 1, 2021, 3:00am

Hi Aaron,

The content of minibatches will differ between runs because they are drawn randomly from the dataset. That would account for the small discrepancy of loss and error.
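The effect of random minibatch order can be shown without any deep learning library. In this sketch `epoch_batches` is a hypothetical helper that shuffles item indices the way a shuffling data loader would; it is not fastai's DataLoader.

```python
import random

def epoch_batches(n_items, batch_size, rng):
    """Shuffle item indices and cut them into minibatches."""
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    return [idxs[i:i + batch_size] for i in range(0, n_items, batch_size)]

# Same seed: both runs see identical minibatches.
run_a = epoch_batches(12, 4, random.Random(42))
run_b = epoch_batches(12, 4, random.Random(42))

# Different seed: the items are the same, but their grouping generally differs.
run_c = epoch_batches(12, 4, random.Random(7))

print(run_a)
print(run_c)
```

In an actual training run, fixing the seed would have to cover every source of randomness (Python, NumPy, PyTorch via torch.manual_seed, and the data loader's shuffling), and some GPU operations can remain non-deterministic even then, so small differences between runs are not unusual.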

As far as I recall, fit_one_cycle applies the learning rate scheduler one time across the number of epochs specified. So yes there is a difference between one cycle at fifty epochs and five cycles at ten epochs.

Good observation and questions!

HTH,