Will check that and report back, but working through the CV notebook, here is what I found.
dls[0], dls[1] and dls[2] are all transformed DataLoaders, so accessing them that way seems to work.
Accessing them through subset only works for dls.subset(0).
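To make that concrete, a tiny sketch of what I tried (assuming a DataLoaders built with three splits):

dls[0], dls[1], dls[2]   # all three return transformed DataLoaders
dls.subset(0)            # works
# dls.subset(1), dls.subset(2)  # these didn't seem to work for me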
Also, when we pass ds_idx=2 to learn.validate and dl is None, it actually does the following internally:
Signature: learn.validate(ds_idx=1, dl=None, cbs=None)
Source:
def validate(self, ds_idx=1, dl=None, cbs=None):
    if dl is None: dl = self.dls[ds_idx]
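So passing ds_idx=2 just indexes into dls. A minimal sketch of the two equivalent calls (assuming your DataLoaders was built with a third split at index 2):

learn.validate(ds_idx=2)          # validate() looks up self.dls[2] itself
learn.validate(dl=learn.dls[2])   # same thing, passing the DataLoader explicitly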
I have some questions about unet_learner. As we saw in the lesson, you choose an architecture to be your encoder and then fastai automatically generates the decoder.
Does only the encoder have pretrained weights? If so, why is that? Is it possible to have both the encoder and decoder pretrained?
Does the decoder change depending on your encoder arch choice? Is a decoder with a resnet101 encoder going to have more layers than one with a resnet34 encoder?
When the learner is frozen, what are we training, the entire decoder I guess?
When unfrozen and we use differential learning rates, does the entire decoder get assigned the same lr?
While Zach replies to that, here is a good read: https://towardsdatascience.com/https-medium-com-reina-wang-tw-stochastic-gradient-descent-with-restarts-5f511975163 . SGDR uses cosine annealing, which decreases the learning rate along half a cosine curve. This works well because we can start with relatively high learning rates for several iterations at the beginning to quickly approach a local minimum, then gradually decrease the learning rate as we get closer to it, ending with several small-learning-rate iterations.
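For reference, a minimal sketch of that schedule (lr_max, lr_min and n_iters are just placeholder names for this sketch, not fastai API):

import math

# Cosine annealing: the lr falls from lr_max to lr_min along half a cosine curve.
def cosine_annealed_lr(t, n_iters, lr_max=1e-2, lr_min=1e-5):
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / n_iters))

lrs = [cosine_annealed_lr(t, 100) for t in range(100)]  # one cycle: high -> low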
At this point you briefly talk about gradient accumulation and I got a little bit confused.
When I say .dataloaders(bs=1) it means that I'll be passing one image at a time to my model, but how many times is the gradient accumulated before it's used to update the weights? What parameter controls that?
Hi @igvaz, great questions. Got me thinking too.
1) Currently only the encoder has pretrained weights; I guess the reason is simply that those are the weights available online. Pretrained weights are helpful for fine-tuning, which works well for similar tasks, so a pretrained decoder would also make sense if there were near-identical tasks to transfer from. But in theory I don't see why both can't be pretrained. (This is what I understood.) Progressive resizing is kind of using pretrained weights, isn't it?
2) The decoder doesn't change with the encoder: you ultimately need to upsample back to the original image size.
3)
4)
I'll chime in here a moment and answer partially (and get to the rest eventually). We are transfer learning, hence the pretrained backbone. You could also assume a somewhat pretrained front end, since we keep training after the size increase. Yes, we chose an R34 because unet has special cuts to use for them. If we look at unet_learner we see:
@delegates(Learner.__init__)
def unet_learner(dls, arch, loss_func=None, pretrained=True, cut=None, splitter=None, config=None, n_in=3, n_out=None,
                 normalize=True, **kwargs):
    "Build a unet learner from `dls` and `arch`"
    if config is None: config = unet_config()
    meta = model_meta.get(arch, _default_meta)
    body = create_body(arch, n_in, pretrained, ifnone(cut, meta['cut']))
    size = dls.one_batch()[0].shape[-2:]
    if n_out is None: n_out = get_c(dls)
    assert n_out, "`n_out` is not defined, and could not be infered from data, set `dls.c` or pass `n_out`"
    if normalize: _add_norm(dls, meta, pretrained)
    model = models.unet.DynamicUnet(body, n_out, size, **config)
    learn = Learner(dls, model, loss_func=loss_func, splitter=ifnone(splitter, meta['split']), **kwargs)
    if pretrained: learn.freeze()
    return learn
So if we can do a create_body on any model, we can use it here (create_body makes an encoder)
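For example (just a sketch; resnet50 here is only an example backbone, and dls is whatever segmentation DataLoaders you built elsewhere):

from fastai.vision.all import *

# create_body cuts the classifier head off the arch, leaving just the encoder.
body  = create_body(resnet50, n_in=3, pretrained=True)
# unet_learner does the same cut internally and wraps it in a DynamicUnet.
learn = unet_learner(dls, resnet50)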
@bwarner just to check if what I'm thinking is correct:
We are using BCEWithLogitsLoss, which is basically BCELoss(sigmoid(raw_scores)).
So the raw predictions can be in any range; after applying sigmoid they are squeezed into the [0, 1] range, and we use that along with the target (which is just a bunch of 0's and 1's) to calculate our BCE loss.
After we finish training, at inference/test time we get the raw_scores, apply the sigmoid activation to squeeze them into the 0 to 1 range, and then apply the threshold: if sigmoid(raw_scores) > threshold, that class is present.
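In plain PyTorch I mean something like this (0.5 is just an arbitrary threshold for the sketch):

import torch

raw_scores = torch.tensor([[2.3, -1.1, 0.4]])   # model outputs (logits), any range
probs      = torch.sigmoid(raw_scores)          # squeezed into [0, 1]
preds      = probs > 0.5                        # a class is "present" if its prob clears the threshold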
So the threshold is only used for predictions and for showing results; does that mean we choose our threshold based on how our test data is performing? (That doesn't sound right.)
Still confused about how to choose the threshold. @muellerzr could you please shed some light on this?
Assuming a total dataset size of 32 and a batch size of 2: the gradients are zeroed and the weights updated every 2 images, so in one pass over our data that happens 16 times in total.
But that is not really what I want to achieve with gradient accumulation.
Correct me if I'm wrong, but the idea is to not apply the gradients every 2 images; instead we calculate the gradients 2 images at a time but only apply them after N steps. In this way we can have a batch size of 1 that "feels" like a batch size of N (32 for example). This can have several advantages.
for i, data in enumerate(trainloader, 0):
    # get the inputs; data is a list of [inputs, labels]
    inputs, labels = data

    # zero the parameter gradients
    optimizer.zero_grad()

    # forward + backward + optimize
    outputs = net(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
I think what @Igvaz is asking (and what was asked in the video) is that you usually don't want to train with bs=1 (it is unstable, batchnorm doesn't work, etc.). So if there were no memory constraints and you could train with bs=8, that should be equivalent to training with bs=1 while accumulating the gradients for 8 images (zeroing the grads only after those 8 images have been processed one by one).
In the code you shared, are we updating the gradients after each image (since bs=1), or is fastai accumulating the gradients for a certain number of images?
Something like this:
model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
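If I remember right, fastai v2 also has a GradientAccumulation callback that does this for you; a hedged sketch (worth double-checking the exact argument name in the docs):

from fastai.vision.all import *

# GradientAccumulation should make bs=1 behave roughly like bs=n_acc by
# delaying the optimizer step until n_acc samples have been seen.
learn = unet_learner(dls, resnet34, cbs=GradientAccumulation(n_acc=32))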