A walk with fastai2 - Vision - Study Group and Online Lectures Megathread

Srinivas · February 6, 2020, 8:12pm

Will check that and report back, but working through the CV notebook, here is what I found.
dls[0], dls[1] and dls[2] are all Transformed Data Loaders. So accessing them that way seems to work.
Accessing them through subset only works for dls.subset(0).

Also, when we pass ds_idx=2 to learn.validate, this is what it does internally when dl is None:

Signature: learn.validate(ds_idx=1, dl=None, cbs=None)
Source:
def validate(self, ds_idx=1, dl=None, cbs=None):
    if dl is None: dl = self.dls[ds_idx]

So indexing with dls[0], dls[1] or dls[2] works.

barnacl · February 6, 2020, 8:14pm

!git clone https://github.com/fastai/fastai2
%cd fastai2/
!pip install -e .[dev]
!git clone https://github.com/fastai/fastcore
%cd fastcore/
!pip install -e .[dev]
import os
os._exit(0)
@Srinivas

Srinivas · February 6, 2020, 8:15pm

Thanks!

lgvaz · February 6, 2020, 10:30pm

I have some questions about unet_learner, as we saw in the lesson you choose an architecture to be your encoder and then fastai automatically generates the decoder.

1) Does only the encoder have pretrained weights? If so, why is that? Is it possible to have both the encoder and decoder pretrained?
2) Does the decoder change depending on your encoder arch choice? Is a decoder with a resnet101 encoder going to have more layers than a decoder with resnet34?
3) When the learner is frozen, what are we training? The entire decoder, I guess?
4) When it is unfrozen and we use differential learning rates, does the entire decoder get assigned the same lr?

foobar8675 · February 6, 2020, 11:06pm

@muellerzr is there a definition online for cosine annealing you can point me to?

barnacl · February 6, 2020, 11:11pm

while zach replies to that - https://towardsdatascience.com/https-medium-com-reina-wang-tw-stochastic-gradient-descent-with-restarts-5f511975163
SGDR uses cosine annealing, which decreases learning rate in the form of half a cosine curve. This is a good method because we can start out with relatively high learning rates for several iterations in the beginning to quickly approach a local minimum, then gradually decrease the learning rate as we get closer to the minimum, ending with several small learning rate iterations.
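To make the "half a cosine curve" concrete, here is a minimal sketch of a cosine-annealed learning rate. The function name `cosine_anneal` and the values of `lr_max`, `lr_min` and `n_steps` are made up for illustration; this is not fastai's scheduler code, just the formula the quote describes.

```python
import math

def cosine_anneal(lr_max, lr_min, pct):
    """Learning rate after a fraction `pct` (0.0 to 1.0) of the schedule."""
    # cos goes from 1 to -1 as pct goes from 0 to 1, so the factor goes from 1 to 0
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * pct)) / 2

# Hypothetical values, chosen only to show the shape of the curve
lr_max, lr_min, n_steps = 1e-2, 1e-5, 5
schedule = [cosine_anneal(lr_max, lr_min, i / (n_steps - 1)) for i in range(n_steps)]
for i, lr in enumerate(schedule):
    print(f"step {i}: lr = {lr:.6f}")
```

The rate starts at `lr_max`, passes through the midpoint of the two rates halfway through, and flattens out as it approaches `lr_min`.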

lgvaz · February 6, 2020, 11:13pm

At this point you briefly talk about gradient accumulation and I got a little bit confused.

When I say .dataloaders(bs=1) it means that I’ll be passing one image at a time to my model, but how many times is the gradient accumulated before it’s used to update the weights? What parameter controls that?

barnacl · February 7, 2020, 12:47am

Hi @lgvaz, great questions. Got me thinking too.
1) Currently we only have the encoder with pretrained weights; I guess the only reason is that those are the ones available online. Pretrained weights are helpful for fine-tuning, which works well for similar tasks, so I guess a pretrained decoder would make sense if there are identical tasks to solve? In theory I don’t see why both can’t be pretrained (this is what I understood). Progressive resizing is kind of using pretrained weights too?
2) Decoder doesn’t change with the encoder. You finally need to upsample and reach the original image size.
3)
4)

lgvaz · February 7, 2020, 2:44am

I also think this is the only reason. We can try some experiments to see how it goes.

muellerzr · February 7, 2020, 2:47am

I’ll chime in here for a moment and answer partially (and get to the rest eventually).
1) We are transfer learning, hence the pretrained backbone. You could then also consider the front end pretrained, since we keep training it after the size increase.
2) Yes, we chose a ResNet34 because unet has special cuts to use for them. If we look at unet_learner we see:

@delegates(Learner.__init__)
def unet_learner(dls, arch, loss_func=None, pretrained=True, cut=None, splitter=None, config=None, n_in=3, n_out=None,
                 normalize=True, **kwargs):
    "Build a unet learner from `dls` and `arch`"
    if config is None: config = unet_config()
    meta = model_meta.get(arch, _default_meta)
    body = create_body(arch, n_in, pretrained, ifnone(cut, meta['cut']))
    size = dls.one_batch()[0].shape[-2:]
    if n_out is None: n_out = get_c(dls)
    assert n_out, "`n_out` is not defined, and could not be infered from data, set `dls.c` or pass `n_out`"
    if normalize: _add_norm(dls, meta, pretrained)
    model = models.unet.DynamicUnet(body, n_out, size, **config)
    learn = Learner(dls, model, loss_func=loss_func, splitter=ifnone(splitter, meta['split']), **kwargs)
    if pretrained: learn.freeze()
    return learn

So if we can do a create_body on any model, we can use it here (create_body makes an encoder)

muellerzr · February 7, 2020, 3:01am

3) Yes, the decoder.
4) The decoder does get the same lr unless we pass in a slice.

@foobar8675 exactly what @barnacl said. Ours starts annealing at that 72% threshold (75% is the default), which is 72% of the total batches, not epochs!
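A rough sketch of what "72% of the total batches, not epochs" means for a flat-then-cosine schedule. The helper `flat_then_cosine`, the epoch and batch counts, and the final rate of 0 are assumptions for illustration only; fastai's own scheduler may differ in its details.

```python
import math

def flat_then_cosine(step, total_steps, lr, pct_start=0.72, lr_final=0.0):
    """Hold `lr` for the first `pct_start` of all batches, then cosine-anneal to `lr_final`."""
    flat_steps = round(total_steps * pct_start)
    if step < flat_steps:
        return lr
    # Fraction of the annealing phase completed after this batch (reaches 1.0 on the last batch)
    pct = (step - flat_steps + 1) / (total_steps - flat_steps)
    return lr_final + (lr - lr_final) * (1 + math.cos(math.pi * pct)) / 2

# Hypothetical run: 10 epochs of 100 batches each, so 1000 batches in total
n_epochs, batches_per_epoch = 10, 100
total = n_epochs * batches_per_epoch
lrs = [flat_then_cosine(s, total, lr=1e-3) for s in range(total)]

first_annealed = next(i for i, v in enumerate(lrs) if v < 1e-3)
print(first_annealed, divmod(first_annealed, batches_per_epoch))
```

Because the threshold is counted in batches, annealing in this toy run begins partway through an epoch rather than at an epoch boundary.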

The weights are updated on every mini-batch (so once per batch of bs images).

barnacl · February 7, 2020, 4:50am

@muellerzr do we still use layer_groups ? couldn’t find it in the docs

barnacl · February 7, 2020, 5:22am

@bwarner just to check if what i’m thinking is correct:
We are using BCEWithLogitsLoss, which is basically BCELoss(sigmoid(raw_scores)).
So raw predictions can be in any range; after applying sigmoid they are squeezed into the [0,1] range, and we use this to calculate our BCELoss against the targets (which are only 0’s and 1’s).
After we finish training, at inference/test time we get the raw_scores and apply the sigmoid activation to squeeze them into the 0 to 1 range. We can now use a threshold as follows: if sigmoid(raw_scores) > threshold, then that class is present.
So the threshold is only used for predictions and show. Does that mean we choose our threshold based on how our test data is performing? (That doesn’t sound right.)
I’m still confused about how to choose the threshold.
@muellerzr could you please shed some light on this
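The relationship described in this post can be checked with a small pure-Python sketch. The helper names and the raw scores are hypothetical, and `bce_with_logits` uses the standard numerically stable rewrite of the loss rather than PyTorch's actual code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, y):
    """Binary cross-entropy for one probability `p` and one 0/1 target `y`."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(x, y):
    """The same loss computed directly from the raw score `x`, in a stable form."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

raw_scores = [2.0, -1.0, 0.5]   # hypothetical model outputs for three classes
targets = [1, 0, 0]

# Largest difference between the two ways of computing the loss
max_gap = max(abs(bce(sigmoid(x), y) - bce_with_logits(x, y))
              for x, y in zip(raw_scores, targets))

# At inference time: sigmoid first, then compare against the threshold
threshold = 0.5
probs = [sigmoid(x) for x in raw_scores]
predicted = [p > threshold for p in probs]
print(max_gap, predicted)
```

The threshold plays no part in the loss; it only turns the probabilities into present/absent decisions afterwards.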

foobar8675 · February 7, 2020, 3:59pm

for picking the threshold, i don’t believe there’s a science to it. i think you have to see how the data is performing like u said. for multilabel classification in v1 last year, it was set at 0.2: https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson3-planet.ipynb
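One common way to "see how the data is performing" without touching the test set is to sweep candidate thresholds on the validation set and keep the best one. A minimal sketch, assuming F1 as the metric; the probabilities and targets below are invented toy values, not results from any model.

```python
def f1_at(probs, targets, threshold):
    """F1 score when every probability above `threshold` counts as 'present'."""
    tp = sum(p > threshold and t == 1 for p, t in zip(probs, targets))
    fp = sum(p > threshold and t == 0 for p, t in zip(probs, targets))
    fn = sum(p <= threshold and t == 1 for p, t in zip(probs, targets))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Made-up validation predictions (after sigmoid) and their 0/1 targets
val_probs = [0.9, 0.55, 0.15, 0.6, 0.45, 0.05, 0.8, 0.35, 0.5]
val_targets = [1, 1, 0, 1, 1, 0, 1, 0, 0]

candidates = [i / 10 for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9
best = max(candidates, key=lambda t: f1_at(val_probs, val_targets, t))
print(best, f1_at(val_probs, val_targets, best))
```

The chosen value is then fixed and reused at test time, so the test data never influences it.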

lgvaz · February 7, 2020, 10:05pm

Sorry sorry, still confused, an example should make it easier to understand.

How would I achieve the following: load 2 images at a time, but update only every 32 images?

muellerzr · February 7, 2020, 10:08pm

Assuming the total dataset size is 32, we set our batch size to 2. Gradients are fully recomputed and the weights updated every 2 images, so 16 times in one pass over our data.

lgvaz · February 7, 2020, 10:12pm

But that is not really what I want to achieve with gradient accumulation.

Correct me if I’m wrong, but the idea is to not apply the gradient every 2 images. Instead we calculate the gradient 2 images at a time but only apply it after N steps. That way a small batch size can “feel” like a much larger one (32, for example). This can have several advantages.

muellerzr · February 7, 2020, 10:16pm

Yes: we accumulate the gradients step by step, and once we hit the end of the N steps we apply them all at once and zero them.

foobar8675 · February 8, 2020, 1:23am

@lgvaz not sure if this helps, but if u look at this example https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py step #4, the bs is set to 4, so there are 4 inputs which go through the forward and backward pass right after the gradients are zeroed out.

    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

barnacl · February 8, 2020, 1:28am

I think what @lgvaz is asking (and what was asked in the video) is this: you usually don’t want to train with bs=1 (it is unstable, batchnorm doesn’t work, etc.). So if there were no memory constraints and you could train with bs=8, that should be equivalent to training with bs=1 while accumulating the gradients for 8 images (zeroing the grads only after these 8 images have been processed one by one).
In the code you shared, are we updating the weights after we calculate the gradients for one image (as bs=1), or is fastai accumulating the gradients for a certain number of images?
Something like this:

model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
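To check the "should be equivalent" claim on the case asked about earlier (load 2 images at a time, update every 32), here is a self-contained toy version of the loop above. It uses a one-parameter model y = w * x in plain Python instead of PyTorch; the data, `mean_grad`, and the learning rate are invented for illustration.

```python
def mean_grad(w, batch):
    """Gradient of the mean squared error of y = w * x over `batch`, with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(i), 3.0 * i) for i in range(1, 33)]   # 32 toy samples
w, lr = 0.0, 1e-4
micro_bs, accumulation_steps = 2, 16                 # 2 * 16 = 32 samples per update

# Reference: gradient of one real batch of 32, taken before any update
full_batch_grad = mean_grad(w, data)

updates, grad_buffer, applied_grad = 0, 0.0, None
for step in range(len(data) // micro_bs):
    micro_batch = data[step * micro_bs:(step + 1) * micro_bs]
    # Dividing by accumulation_steps mirrors `loss = loss / accumulation_steps`
    grad_buffer += mean_grad(w, micro_batch) / accumulation_steps
    if (step + 1) % accumulation_steps == 0:
        applied_grad = grad_buffer
        w -= lr * grad_buffer      # the single "optimizer step"
        grad_buffer = 0.0          # the "zero_grad"
        updates += 1

print(updates, applied_grad, full_batch_grad)
```

With these numbers there is exactly one weight update for the 32 samples, and the accumulated gradient matches the full-batch gradient up to floating-point error. The match relies on the model having no batch-dependent layers; with batchnorm the statistics are still computed per micro-batch, so the two are no longer identical.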