Look at the TensorPoint transform and you'll see we normalize our points to a percentage (-100% to +100%, i.e. -1 to 1) from the center of the image. In this messy dataset some points were labeled even if they were never on the image, so we could get values greater than 100% or less than -100%. So we want to clamp a point down if it is not present. This is commonly done with the COCO dataset and other keypoint datasets (in their ground truths a point is set to (-1,-1) if it isn't present).
The points are never resized (per se); they stay at (-1,-1) (or (0,0)). We add this as a batch transform at the end because it simply looks at all the points after all our augmentation is done, sees if any point is outside the range we want (-1 to 1), and clamps it to (-1,-1) if any part of it is; a sketch is below. Does this help @mgloria? (I'm more than happy to explain this as much as I can, because this is a very important detail that isn't talked about much in the fastai library.)
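Something along these lines; this is my own minimal sketch, not the notebook's exact code (the name ClampBatch and its details are mine):

```python
from fastai.vision.all import *

class ClampBatch(Transform):
    "Clamp any keypoint falling outside [-1,1] to (-1,-1), i.e. mark it as 'not present'"
    order = 100  # high order so it runs after all the other batch transforms
    def encodes(self, x:TensorPoint):
        # If either coordinate of a point is out of range, replace the whole point
        bad = ((x < -1) | (x > 1)).any(dim=-1, keepdim=True).expand_as(x)
        return TensorPoint(torch.where(bad, torch.full_like(x, -1.), x))
```

You would then pass it as the last entry in batch_tfms, e.g. `batch_tfms=[*aug_transforms(), ClampBatch()]`.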
Why do we need: `dls.c = dls.train.after_item.c`?
I see later that we are manually passing n_out=18, so I do not see where dls.c is actually being used.
If we only had 512 channels, how are we passing 1024 as input to AdaptiveAvgPool2d and AdaptiveMaxPool2d (even if each needs 512)? How is this possible?
Usually these layers are used to reduce the image size while the number of channels stays the same.
@sgugger he's discussing FeatureLoss. This can also be found in the SuperRes notebook from the course (I don't know the answer, but here's the loss function from the SuperRes course notebook):
```python
class FeatureLoss(Module):
    def __init__(self, m_feat, layer_ids, layer_wgts):
        self.m_feat = m_feat
        self.loss_features = [self.m_feat[i] for i in layer_ids]
        self.hooks = hook_outputs(self.loss_features, detach=False)
        self.wgts = layer_wgts
        self.metric_names = (['pixel'] + [f'feat_{i}' for i in range(len(layer_ids))]
                                       + [f'gram_{i}' for i in range(len(layer_ids))])

    def make_features(self, x, clone=False):
        self.m_feat(x)
        return [(o.clone() if clone else o) for o in self.hooks.stored]

    def forward(self, input, target, reduction='mean'):
        out_feat = self.make_features(target, clone=True)
        in_feat = self.make_features(input)
        self.feat_losses = [base_loss(input, target, reduction=reduction)]
        self.feat_losses += [base_loss(f_in, f_out, reduction=reduction)*w
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        self.feat_losses += [base_loss(gram_matrix(f_in), gram_matrix(f_out), reduction=reduction)*w**2 * 5e3
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        if reduction == 'none':
            self.feat_losses = ([f.mean(dim=[1,2,3]) for f in self.feat_losses[:4]]
                              + [f.mean(dim=[1,2]) for f in self.feat_losses[4:]])
        for n, l in zip(self.metric_names, self.feat_losses): setattr(self, n, l)
        return sum(self.feat_losses)

    def __del__(self): self.hooks.remove()
```
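For this to run it also needs the helpers defined earlier in that notebook; if I remember correctly they are an L1 base loss and a Gram matrix along these lines (check the notebook for the exact definitions):

```python
import torch.nn.functional as F

base_loss = F.l1_loss

def gram_matrix(x):
    "Channel-by-channel correlations of the feature maps, used for the style terms"
    n, c, h, w = x.size()
    x = x.view(n, c, -1)
    return (x @ x.transpose(1, 2)) / (c * h * w)
```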
dls.c is read if we pass our DataLoaders to cnn_learner. It uses this to figure out how many outputs we want in our head; roughly, it works like the snippet below.
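Paraphrasing the relevant bit of the fastai source (check the library for the exact current code):

```python
# Inside cnn_learner, when you don't pass n_out yourself:
if n_out is None: n_out = get_c(dls)  # get_c falls back on dls.c
assert n_out, "`n_out` is not defined, and could not be inferred from data, set `dls.c` or pass `n_out`"
```

So passing n_out=18 explicitly simply bypasses this lookup, which is why dls.c appears unused in that notebook.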
The 512 channels get fed to both at once (each pool sees the same 512-channel input), and their outputs are concatenated back together, which is where the 1024 comes from. Another thing to explore for this is how create_head uses the input filters; a quick shape check is below.
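Concretely, fastai's create_head starts with AdaptiveConcatPool2d, which concatenates an adaptive average pool and an adaptive max pool. A minimal re-implementation to show the shapes (this mirrors the fastai layer, so treat it as a sketch):

```python
import torch
from torch import nn

class AdaptiveConcatPool2d(nn.Module):
    "Concatenate the outputs of AdaptiveMaxPool2d and AdaptiveAvgPool2d"
    def __init__(self, size=1):
        super().__init__()
        self.ap = nn.AdaptiveAvgPool2d(size)
        self.mp = nn.AdaptiveMaxPool2d(size)
    def forward(self, x): return torch.cat([self.mp(x), self.ap(x)], dim=1)

x = torch.randn(2, 512, 7, 7)           # activations from a resnet-style body
print(AdaptiveConcatPool2d()(x).shape)  # torch.Size([2, 1024, 1, 1])
```

So the head's first linear block sees 2*512 = 1024 input features, even though the body only produced 512 channels.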
Thanks @mgloria. So in the case of style transfer we capture the layer-specific predictions and compare them to the target to generate the loss, and this loss needs to be differentiated to backprop and reduce the loss, hence detach=False. Got it.
Yes - I understood that you detach when you DO NOT need gradients. I did not grasp that gradients were needed when doing feature loss, which I now get from @mgloria's explanation. Thanks!
Hi @muellerzr, I just went through the notebooks of the last video. I see you manually calculated the mean and std to normalize the dataset in the Bengali notebook, but you were using a pretrained model. I don't understand why you chose to use the dataset's mean and std when you should have used imagenet_stats to normalize.
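(For anyone following along, computing your own stats is just a per-channel mean/std over your data; a rough sketch, not the notebook's exact code:)

```python
# Grab a batch from un-normalized DataLoaders and compute per-channel stats
x, _ = dls.one_batch()
mean, std = x.mean(dim=[0, 2, 3]), x.std(dim=[0, 2, 3])
# Then normalize with these instead of imagenet_stats
batch_tfms = [Normalize.from_stats(mean, std)]
```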
Because I don't have a 3-channel image (normalizing with ImageNet stats assumes 3 channels). To make up for that I also changed the first conv layer to accept our single-channel (2D) image; a sketch of that change is below. Jeremy's rule of thumb: ALWAYS use transfer learning when and where you can.
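Something like this first-layer surgery (my own sketch with a torchvision resnet; the notebook's exact approach may differ):

```python
import torch
from torchvision.models import resnet34

model = resnet34(pretrained=True)
old_w = model.conv1.weight.clone()  # pretrained RGB filters, shape [64, 3, 7, 7]
model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    # Keep some of the pretrained signal by summing the RGB filters into one channel
    model.conv1.weight.copy_(old_w.sum(dim=1, keepdim=True))
```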
I see, now it makes sense. I have always treated grayscale images as 3-channel images by using the PIL function .convert('RGB'). I thought that would be a better approach, since you don't need to mess with the model at all and imagenet_stats can be used to normalize. Which is the better approach in your view?
Well, it's not a 3-channel image, so I'd rather not keep it as one, personally. And the dataset is only single-channel, so I'd still need its stats even if I made it 3 channels, because it's still not similar enough to ImageNet to use their stats (IMO). Which then begs the question of why I still use their weights: again, always go pretrained. You can at least get something from their weights.
Yes, so I don't know which is better or worse; maybe they're exactly the same! Many ways to skin a cat. This is just one method I saw being used for situations like this.
I think converting the grayscale to 3-channel would be better, as you're not throwing away the very first layer. Having said that, CNNs are powerful models, so I also think it doesn't matter much which method one uses. Regarding transfer learning: do you think transfer learning from huge hand-drawn datasets like QuickDraw would give better performance? Has anyone done that?
Is your style transfer example based on Gatys's 2015 "A Neural Algorithm of Artistic Style" (what Jeremy describes as "the original way to do it" in his video)?