Most any layer that looks different or odd is in the layers.py file. If you look you can see PixelShuffle is just nn.PixelShuffle (plus a few bits)
True. But I want to know the theory behind it and why it's used extensively by fastai. I believe Jeremy once said he'll explain it in the second part of the course, but I didn't find any reference to that. Looking for an article/paper expounding on this topic.
For pixel shuffle specifically the Torch code itself references this paper:
for pytorch tutorials, i personally liked this one. it is more an intro https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
Found a few answers after exploring the code as @muellerzr suggested.
If you pass just slice(end) then the last group's learning rate is end, and all the other groups are end/10.
here (look under lr_range). What role slice plays can be seen here.
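A minimal sketch of the slice(end) rule described above, in plain Python. The function name expand_lr is hypothetical and this is not fastai's own code. The slice(start, end) branch, which spaces the values multiplicatively between start and end, is my assumption about the intended behaviour and is not stated in this thread.

```python
def expand_lr(lr, n_groups):
    """Expand a learning-rate spec into one value per parameter group.

    Hypothetical helper for illustration only, not the fastai implementation.
    """
    if not isinstance(lr, slice):
        # A plain number: every group gets the same learning rate.
        return [lr] * n_groups
    if lr.start is None:
        # slice(end): last group gets end, all other groups get end/10.
        return [lr.stop / 10] * (n_groups - 1) + [lr.stop]
    if n_groups == 1:
        return [lr.stop]
    # slice(start, end): ASSUMPTION, values spaced by a constant multiplier.
    step = (lr.stop / lr.start) ** (1 / (n_groups - 1))
    return [lr.start * step**i for i in range(n_groups)]


if __name__ == "__main__":
    print(expand_lr(slice(1e-3), 3))        # two small values, then end
    print(expand_lr(slice(1e-5, 1e-3), 3))  # increasing from start to end
```

With three parameter groups, slice(end) gives the [lr/10, lr/10, lr] pattern.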
So now we get three lrs, i.e. [lr/10, lr/10, lr]. How is the network split so we can apply these different lrs?
This happens in the Learner, which takes a parameter called splitter. Splitters are a bunch of functions that are defined based on the architecture family.
As we are dealing with a resnet we have - def _resnet_split(m): return L(m[0][:6], m[0][6:], m[1:]).map(params)
So we start with a resnet, look for the last pooling layer, and remove everything from that pooling layer onwards (including the pooling layer). What is left is the body, which is split into m[0][:6] and m[0][6:], and each of these gets lr/10.
Since this is the pretrained part you don't want to fiddle around too much. In fact, when the model is frozen these are not updated. The lr makes a difference only when we unfreeze.
The m[1:] is all the new stuff we add, which is the "bottom part of the U in unet" (i.e. middle_conv) and the decoder (look at the code, we add a little more). These layers are initialized with "random" weights (kaiming init) and that is why they have a larger learning rate.
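To make the three groups concrete, here is a toy sketch that mirrors the slicing in _resnet_split without needing PyTorch. The model is faked as nested lists of parameter names; resnet_style_split and the layer names are hypothetical and only illustrate the indexing, not the real fastai splitter or a real unet.

```python
def resnet_style_split(m):
    """Split a toy model into three parameter groups.

    m[0] is the pretrained body (a list of layers), m[1:] are the newly
    added layers. Each layer is just a list of parameter names here.
    """
    body, new_layers = m[0], m[1:]
    early = [p for layer in body[:6] for p in layer]    # m[0][:6]
    late = [p for layer in body[6:] for p in layer]     # m[0][6:]
    new = [p for layer in new_layers for p in layer]    # m[1:]
    return [early, late, new]


if __name__ == "__main__":
    # Toy stand-in: 8 body layers, then two new blocks (names are made up).
    body = [[f"body{i}.weight"] for i in range(8)]
    model = [body, ["middle_conv.weight"], ["decoder.weight"]]

    groups = resnet_style_split(model)
    lrs = [1e-4, 1e-4, 1e-3]  # [lr/10, lr/10, lr] with lr = 1e-3
    for group, lr in zip(groups, lrs):
        print(len(group), "params at lr", lr)
```

The first two groups are the pretrained body and get the smaller lr; the last group holds everything that was added on top.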
The interesting thing i found is that the way the models are split is not specific to an architecture but rather to a family of architectures. (IIRC Jeremy mentioned that he experimented a bit and defined the splits, not sure of the science behind it, if there is one)
You want a higher lr for all the new layers as these are "random" weights.
For the body, the layers closest to the input need little tweaking. And the layers after that need a little more tweaking, but less than the newly added layers.
Please correct me where i'm wrong @Srinivas
This is why irrespective of segmentation or classification we split the same way and assign the lrs the same way when we use pretrained architectures
in the segmentation notebook @muellerzr you mention #let's make our vocabulary a part of our DataLoaders, as our loss function needs to deal with the Void label
i think you meant accuracy and not loss function.
@mgloria, @muellerzr i wrote a split_subsets function based on your first suggestion. it's here: https://colab.research.google.com/drive/1nTetOULwzZzOZ8849QM7ZQTLcCTH3V1V#scrollTo=jpQs3pDoh7y7 the name of the function is SubsetSplitter. if there is any feedback …
Yes I did, I'll make that adjustment later today.
wouldn't first doing a randperm on all the ids and then cutting be safer? if things were arranged in contiguous groups, shuffling them and then cutting would hopefully pick a good distribution @foobar8675
it might be and i was thinking of that initially, but zach's suggestion to do it this way (assuming i was understanding him correctly) makes it such that it could be used after splitting with a fastai splitter.
i think if there were to be a randperm on all the ids, then it would have to be a replacement for RandomSplitter, GrandparentSplitter, … i think.
Yes, @foobar8675's idea is how I would've implemented it, at least logically. The assumption is we have pre-defined validation and train, which could come from using a different splitter first, from which we then take a subset of both or all (if we have more than 2). Similar to how Lookahead() can be wrapped around any base optimizer.
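A rough sketch of the "wrap any splitter" idea, in plain Python. This is not the SubsetSplitter from the linked notebook; subset_wrapper and first_80_percent_splitter are hypothetical names, and the only assumption about a splitter is that it takes the items and returns one list of indices per split. Sampling at random within each split also covers the concern about contiguous groups.

```python
import random


def subset_wrapper(base_splitter, fraction, seed=None):
    """Wrap a splitter so each split it returns is randomly subsampled.

    base_splitter: callable, items -> tuple of index lists (train, valid, ...)
    fraction: share of each split to keep, between 0 and 1
    """
    def _inner(items):
        rng = random.Random(seed)
        subsets = []
        for idxs in base_splitter(items):
            idxs = list(idxs)
            # Keep at least one index per non-empty split.
            k = min(len(idxs), max(1, int(len(idxs) * fraction)))
            subsets.append(sorted(rng.sample(idxs, k)))
        return tuple(subsets)
    return _inner


def first_80_percent_splitter(items):
    """Toy base splitter: first 80% is train, the rest is valid."""
    cut = int(len(items) * 0.8)
    return list(range(cut)), list(range(cut, len(items)))


if __name__ == "__main__":
    splitter = subset_wrapper(first_80_percent_splitter, fraction=0.5, seed=0)
    train, valid = splitter(list(range(100)))
    print(len(train), len(valid))
```

Because the wrapper only touches the indices the base splitter returns, the train/valid boundary decided by the base splitter is preserved.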
In the ML course this is how jeremy does it.
I see what you mean. i think we should be doing it on items before we pass it in.
I was thinking we should randomly subset (if permissible) the predefined validation and train and then pass it to the splitter. I guess both ways would be the same.
I'll look at some examples of https://github.com/fastai/fastai2/blob/master/fastai2/optimizer.py#L268 . it's not something i've explored at all. do you think that would make for a more usable api?
the way jeremy does it in the link you posted makes sense and i am glad to change. the code i wrote is influenced by this, which is just the fastai1 way of doing it.
(i do want to explore the optimizer way of doing it a bit and see what comes of it. kind of fun to see how all this is wired together)
I think so cause now we can just wrap it around any splitter (what I initially had in mind). LookAhead is the only one that works like that.
this is the part i was pointing out that will be safer if you add in your implementation
For image regression, what type of explanation mechanisms can we use for DL models? I can think of CAM, layer visualization, ROC, AIC, but not confusion matrix etc… Any suggestions are welcome…
GradCAM and layer visualization are pretty much it for the most part, focusing on what the attention is. You could also then isolate what each point is in the output, as we'd assume that y1 would go to y1 on our ground truth, etc., so we could see which point is having the highest difficulty.
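One way to look at "which point is having the highest difficulty" is a per-output error, sketched below in plain Python. per_point_mae is a hypothetical helper, and mean absolute error is my choice of metric for the illustration; the thread does not prescribe one.

```python
def per_point_mae(preds, targets):
    """Mean absolute error for each output position across all samples.

    preds, targets: lists of equal-length rows, one row per sample,
    where column j of preds is compared with column j of targets.
    """
    n_points = len(targets[0])
    totals = [0.0] * n_points
    for p_row, t_row in zip(preds, targets):
        for j in range(n_points):
            totals[j] += abs(p_row[j] - t_row[j])
    return [total / len(targets) for total in totals]


if __name__ == "__main__":
    preds = [[1.0, 2.0], [3.0, 4.0]]
    targets = [[1.0, 3.0], [3.0, 6.0]]
    errors = per_point_mae(preds, targets)
    hardest = max(range(len(errors)), key=errors.__getitem__)
    print(errors, "hardest output index:", hardest)
```

The output with the largest error is the one the model struggles with most, which is a natural place to start looking at GradCAM maps.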
Attention is also a good one … By the way, I was trying to find mechanisms not only for point regression but more generally for numeric ones…