Recently I started digging deeper into the library, and today I have been working through vision.learner. There is certainly a wealth of wisdom and neat tricks built into it, as anyone who has attempted to read the source code can tell, but due to the lack of documentation and comments in the source, I could not fully grasp the rationale behind each of the slight tweaks.
Let’s take create_head as an example:
def create_head(nf:int, nc:int, lin_ftrs:Optional[Collection[int]]=None, ps:Floats=0.5,
                concat_pool:bool=True, bn_final:bool=False):
    "Model head that takes `nf` features, runs through `lin_ftrs`, and outputs `nc` classes."
    # default: a single hidden linear layer of size 512 between the pooled features and the output
    lin_ftrs = [nf, 512, nc] if lin_ftrs is None else [nf] + lin_ftrs + [nc]
    ps = listify(ps)
    # a single dropout value is split: half of it before each hidden layer, the full value before the output
    if len(ps) == 1: ps = [ps[0]/2] * (len(lin_ftrs)-2) + ps
    actns = [nn.ReLU(inplace=True)] * (len(lin_ftrs)-2) + [None]
    pool = AdaptiveConcatPool2d() if concat_pool else nn.AdaptiveAvgPool2d(1)
    layers = [pool, Flatten()]
    for ni,no,p,actn in zip(lin_ftrs[:-1], lin_ftrs[1:], ps, actns):
        layers += bn_drop_lin(ni, no, True, p, actn)
    if bn_final: layers.append(nn.BatchNorm1d(lin_ftrs[-1], momentum=0.01))
    return nn.Sequential(*layers)
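For context, `bn_drop_lin` returns a list of layers rather than a module, which is why it is added with `layers +=`. As I recall from the fastai v1 source (paraphrased, so treat it as a sketch rather than the exact code), it looks roughly like this:

from typing import Optional
import torch.nn as nn

def bn_drop_lin(n_in:int, n_out:int, bn:bool=True, p:float=0., actn:Optional[nn.Module]=None):
    "Batch norm (if `bn`), dropout (if `p` > 0), then an `n_in` -> `n_out` linear layer, optionally followed by `actn`."
    layers = [nn.BatchNorm1d(n_in)] if bn else []
    if p != 0: layers.append(nn.Dropout(p))
    layers.append(nn.Linear(n_in, n_out))
    if actn is not None: layers.append(actn)
    return layers

So each step through `lin_ftrs` contributes a batch norm -> dropout -> linear (-> ReLU) block.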
There are quite a few tricks here:
Uses AdaptiveConcatPool2d, which concatenates the outputs of adaptive average pooling and adaptive max pooling (see the sketch right after this list).
Adds an intermediate fully connected layer with an output size of 512 before the final fully connected layer, whose size equals the number of classes to predict.
Can optionally add a final batch-norm layer.
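For reference, AdaptiveConcatPool2d is a small module; here is a minimal sketch of the idea (paraphrased from the fastai v1 source, so check the library for the exact version):

import torch
import torch.nn as nn

class AdaptiveConcatPool2d(nn.Module):
    "Concatenate adaptive average pooling and adaptive max pooling over the spatial dimensions."
    def __init__(self, sz:int=1):
        super().__init__()
        self.ap = nn.AdaptiveAvgPool2d(sz)
        self.mp = nn.AdaptiveMaxPool2d(sz)
    def forward(self, x):
        # output has twice the input channels: [max-pooled | avg-pooled]
        return torch.cat([self.mp(x), self.ap(x)], dim=1)

Note that the output has 2x the channels of the backbone's feature map, so the `nf` handed to create_head must already account for the doubling (which, if I remember correctly, the library does when it assembles the full model).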
Have the reasons for adding these tricks ever been covered in the classes? If not, how could I better learn and understand such deep learning hacks, which I suppose are the distillation of years of practical wisdom from Jeremy and Sylvain?
This is a great question! I am very interested in what @jeremy would say about this too. While we wait for Jeremy to find time to reply to this thread, maybe we could borrow the spirit of experimentation he taught us in the v3 part 1 videos and imagine what he would say here.
First of all, “the reasons for adding these tricks” could be expanded into the following chain of actions:
stimulation/inspiration: this could come from a new paper, a discussion, or some experiments. For example (borrowing from Jeremy’s story about He and ResNet in the v3 part 1 course), in the case of the res-block, the inspiration came from a totally unexpected outcome of an experiment comparing a shallow model with a deep one.
brainstorming: finding clever ways of exploring the unexpected result. In the case of ResNet, instead of directly searching for why a deep net behaves worse, He took a seemingly much smaller step forward by asking how a deep model could be made to work at least as well as a shallow one. This led to the identity-function idea: structure the deep model so that it can reduce to the shallow one (see the residual-block sketch after this list).
experiment design: what naturally follows is designing an experiment that fairly compares a deep net with res/identity/skip blocks against a shallow net, while eliminating other factors.
outcome analysis: the final step is analysing the experimental outcomes properly, so as to reach a statistically justified conclusion.
Of course, the four steps may have to be iterated a few times before reaching a satisfactory conclusion.
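To make the identity-function idea concrete: a residual block computes x + f(x), so if f’s weights are pushed towards zero the block collapses to the identity, and the deep net can in principle always fall back to behaving like the shallow one. A minimal sketch (not He’s exact block, which also includes batch norm and downsampling variants):

import torch.nn as nn
import torch.nn.functional as F

class BasicResBlock(nn.Module):
    "Minimal residual block: out = relu(x + conv2(relu(conv1(x))))."
    def __init__(self, ch:int):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
    def forward(self, x):
        # if conv1/conv2 weights go to zero, the block reduces to the identity,
        # so a deep stack of these blocks can mimic a shallower net
        return F.relu(x + self.conv2(F.relu(self.conv1(x))))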
Obviously Jeremy must have gone through all those steps (and much more) to get these tricks into the fastai library, so the stories behind steps 1 and 2 are his to tell. But steps 3 and 4 seem much more scientific and objective, and standard methodologies for carrying them out may well exist. If so, that is something we as students can try to figure out, so that we can run the experiments ourselves and verify that those tricks really do make an improvement.
However, so far Jeremy’s suggestions on steps 3 and 4 amount to “try bla”. Maybe we should not over-complicate the methodologies for steps 3 and 4; or maybe those methodologies are quite standard and can be found in any influential DL paper, such as He’s paper on ResNet? (I confess I have never paid much attention to the experiments sections of papers.)
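For what it’s worth, the recipe one sees over and over in papers’ experiments sections is: hold everything fixed, vary exactly one factor, repeat over several random seeds, and report mean ± standard deviation (plus, ideally, a significance test). A minimal sketch of that pattern, with a fake training function standing in for a real fastai training run (everything here is hypothetical and for illustration only):

import random, statistics

def train_and_eval(use_concat_pool:bool, seed:int) -> float:
    "Placeholder for a full training run; swap in a real training loop and metric."
    random.seed(seed)
    # fake accuracy for illustration only; no real results implied
    return random.gauss(0.92 if use_concat_pool else 0.91, 0.005)

def compare(n_seeds:int=5):
    a = [train_and_eval(True,  s) for s in range(n_seeds)]
    b = [train_and_eval(False, s) for s in range(n_seeds)]
    # report mean ± std per variant; with more rigour, add a paired significance test
    for name, xs in [("concat pool", a), ("avg pool", b)]:
        print(f"{name}: {statistics.mean(xs):.4f} ± {statistics.stdev(xs):.4f}")

compare()

The key design point is that the two runs differ in exactly one factor (the pooling layer) while sharing seeds, data, and schedule, which is what step 3 means by “eliminating other factors”.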
Can anyone shed some light on the experiment and analysis methodologies?
What do you think of the four-step process above?