Lesson 10 Discussion & Wiki (2019)

Many design patterns (if not all, AFAIK) focus on the object-oriented programming paradigm. We are dealing with a mix of object-oriented, functional and dataflow paradigms. This makes OO patterns partially applicable, but not that useful in the bigger picture. We need a new methodology and new design patterns to emerge.

The fastai programming style gives us an interesting example of, and insight into, what these patterns might be. Fastai offers examples of well-thought-through use of decorators, closures, partials and compose. I wish Software Engineering methodology researchers paid more attention to it.
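For readers less familiar with these building blocks, here is a minimal sketch of how a decorator, a closure, partials and a compose helper can fit together. The names `compose`, `scale`, `shift` and `logged` are made up for illustration; in particular this `compose` is not fastai's own helper and its signature is an assumption.

```python
from functools import partial, reduce

def compose(*funcs):
    """Return a function that applies funcs left to right (illustrative helper)."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)

def scale(x, factor): return x * factor
def shift(x, offset): return x + offset

def logged(f):
    """Decorator: the wrapper closes over `calls` and records every result."""
    calls = []
    def wrapper(*args, **kwargs):
        out = f(*args, **kwargs)
        calls.append(out)
        return out
    wrapper.calls = calls
    return wrapper

# partial fixes the configuration up front; compose chains the configured steps
pipeline = logged(compose(partial(scale, factor=2), partial(shift, offset=1)))
result = pipeline(5)   # (5 * 2) + 1
```

The point of the style is that each step stays a plain function, and configuration is attached with `partial` rather than with a class hierarchy.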

I like your point about the overemphasis on OO patterns. If I found a book of coding patterns that look beyond languages and specific paradigms, it would definitely be worth the read; the fast.ai code is the best source I am aware of. I still suspect that programming, like speaking a language, is a skill: you can’t become a master just by learning grammar and some elegant ways to express yourself.

Good point concerning functional programming. The processor pattern in fastai is a good match for that.

If you find such a book, please let me know.

I see Jeremy here being complimentary about Fowler

So maybe I’ll give his newly revised book on refactoring a closer look some other time.

So I was revisiting lesson 10 and had the same thought: we have this awesome GeneralRelu, so why don’t we just learn all the parameters instead of predefining them? I searched around to see if anyone on here had done it and couldn’t find anything.

So I went ahead and implemented “LearnedRelu”, which was super easy (assuming I did it right):

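import torch
import torch.nn as nn

class LearnedRelu(nn.Module):
    def __init__(self, leak=0.1, sub=0.25, maxv=100):
        super().__init__()
        self.leak = nn.Parameter(torch.ones(1)*leak)
        self.sub  = nn.Parameter(torch.zeros(1)+sub)
        self.maxv = nn.Parameter(torch.ones(1)*maxv)

    def forward(self, x):
        # Keep the parameters as tensors: calling .item() would hand autograd a
        # plain float, so leak and maxv would never receive gradients.
        x = torch.where(x >= 0, x, x*self.leak)  # leaky ReLU with a learnable slope
        x = x - self.sub                         # out-of-place shift by the learnable sub
        return torch.min(x, self.maxv)           # cap at the learnable maxv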

So far it seems to work great. I started a separate thread on the topic with a gist of my work so far here: https://forums.fast.ai/t/learning-generalrelu-params-here-is-learnedrelu/44599
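One way to sanity-check that a parameter is really being learned is to look at its `.grad` after a backward pass. The sketch below compares two hypothetical modules (`SlopeViaItem` and `SlopeViaTensor`, neither of which is from fastai): one passes the slope to `F.leaky_relu` as a Python float via `.item()`, the other keeps it as a tensor. It assumes PyTorch is installed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlopeViaItem(nn.Module):
    """Hypothetical: hands the slope to F.leaky_relu as a Python float."""
    def __init__(self, leak=0.1):
        super().__init__()
        self.leak = nn.Parameter(torch.ones(1) * leak)
    def forward(self, x):
        return F.leaky_relu(x, self.leak.item())  # .item() leaves the autograd graph

class SlopeViaTensor(nn.Module):
    """Hypothetical: keeps the slope as a tensor so autograd can track it."""
    def __init__(self, leak=0.1):
        super().__init__()
        self.leak = nn.Parameter(torch.ones(1) * leak)
    def forward(self, x):
        return torch.where(x >= 0, x, x * self.leak)

# x requires grad so that both outputs are attached to a graph
x = torch.tensor([-2.0, -1.0, 0.5, 3.0], requires_grad=True)
via_item, via_tensor = SlopeViaItem(), SlopeViaTensor()
via_item(x).sum().backward()
via_tensor(x).sum().backward()

item_grad = via_item.leak.grad      # None: no gradient ever reached leak
tensor_grad = via_tensor.leak.grad  # sum of the negative inputs: tensor([-3.])
```

A parameter whose `.grad` stays `None` will not be updated by the optimizer, so it keeps its initial value no matter how long training runs.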

A basic (and possibly silly) question regarding callbacks.

In the code below:

from time import sleep

class SlowCalculator():
    def __init__(self, cb=None): self.cb,self.res = cb,0

    def callback(self, cb_name, *args):
        print(cb_name)
        print('self.cb:')
        print(self.cb)
        if not self.cb: return
        cb = getattr(self.cb,cb_name, None)
        print('cb:')
        print(cb)
        if cb: return cb(self, *args)

    def calc(self):
        for i in range(5):
            self.callback('before_calc', i)
            self.res += i*i
            sleep(1)
            if self.callback('after_calc', i):
                print("stopping early")
                break

class ModifyingCallback():
    def after_calc (self, calc, epoch):
        print(f"After {epoch}: {calc.res}")
        if calc.res>10: return True
        if calc.res<3: calc.res = calc.res*2

What is calc in the ModifyingCallback class, and how is it passed in from SlowCalculator?

Yeah, I had the same question. Has this been answered already by any chance? I opened a PR about this: https://github.com/fastai/fastai_docs/pull/107
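For what it's worth, the hand-off seems to happen in the line `if cb: return cb(self, *args)`: inside `SlowCalculator.callback`, `self` is the calculator, and `cb` is a method already bound to the callback object, so the calculator arrives as the `calc` argument. A stripped-down sketch (`TinyCalculator` and `RecordingCallback` are made-up stand-ins without the sleep and debug prints):

```python
class TinyCalculator:
    """Stand-in for SlowCalculator with the same callback mechanism."""
    def __init__(self, cb=None): self.cb, self.res = cb, 0

    def callback(self, cb_name, *args):
        if not self.cb: return None
        cb = getattr(self.cb, cb_name, None)  # bound method, or None if not defined
        # `self` here is the calculator; it becomes `calc` in the callback
        if cb: return cb(self, *args)

    def calc(self):
        for i in range(5):
            self.callback('before_calc', i)   # silently skipped: no such method
            self.res += i*i
            if self.callback('after_calc', i): break
        return self.res

class RecordingCallback:
    def __init__(self): self.seen = []
    def after_calc(self, calc, epoch):
        self.seen.append((calc, epoch, calc.res))
        return calc.res > 10                  # a truthy return stops the loop

cb = RecordingCallback()
calculator = TinyCalculator(cb)
result = calculator.calc()
```

Because the callback receives the calculator object itself rather than a copy, it can both read and modify `calc.res`, which is what ModifyingCallback relies on.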

How do we overcome the tuple error? I am running the notebook in Google Colab.

Just a quick suggestion for the fastai team: I’m not a huge fan of having all the callback names as magic strings. It would be easier to have a class full of constants defining all the callback names, which would save having to look at the docs or dig into the source code all the time.

Did you see the use of SimpleNamespace in 11a_transfer_learning?
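A minimal sketch of the idea, assuming a made-up list of event names (`EVENT_NAMES`, `fire` and `PrintOnFit` are illustrative and not taken from the notebook): building a `SimpleNamespace` from the names gives attribute access, so a typo fails loudly instead of silently matching nothing.

```python
from types import SimpleNamespace

# Hypothetical event names; the real set lives in the library source.
EVENT_NAMES = ['begin_fit', 'begin_batch', 'after_batch', 'after_fit']

# Each attribute holds its own name, e.g. cb_types.begin_fit == 'begin_fit'
cb_types = SimpleNamespace(**{name: name for name in EVENT_NAMES})

def fire(handler, event, *args):
    """Call handler.<event>(*args) if the handler defines it, else return None."""
    method = getattr(handler, event, None)
    return method(*args) if method else None

class PrintOnFit:
    def begin_fit(self): return 'starting'

out = fire(PrintOnFit(), cb_types.begin_fit)

try:
    cb_types.begin_ft            # typo: raises instead of being silently ignored
    typo_caught = False
except AttributeError:
    typo_caught = True
```

Editors can also autocomplete `cb_types.`, which addresses the "dig into the source" complaint to some extent.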

For the curious, here’s a plot of the stats for the latest version of RunningBatchNorm from 07_batchnorm.ipynb after a single epoch.

Note how the mean stays within ±0.15 of 0 and the std stays within ±0.3 of 1 for all layers!

For anyone who wants a deeper understanding of the Batch Norm technique:

Good day friends, a quick question about lesson 10. When using our own BatchNorm layer:

if bn: layers.append(BatchNorm(nf))

and then running an epoch and looking at the hooks at the start:

for h in hooks[:-1]:

the graphs show that the means of all layers begin at 0 and the stds of all layers begin at 1. All great.

But if I switch to the PyTorch BatchNorm layer:

if bn: layers.append(nn.BatchNorm2d(nf, eps=1e-5, momentum=0.1))

and repeat the same hook graphs, the means still all begin at 0, but the stds do not: they don’t begin at 1, and each layer starts from a different value.

I’ve also just checked that something similar happens in the graph showing the percentage of activations that are near 0. With the fast.ai BatchNorm, only 20% or so are near 0, but with the PyTorch BatchNorm the percentage is much higher.

I have checked the code against the notebook and can’t find a glitch. Why the difference? Thank you for the help.

Hi, if you post the code (or a link) for your experiment, it will be easier to help you out. :wink:

Thank you Fabrizio :wink:

so the conv_layer is

def conv_layer(ni, nf, ks=3, stride=2, bn=True, **kwargs):
    layers = [nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=not bn),
              GeneralRelu(**kwargs)]
    if bn: layers.append(nn.BatchNorm2d(nf, eps=1e-5, momentum=0.1))
    return nn.Sequential(*layers)

and then I do:

path = datasets.untar_data(datasets.URLs.IMAGENETTE_160)
bs=128
tfms = [make_rgb, ResizeFixed(128), to_byte_tensor, to_float_tensor]
il = ImageList.from_files(path, tfms=tfms)
sd = SplitData.split_by_func(il, partial(grandparent_splitter, valid_name='val'))
ll = label_by_func(sd, parent_labeler, proc_y=CategoryProcessor())
# get_dls uses ll, so it has to run after ll is defined
train_dl, valid_dl=get_dls(ll.train, ll.valid, bs, num_workers=4)
data = ll.to_databunch(bs, c_in=3, c_out=10, num_workers=4)
nfs = [64,64,128,128]

cbfs = [Recorder,
        partial(AvgStatsCallback,accuracy),
        CudaCallback,
        partial(BatchTransformXCallback, norm_imagenette)]

learn,run = get_learn_run(nfs, data, 0.9, conv_layer, cbs=cbfs)  

with Hooks(learn.model, append_stats) as hooks:
    run.fit(1, learn)
    fig,(ax0,ax1) = plt.subplots(1,2, figsize=(10,4))
    for h in hooks[:-1]:
        ms,ss,hi = h.stats
        ax0.plot(ms[:10])
        ax1.plot(ss[:10])
    plt.legend(range(6));
    
    fig,(ax0,ax1) = plt.subplots(1,2, figsize=(10,4))
    for h in hooks[:-1]:
        ms,ss,hi = h.stats
        ax0.plot(ms)
        ax1.plot(ss)

and append_stats is:

def append_stats(hook, mod, inp, outp):
    if not hasattr(hook,'stats'): hook.stats = ([],[],[])
    means,stds,hists = hook.stats
    if mod.training:
        means.append(outp.data.mean().cpu())

thank you again for trying to help :wink:

@fabris Fabrizio, here is the notebook with the code. Please see if you can help me spot the mistake that causes the hooks to show the means correctly synchronized while the stds appear wrong. Thank you very much :wink: :
https://colab.research.google.com/drive/1rip1MFYwxbleZfXwH80lbJW16SNdrLFj

By the way, switching to the fast.ai running batchnorm makes it all work great, so I have switched to it. I’m still curious why the std behaves that way with PyTorch’s batchnorm layer.

I was watching the video and noticed that the train time for using running batch norm is just over twice that of using the normal batch norm. This seems like a large slowdown and I was wondering what might cause it and if it is a concern?

In the 07 notebook, why do we have parameters gamma and beta to add and multiply after batch norm if we are going to have a linear transformation immediately after? I can see having an add (beta) if the following layer has no bias, but the scaling factor seems like wasted computation.

I ran 50 trials with 1) the original RunningBatchNorm, 2) removing gamma, 3) removing beta and gamma and adding back bias to conv layer, 4) removing gamma and placing Batchnorm before the ReLU. I found no statistical difference in the validation error for 1), 2), or 3), (p=.8) but 4) was statistically better (p=.00001). Also, the original took roughly 7% longer to run than any of the alternatives.

It looks like it is best to have batchnorm BEFORE the non-linearity, at least in the MNIST 1 epoch example. This results in the epoch running 7% faster with a 15% lower error rate. It should also be fine to remove gamma as long as the activation function is a simple ReLU. Am I missing something?
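On the redundancy question, the forward computation can at least be checked directly for a plain linear layer: scaling by gamma and shifting by beta before W and b gives the same output as folding gamma into the weights and beta into the bias, since W(gamma*x + beta) + b = (W*gamma)x + (W beta + b). A minimal pure-Python check with made-up numbers (the `affine` helper and all values are illustrative, not from the 07 notebook):

```python
def affine(W, b, x):
    """Return W @ x + b for a list-of-rows matrix W and vectors b, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

x_hat = [0.5, -1.0, 2.0]      # normalized activations, one value per feature
gamma = [1.5, 0.5, 2.0]       # batchnorm scale
beta  = [0.1, -0.2, 0.3]      # batchnorm shift
W = [[1.0, 2.0, -1.0],
     [0.5, 0.0, 3.0]]         # following linear layer
b = [0.2, -0.1]

# Path A: apply gamma and beta, then the linear layer
y_a = affine(W, b, [g * x + bt for g, x, bt in zip(gamma, x_hat, beta)])

# Path B: fold gamma into the weight columns and beta into the bias
W_folded = [[w * g for w, g in zip(row, gamma)] for row in W]
b_folded = affine(W, b, beta)             # W @ beta + b
y_b = affine(W_folded, b_folded, x_hat)

max_diff = max(abs(a - c) for a, c in zip(y_a, y_b))
```

This only shows that the two parameterizations compute the same function. It says nothing about whether they train equally well, and with zero-padded convolutions folding beta into the bias is not exact at the image borders, because the padded positions hold 0 rather than beta.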

I was wondering the same thing (sorry for digging up an old post).
Could we not reset/reinitialize the weights of the dead activations once in a while to give them a fresh start in training?