Yeah I had the same question also. Has this been answered by some chance already? I opened a PR about this: https://github.com/fastai/fastai_docs/pull/107
How do we overcome the tuple error? I am running the notebook in Google Colab.
Just a quick suggestion for the fastai team: I’m not a huge fan of having all the callback names as magic strings. It would be easier to have a class full of constants that defines all the callback names, which would save you from having to look at the docs or dig into the source code all the time.
Did you see the use of SimpleNamespace in 11a_transfer_learning?
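A minimal sketch of what the suggestion above could look like with SimpleNamespace. The callback names listed here are illustrative assumptions, and `CB`, `PrintCallback` and `run_stage` are hypothetical names, not part of the library. The point is only that a misspelled attribute fails loudly, while a misspelled string is silently ignored:

```python
from types import SimpleNamespace

# Hypothetical list of callback names, for illustration only.
_names = ["begin_fit", "begin_epoch", "begin_batch",
          "after_loss", "after_batch", "after_epoch", "after_fit"]

# One attribute per name, so an editor can autocomplete CB.after_batch.
CB = SimpleNamespace(**{n: n for n in _names})

class PrintCallback:
    # Hypothetical callback that defines a single stage.
    def after_batch(self):
        return "batch done"

def run_stage(cb, name):
    # Look the method up by name, as a string-based callback system would.
    f = getattr(cb, name, None)
    return f() if f else None

print(run_stage(PrintCallback(), CB.after_batch))   # batch done
print(run_stage(PrintCallback(), "after_bacth"))    # None: the typo goes unnoticed
# CB.after_bacth would raise AttributeError instead of failing silently.
```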
For the curious, here’s a plot of the stats for the latest version of RunningBatchNorm from 07_batchnorm.ipynb after a single epoch.
Note how the mean stays within ±0.15 of 0 and the std stays within ±0.3 of 1 for all layers!
For anyone who wants to go deeper into understanding the Batch Norm technique:
Good day friends, quick question about lesson 10.
When using our own BatchNorm layer:
if bn: layers.append(BatchNorm(nf))
and then running an epoch and looking at the hooks at the start:
for h in hooks[:-1]:
the graphs show that the means of all layers begin at 0 and the stds of all layers begin at 1, all great.
But if I change to the PyTorch BatchNorm layer:
if bn: layers.append(nn.BatchNorm2d(nf, eps=1e-5, momentum=0.1))
then the same hook graphs show that the means still all begin at 0, but the stds do not: the stds of the different layers don’t begin at 1, and they begin from different positions.
I’ve also just checked that something similar happens in the graph that shows the percentage of activations that are near 0:
when using the fast.ai BatchNorm, only 20% or so are near 0,
but when using the PyTorch BatchNorm, the percentage is way higher.
I have checked the code against the notebook and can’t find a glitch.
Why the difference? Thank you for the help.
Hi, if you post the code (or a link) of your experiment, it would be easier to help you out.
Thank you Fabrizio
so the conv_layer is
def conv_layer(ni, nf, ks=3, stride=2, bn=True, **kwargs):
    layers = [nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=not bn),
              GeneralRelu(**kwargs)]
    if bn: layers.append(nn.BatchNorm2d(nf, eps=1e-5, momentum=0.1))
    return nn.Sequential(*layers)
and then I do:
path = datasets.untar_data(datasets.URLs.IMAGENETTE_160)
bs=128
train_dl, valid_dl=get_dls(ll.train, ll.valid, bs, num_workers=4)
tfms = [make_rgb, ResizeFixed(128), to_byte_tensor, to_float_tensor]
il = ImageList.from_files(path, tfms=tfms)
sd = SplitData.split_by_func(il, partial(grandparent_splitter, valid_name='val'))
ll = label_by_func(sd, parent_labeler, proc_y=CategoryProcessor())
data = ll.to_databunch(bs, c_in=3, c_out=10, num_workers=4)
nfs = [64,64,128,128]
cbfs = [Recorder,
        partial(AvgStatsCallback,accuracy),
        CudaCallback,
        partial(BatchTransformXCallback, norm_imagenette)]
learn,run = get_learn_run(nfs, data, 0.9, conv_layer, cbs=cbfs)
with Hooks(learn.model, append_stats) as hooks:
    run.fit(1, learn)
fig,(ax0,ax1) = plt.subplots(1,2, figsize=(10,4))
for h in hooks[:-1]:
    ms,ss,hi = h.stats
    ax0.plot(ms[:10])
    ax1.plot(ss[:10])
plt.legend(range(6));
fig,(ax0,ax1) = plt.subplots(1,2, figsize=(10,4))
for h in hooks[:-1]:
    ms,ss,hi = h.stats
    ax0.plot(ms)
    ax1.plot(ss)
and append_stats is:
def append_stats(hook, mod, inp, outp):
    if not hasattr(hook,'stats'): hook.stats = ([],[],[])
    means,stds,hists = hook.stats
    if mod.training:
        means.append(outp.data.mean().cpu())
thank you again for trying to help
@fabris Fabrizio, here is the notebook with the code. Please see if you can help me spot the mistake that makes the hooks show the means correctly synchronized while the stds appear wrong. Thank you very much:
https://colab.research.google.com/drive/1rip1MFYwxbleZfXwH80lbJW16SNdrLFj
btw, switching to the fast.ai running batchnorm makes it all work great,
so I have switched to the fast.ai running batchnorm. I am still curious why the std does that when using PyTorch’s batchnorm layer.
I was watching the video and noticed that the train time for using running batch norm is just over twice that of using the normal batch norm. This seems like a large slowdown and I was wondering what might cause it and if it is a concern?
In the 07 notebook, why do we have parameters gamma and beta to add and multiply after batch norm if we are going to have a linear transformation immediately after? I can see having an add (beta) if the following layer has no bias, but the scaling factor seems like wasted computation.
I ran 50 trials with 1) the original RunningBatchNorm, 2) removing gamma, 3) removing beta and gamma and adding back bias to conv layer, 4) removing gamma and placing Batchnorm before the ReLU. I found no statistical difference in the validation error for 1), 2), or 3), (p=.8) but 4) was statistically better (p=.00001). Also, the original took roughly 7% longer to run than any of the alternatives.
It looks like it is best to have batchnorm BEFORE the non-linearity, at least in the MNIST 1 epoch example. This results in the epoch running 7% faster with a 15% lower error rate. It should also be fine to remove gamma as long as the activation function is a simple ReLU. Am I missing something?
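The redundancy argued above can be checked on a toy case. This is a sketch in plain Python with made-up numbers, and it only covers the case where the normalized output feeds straight into a linear layer with nothing in between. `fold_affine` is a hypothetical helper, not something from the notebook. It shows the algebra only and says nothing about how gamma and beta affect training dynamics:

```python
def affine(x_hat, gamma, beta):
    # Per-feature scale and shift applied after normalization.
    return [g * x + b for g, x, b in zip(gamma, x_hat, beta)]

def linear(x, W, bias):
    # Plain matrix-vector product plus bias.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W, bias)]

def fold_affine(W, bias, gamma, beta):
    # W @ (gamma * x + beta) + bias == (W * gamma) @ x + (W @ beta + bias)
    W_f = [[w * g for w, g in zip(row, gamma)] for row in W]
    b_f = [sum(w * b for w, b in zip(row, beta)) + b0 for row, b0 in zip(W, bias)]
    return W_f, b_f

x_hat = [0.5, -1.0, 2.0]                         # made-up normalized activations
gamma, beta = [1.5, 0.5, 2.0], [0.1, -0.2, 0.3]  # made-up scale and shift
W, bias = [[1.0, 2.0, 3.0], [-1.0, 0.5, 0.0]], [0.2, -0.1]

two_step = linear(affine(x_hat, gamma, beta), W, bias)
W_f, b_f = fold_affine(W, bias, gamma, beta)
one_step = linear(x_hat, W_f, b_f)

# Both routes agree up to floating point rounding.
same = all(abs(a - b) < 1e-9 for a, b in zip(two_step, one_step))
print(same)
```

With a non-linearity between the scale and the linear layer, the folding above no longer applies directly, which is why the "simple ReLU" condition in the post matters.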
I was wondering the same thing. (sorry for digging up an old post)
Could we not reset/reinit the weights behind the dead activations once in a while, so they get a fresh start for training?
In the last line of the callback function, we have:
if cb: return cb(self, *args)
the “self” here refers to the callback instance object.
This “self” does not denote the ModifyingCallback instance object.
Hopefully this helps.
I have a very basic question about using softmax. I am not sure if I understood what Jeremy explained in the lesson.
Is it a good practice to use categorical cross entropy as the loss function (it uses softmax), if we are not going to use softmax in production code for our model predictions? Or should we use binary cross entropy always, even if we have a single-label problem?
I have trained an accurate classifier using categorical cross entropy, and I use the output probabilities for several tasks, such as ignoring unknown images. I do not apply softmax when predicting on images, but I used it during training.
Is it a good practice?
Thank you!!
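To make the question concrete, here is a small sketch in plain Python with made-up logits (not output from any real model). It shows two things: softmax does not change which class has the largest score, and softmax outputs always sum to 1 even when every logit is low, whereas per-class sigmoids can all stay small:

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

logits = [-3.0, -2.0, -4.0]                  # made-up: the model is not keen on any class
probs = softmax(logits)
sigs = [sigmoid(z) for z in logits]

print(argmax(logits) == argmax(probs))       # the ranking is unchanged by softmax
print(max(probs) > 0.5)                      # softmax still picks a "confident" winner
print(all(s < 0.5 for s in sigs))            # every sigmoid stays below 0.5
```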
It seems that at the end of notebook 06_cuda_cnn_hooks_init.ipynb, we don’t actually use GeneralReLU, because we don’t pass any arguments to get_learn_run as **kwargs, so we end up using regular ReLU. Am I missing something?
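A scalar sketch of why that would happen, assuming GeneralRelu takes optional leak, sub and maxv arguments that all default to None. This mirrors the notebook as I understand it, but it is written in plain Python rather than as an nn.Module, and `general_relu` is a hypothetical name:

```python
def general_relu(x, leak=None, sub=None, maxv=None):
    # Leaky slope for negatives if leak is given, plain ReLU otherwise.
    if leak is not None:
        y = x if x > 0 else x * leak
    else:
        y = max(x, 0.0)
    if sub is not None: y -= sub             # shift the output down
    if maxv is not None: y = min(y, maxv)    # clamp from above
    return y

xs = [-2.0, -0.5, 0.0, 1.0, 8.0]
plain = [max(x, 0.0) for x in xs]
defaults = [general_relu(x) for x in xs]
tuned = [general_relu(x, leak=0.1, sub=0.4, maxv=6.0) for x in xs]

print(defaults == plain)   # with no kwargs it behaves like a regular ReLU
print(tuned)
```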
Hi,
I got an error message when trying to use .var((0,2,3),keepdim=True) in batchnorm. The error says that dim takes an int and not a tuple. I was informed that this operation is only available in the nightly version of PyTorch.
I installed the nightly version with:
pip install torch -f https://download.pytorch.org/whl/nightly/cu90/torch.html
My CUDA version is 9.1, though. I noticed this only installed torch 1.2.0 and did not install torchvision.
I ran the notebook to test whether it works, and I am getting this error message
The issue now is that PyTorch does not have a version for CUDA 9.1, though I have the normal PyTorch and fastai working properly. I created a new environment just to be able to run .var using the nightly version.
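One possible workaround if the installed PyTorch does not accept a tuple for dim (this is my own sketch, not from the notebook): move the channel axis to the front, flatten everything else, and reduce over a single int dimension. `var_per_channel` is a hypothetical helper name, and the input is assumed to be a 4D (batch, channel, height, width) tensor:

```python
import torch

def var_per_channel(x):
    # x: (N, C, H, W). Put channels first, then flatten batch and spatial dims.
    c = x.shape[1]
    flat = x.transpose(0, 1).contiguous().view(c, -1)
    # Reduce over one int dim, then restore the (1, C, 1, 1) keepdim shape.
    return flat.var(1).view(1, c, 1, 1)

x = torch.randn(8, 3, 5, 5)
v = var_per_channel(x)
print(v.shape)
```

The same reshape should work for the mean by swapping `.var(1)` for `.mean(1)`.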
Hi,
Can someone explain the m.register_forward_hook(partial(f, self)) part that is used inside the Hook class?
When Jeremy did the same with a function, he used m.register_forward_hook(partial(append_stats, i)) where i is the index of the current module.
How can the self in the partial act the same as the index i?
Thanks!
Hi @Vertigo42, here is my take:
- append_stats is implemented twice: the 1st time it is given an index i so that it knows which list in the list-of-lists to append to; the 2nd time it is given a hook so that it knows to store data inside that hook instance.
- m.register_forward_hook is PyTorch functionality available on any nn.Module instance m. It lets you attach a hook to the module by passing it any function, with one catch: the function has to take 3 and only 3 args, (mod, inp, outp).
- Jeremy/Sylvain’s workaround is to “preload” the function with other args beyond those 3, so that by the time it is passed to PyTorch, the function has access to everything (directions, access, references) it needs to write the data (here stats) wherever it needs to.
- That’s where partial comes in: it preloads the function with the index in the 1st implementation of append_stats; it preloads the hook instance itself, which is created with that function, in the 2nd and final implementation.
- So when you call the constructor of Hook with a function f that we know takes 4 args, it preloads f with the first arg (the instance itself), which is the destination of all the writing append_stats is going to do, and f is “disguised” as a 3-arg function for PyTorch.
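The preloading described above can be sketched without PyTorch. `ToyModule` below is a hypothetical stand-in that only imitates the one behaviour that matters here, calling each registered hook with exactly (mod, inp, outp); the real nn.Module.register_forward_hook does more (for instance it returns a handle so the hook can be removed):

```python
from functools import partial

class ToyModule:
    # Hypothetical stand-in for nn.Module: hooks get exactly (mod, inp, outp).
    def __init__(self):
        self.hooks = []
    def register_forward_hook(self, f):
        self.hooks.append(f)
    def __call__(self, x):
        out = x * 2
        for f in self.hooks:
            f(self, x, out)
        return out

class Hook:
    def __init__(self, m, f):
        # partial(f, self) turns the 4-arg f into a 3-arg callable.
        m.register_forward_hook(partial(f, self))

def append_stats(hook, mod, inp, outp):
    # The first arg is the Hook instance that partial preloaded.
    if not hasattr(hook, "stats"):
        hook.stats = []
    hook.stats.append(outp)

m = ToyModule()
h = Hook(m, append_stats)
m(1)
m(5)
print(h.stats)   # [2, 10]
```

Passing `partial(append_stats, i)` instead would preload an index in exactly the same way; only the type of the preloaded first argument differs.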
I am attaching a link to my UBER-annotated version of the notebook, where I comment on every line of code that does something significant. Very verbose, but that’s how I helped myself understand what is going on while watching the video a 2nd time and running the code.
Colab notebook - you can open it…
Hope it all helps
Lamine
Thanks for the quick and detailed answer.
The part that confused me was that I thought register_forward_hook is a function that requires three parameters, but I now understand from your answer that it requires a function that has three parameters.
This way the implementation totally makes sense.
Thanks again.