Also, if you look at slide 32 of Dmytro Mishkin’s “CNNS - FROM THE BASICS TO RECENT ADVANCES” (2016), his metrics over different architectures (of that time) show that whether placing BN before or after works better really depends on the dataset and the specifics of your architecture.
So it’s probably best to place it after, as discussed, but also to test placing it before and compare the two.
@t-v @stas unfortunately that doesn’t quite work. I had to put this back to get good results in the final bs=32 cell:
self.sums.detach_()
self.sqrs.detach_()
Otherwise it doesn’t have the gradients that it needs to get good results.
However it doesn’t actually skip any computation at the moment, since you’re checking self.batch%2, which is always true, since bs is even. You should instead create and check an iteration counter - if you do that, you’ll find you still have the dreaded “Trying to backward through the graph a second time, but the buffers have already been freed” error!..
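To illustrate the point about the parity check, here is a small sketch. It assumes `self.batch` is a running total that grows by the batch size on every call (modelled here as a plain integer); `it` is a hypothetical iteration counter, not a name from the notebook.

```python
bs = 32      # even batch size, as in the final cell
batch = 0    # stands in for self.batch (assumed: running total of samples seen)
it = 0       # hypothetical iteration counter

batch_checks, iter_checks = [], []
for _ in range(4):
    batch += bs
    it += 1
    batch_checks.append(batch % 2 == 0)  # a sum of even numbers is always even
    iter_checks.append(it % 2 == 0)      # alternates from one iteration to the next

print(batch_checks)  # [True, True, True, True]
print(iter_checks)   # [False, True, False, True]
```

Only the counter-based check changes between iterations, so only that one can actually select which iterations to skip.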
Do you mean with skipping some stats re-calculations? or are you saying that replacing:
self.sums.detach_()
self.sqrs.detach_()
with:
x = x.detach()
and not skipping any calculations has a detrimental impact on the outcome?
However it doesn’t actually skip any computation at the moment, since you’re checking self.batch%2 , which is always true, since bs is even. You should instead create and check an iteration counter - if you do that, you’ll find you still have the dreaded “Trying to backward through the graph a second time, but the buffers have already been freed” error!..
So now all those running temps will no longer do the right thing, since they will get changed by backprop and we want them to be fixed.
Moreover you are detaching them in the wrong place. You detach them at the beginning of update_stats, but then you make a calculation on them which involves undetached x and they end up being on the graph again! So you want to detach them after all calculations are done if you don’t detach x. But as I have shown above this is not right either, since a whole bunch of other temps are now on the graph and will be “adjusted” by the net.
Now going back to the very original implementation as it was presented in the class (with dbias), we get:
not leaf ['sums', 'sqrs']
want grad ['mults', 'adds', 'sums', 'sqrs']
So it wasn’t detaching them either!
Only after you move them to the end of update_stats:
self.sums.detach_()
self.sqrs.detach_()
you get:
not leaf []
want grad ['mults', 'adds']
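The effect of where the detach happens can be reproduced on a toy object. This is only a sketch: `ToyStats`, `report` and `run` are hypothetical names, the update is reduced to two `torch.lerp` calls, and the real `update_stats` has more temporaries than this.

```python
import torch

class ToyStats:
    """Toy stand-in for the layer: two learnable tensors, two running stats."""
    def __init__(self, nf=3):
        self.mults = torch.ones(nf, requires_grad=True)
        self.adds = torch.zeros(nf, requires_grad=True)
        self.sums = torch.zeros(nf)
        self.sqrs = torch.zeros(nf)

    def update_stats(self, x, detach_at):
        if detach_at == "start":
            self.sums.detach_(); self.sqrs.detach_()
        # x is still attached, so the results below land on the graph again
        self.sums = torch.lerp(self.sums, x.sum(0), 0.1)
        self.sqrs = torch.lerp(self.sqrs, (x * x).sum(0), 0.1)
        if detach_at == "end":
            self.sums.detach_(); self.sqrs.detach_()

def report(obj, names=("mults", "adds", "sums", "sqrs")):
    """Return (names that are not leaves, names that require grad)."""
    not_leaf = [n for n in names if not getattr(obj, n).is_leaf]
    want_grad = [n for n in names if getattr(obj, n).requires_grad]
    return not_leaf, want_grad

def run(detach_at):
    m = ToyStats()
    x = torch.randn(8, 3) * m.mults + m.adds  # x depends on the learnable tensors
    m.update_stats(x, detach_at)
    return report(m)

print(run("start"))  # (['sums', 'sqrs'], ['mults', 'adds', 'sums', 'sqrs'])
print(run("end"))    # ([], ['mults', 'adds'])
```

Detaching at the start has no lasting effect because the very next operation mixes in the attached `x`; detaching at the end leaves only the two learnable tensors wanting gradients.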
I’m still trying to wrap my head around this detach thing, so please bear with me if I’m saying an incorrect thing. If what I described above is correct, then you were getting good results not because of the better BN (or at least not just because of it), but because your temps were actually backpropagated, so the stats weren’t calculated on the running averages, but on running averages that are also variables that are learnable - i.e. the network was messing (in a good way) with those numbers that we intended to be fixed. Does this make sense?
And this in a long way answers why you get the error. You tried to skip calculations on variables that are on the graph and that’s why you get the error.
If you detach all of those other temp vars, you won’t get the error. i.e. finish update_stats with:
l = "sums sqrs count means varns s ss c".split()
for a in l: getattr(self,a).detach_()
The originally proposed:
x = x.detach()
at the very beginning of update_stats does the same thing, but more efficiently, since the code doesn’t need to swap temp vars back and forth to require grads and then not.
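A minimal sketch of that idea, outside the actual class: the statistics are computed from a detached copy of `x`, so none of the temporaries join the graph, while the output still uses the attached `x` and gradients still reach the upstream weight. The names `w`, `mean` and `var` are illustrative only.

```python
import torch

torch.manual_seed(0)
w = torch.ones(3, requires_grad=True)  # stands in for a learnable weight upstream
x = torch.randn(8, 3) * w              # activations, attached to the graph

xd = x.detach()                        # same values, no graph history
mean = xd.mean(0)                      # statistics are now plain constants
var = xd.var(0)

out = (x - mean) / (var + 1e-5).sqrt() # normalisation still uses the attached x
(out ** 2).sum().backward()

print(mean.requires_grad, var.requires_grad)  # False False
print(w.grad is not None)                     # True
```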
To conclude: decide which variables are to be fixed and controlled only by you, and which are learnable, and then use detach accordingly. Perhaps a significant part of the magic of RunningBatchNorm is a side-effect of a coding mistake.
“Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.”
What I understood from Thomas is that pytorch compares the graph from the previous batch with the graph generated by the current batch - if the new graph lacks a node that was in the old graph that’s when you get this error.
That’s why you must repeat calculations that place all the variables that were in the graph on the very first batch, and they can’t be skipped.
I don’t know enough pytorch yet to explain why dynamically skipping a node is considered to be an error. So this is only a circumstantial explanation.
@t-v, please kindly correct me if my explanation is incorrect or incomplete.
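For what it is worth, the error can be reproduced in isolation. The sketch below assumes the trigger is a tensor kept between iterations that still points into the previous batch's graph, whose saved buffers were freed by the earlier `backward()`; `train` and `fails` are hypothetical helpers, not code from the notebook.

```python
import torch

w = torch.ones(3, requires_grad=True)

def train(n_steps, detach):
    running = torch.zeros(3)             # state carried across iterations
    for _ in range(n_steps):
        x = torch.randn(4, 3) * w        # fresh graph for every batch
        running = running + x.sum(0)     # running now hangs off this batch's graph
        loss = (x * running).sum()
        loss.backward()                  # frees the buffers of the graph it walked
        if detach:
            running = running.detach()   # cut the link before the next batch
    return running

def fails(detach):
    try:
        train(2, detach)
    except RuntimeError as e:
        return "second time" in str(e)
    return False

print(fails(detach=False))  # True
print(fails(detach=True))   # False
```

In the failing case, the second `backward()` follows `running` back into the first batch's graph, which has already been freed.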
To get around my conflict issues in conda with pytorch nightly build:
In the past we used local directories to import from and there were no conda installs. So the basic use of datasets is to load data from the net. If fastai is cloned to your local environment or even just the datasets part and imported from there you could work without a conda fastai version.
I took three items from the GitHub clone:
datasets.py
core.py
the imports directory with all its contents
and placed them in a fastai directory at the same level as exp.
After using your step 1 above for the conda no-fastai environment, I installed these additions:
fastprogress, jupyter, pyyaml and yaml, requests
This gave me an environment which has minimal fastai and which on the running of 08_data_block works fine at least until the image of a man with a TENCH
I also installed for my own use
pandas, pandas-summary, sklearn-pandas scipy
I hope my memory serves me right here so in case it doesn’t
I’m not sure what problem you’re trying to solve, @RogerS49 - just install fastai in whatever way you like - conda, pip, local checkout and it just works with the part2 lessons.
Well, I got rid of my conflict issues in conda with the pytorch nightly build. This makes more sense to me, as what’s in those packages and dependencies is not really fastai, it seems, except around data URLs. I managed to run the whole of the 08_data_block notebook, though I may run into other dependency issues I am not aware of. Thanks for your reply.
so now when you load a notebook from fastai_docs/dev_course/dl2 it will use these local fastai modules since '' (nb dir) is always in sys.path. This way you don’t have any dependency conflicts to deal with since you’re not using any package manager here.
This is more or less what you suggested you did above, just easier since you don’t need to go and fish out specific files from the fastai modules.
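The import mechanism this relies on can be checked with a throwaway package. In this sketch `localpkg_demo` is a hypothetical package name and a temporary directory stands in for the notebook directory; nothing here touches a real fastai checkout.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Temporary directory standing in for the notebook directory
nb_dir = Path(tempfile.mkdtemp())

# Hypothetical local package placed next to the notebooks
pkg = nb_dir / "localpkg_demo"
pkg.mkdir()
(pkg / "__init__.py").write_text("WHERE = 'local checkout'\n")

# Directories earlier in sys.path win, so a local copy is found first
sys.path.insert(0, str(nb_dir))
importlib.invalidate_caches()
mod = importlib.import_module("localpkg_demo")

print(mod.WHERE)                                      # local checkout
print(Path(mod.__file__).resolve().parent == pkg.resolve())  # True
```

Inspecting `__file__` of an imported module in the same way is a quick check of which copy a notebook actually picked up.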
I just quickly plotted the layer norm vs batch norm for the sunny vs foggy day to double-check Jeremy’s thought on why layer norm doesn’t do well. And the plots confirmed it.
Sunny road and foggy road before (top row) and after (bottom row) applying layer norm
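The same comparison can be made numerically. This sketch assumes the argument is that layer norm normalises each image by its own mean and variance, so overall brightness and contrast are lost; the "sunny" and "foggy" tensors are synthetic stand-ins built from one random scene, not the images in the plots.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scene = torch.rand(3, 16, 16)        # hypothetical road scene, values in [0, 1)
sunny = 0.6 + 0.4 * scene            # bright, high contrast
foggy = 0.45 + 0.2 * scene           # washed out, low contrast
batch = torch.stack([sunny, foggy])  # shape (2, 3, 16, 16)

# Layer norm: statistics per image, over channels and pixels
ln = F.layer_norm(batch, batch.shape[1:], eps=1e-6)
# Batch norm: statistics per channel, shared across the whole batch
bn = F.batch_norm(batch, None, None, training=True)

print(torch.allclose(ln[0], ln[1], atol=1e-2))    # True: the two images become indistinguishable
print(bn[0].mean().item() > bn[1].mean().item())  # True: sunny is still brighter than foggy
```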
Although I can’t offer a resource, I can offer empathy. I was fairly relieved when Jeremy noted that his utility function for loading images took a week to develop. I would’ve felt like throwing in the towel if he had said he wrote it while eating breakfast one morning.
Personally, while I totally get the idea of patterns and clean code, I find many books and articles on the subject verbose, sometimes a little dogmatic, and always dry to read, and I do not necessarily agree with the details (honestly, I typically find they make simple things complex). Maybe it’s like writing a book about salsa dancing: style matters, but you only get it on the dance floor (never ever from a book).
the clean code reference looks like good housekeeping rules.
Concerning refactoring, I have not read the book. However, Martin Fowler is one of my heroes, and with a foreword by Erich Gamma (one of the authors of the GoF book) it doesn’t get better.
I think that design patterns are important in the same way that we expect certain components to be standardised when building a house. It is just too much mental overhead (and often short-sighted) if everybody invents their own personal way of doing things. This is not to say that a design pattern is implemented in identical ways in every language, but the concept should transcend languages.
I know some people have this point of view, and that’s fine. Personally however I find the exact opposite - I’ve found trying to shoehorn things into a set of predefined patterns limits my thinking and is harder for me to understand than avoiding that idea entirely.
Many design patterns (if not all, AFAIK) focus on the object-oriented programming paradigm. We are dealing with a mix of object-oriented, functional and dataflow paradigms. This makes OO patterns partially applicable, but not that useful within the bigger picture. We need a new methodology and new design patterns to emerge.
Fastai programming style gives us an interesting example and insights into what these patterns might be. Fastai offers examples of well thought-through use of decorators, closures, partials and compose. I wish Software Engineering methodology researchers paid more attention to it.
I like your point about the over-emphasis on OO patterns. If I found a book with coding patterns that look beyond languages and specific paradigms, it would definitely be worth the read, and the fast.ai code is the best source I am aware of. I still suspect that programming, like speaking a language, is a skill where you can’t become a master just by learning grammar and some elegant ways to express yourself.