The workflow for neural nets, and for machine learning in general, is quite different from conventional programming. One such difference is the approach to testing. When you are building a database, a compiler, or another similar project, you usually write unit or integration tests to verify that things are working as expected. Of course, not everything can be tested easily (it's harder for GUIs or device drivers, for example), but people still try, especially on serious projects.
ML and NNs are different because you don't have a simple and reliable pass/fail test for whether the model is doing the right thing. As John Carmack said in his blog post about his first experience with NNs:
> It is interesting that things still train even when various parts are pretty wrong — as long as the sign is right most of the time, progress is often made.
This issue bites hardest when someone is trying to implement a model from a paper, port a model to another environment, or figure out why the model's performance dropped after a platform update, etc.
Essentially, there are 2 main questions:
How to implement models so that they can be easily reproduced by a third party?
How to write “unit” tests for models?
Right now, all I can think of is a complete specification of everything you are using, right down to PRNG seeds and algorithms, so that the output can be made deterministic. In theory this would let you write input-output pairs, perhaps together with a loss value, that the model should produce after, say, 10 epochs on a very small dataset. However, I haven't seen anyone doing that.
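To make the idea concrete, here is a minimal sketch in plain Python (no framework, and `train_tiny` is a made-up function, not anything from a real library): if every source of randomness flows from one seed, two runs with the same seed are bit-identical, so you can pin a "golden" result in a test.

```python
import random

def train_tiny(seed, epochs=10):
    """Fit y = w*x on a tiny synthetic dataset with seeded SGD.
    All randomness (data noise, init, shuffling) comes from `seed`,
    so the whole run is deterministic."""
    rng = random.Random(seed)
    data = [(x, 2.0 * x + rng.gauss(0, 0.1)) for x in range(8)]
    w, lr = rng.uniform(-1, 1), 0.01
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad
    loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
    return w, loss

# The reproducibility "unit test": same seed => identical output.
w1, loss1 = train_tiny(seed=42)
w2, loss2 = train_tiny(seed=42)
assert (w1, loss1) == (w2, loss2)
```

With a real framework you'd also have to pin the library's own RNGs and any nondeterministic kernels, which is exactly the "complete specification" problem above.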
And the last question: is it really a problem, or does it just not matter much, as long as the model does something sensible? I'm just so uncomfortable writing such fragile code without any automated testing at all…
It would be a great contribution for us to try to enumerate a more extensive list of types of tests we could create. E.g. checking the 99th percentile of activations and weights after an epoch, checking gradients at every layer (e.g. % of zeros, max gradient, 90th percentile gradient), checking the range of outputs and inputs, etc.
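The per-layer stats above are cheap to compute. A rough sketch with NumPy (the function name and the exact set of stats are just my guesses at what such a tool might report):

```python
import numpy as np

def layer_stats(grads):
    """Summary stats for one layer's gradient (or activation) array:
    fraction of exact zeros, max magnitude, and high percentiles."""
    g = np.asarray(grads, dtype=float).ravel()
    mags = np.abs(g)
    return {
        "pct_zero": float(np.mean(g == 0)),
        "max_abs": float(mags.max()),
        "p90_abs": float(np.percentile(mags, 90)),
        "p99_abs": float(np.percentile(mags, 99)),
    }

rng = np.random.default_rng(0)
g = rng.normal(size=1000)
g[:300] = 0.0                      # simulate dead units
stats = layer_stats(g)
assert stats["pct_zero"] == 0.3    # 300 of 1000 gradients are zero
```

Running this over every layer after each epoch and flagging layers where `pct_zero` jumps or `max_abs` explodes would catch a lot of silent failures.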
I've been thinking about this too. As a professional developer, almost all the ML/DL code I look at looks awful and makes me scared. Also, the idea that people call Jupyter Notebooks "reproducible" is kinda hilarious. I think there are two broad areas that could be really useful:
1.) Easier Diagnostics - These would not be tests, but tools that automatically generate simple graphs, stats, and warnings from running PyTorch models. Stuff like what Jeremy mentioned about checking activations and weights. Or "which ones did I get most wrong?", "most right?", "most unsure?". If those were dead simple, and it was just like "model.diagnose()", that could be killer.
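The "most wrong" diagnostic, for instance, is just a sort over the probability the model assigned to the true class. A sketch (function name is hypothetical, not a real API):

```python
import numpy as np

def most_wrong(probs, labels, k=3):
    """Indices of the k examples where the model gave the lowest
    probability to the true class -- the 'most wrong' predictions."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    true_p = probs[np.arange(len(labels)), labels]
    return np.argsort(true_p)[:k]

probs = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.05, 0.95],
                  [0.60, 0.40]])
labels = np.array([0, 0, 1, 1])
# true-class probabilities: 0.90, 0.20, 0.95, 0.40
assert list(most_wrong(probs, labels, k=2)) == [1, 3]
```

"Most unsure" would be the same idea sorted by entropy or by distance from 0.5 instead.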
2.) Fuzzy Testing - As you said, ML/DL doesn't have simple/reliable pass/fail tests. Or, it does, but it's a royal pain to wrangle things down to that. I feel like you want something fuzzier, to be used with real data, and ideally something that could be written right in a notebook while you're prototyping. Like… for the translate notebook this week, you might have tests that look like this:
```python
## Do questions roughly look right?
# at least 30 percent of questions should include the word 'what'
expect(at_least(0.3), qs[:, 0]).to(include('what'))
# Did we actually set the word vectors from fast text correctly?
# At least 20% of my word vectors should equal a word vector in fast text
# Did we set them on the model correctly?
# If the weights aren't random (and aren't all zeros or ones), I probably did this correctly.
```
Simple stuff like that, that you would feel comfortable doing in a notebook, and that could be super useful as you're pre-processing data or debugging your model. But powerful enough that if you wanted to use it in a regular Python file for a serious test suite, you could.
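The `expect`/`at_least` DSL in the snippet is imaginary, but a minimal version is only a few lines (all names here are mine, purely illustrative):

```python
class _Expectation:
    """Assert that at least some fraction of items satisfies a predicate."""
    def __init__(self, min_frac, items):
        self.min_frac = min_frac
        self.items = list(items)

    def to(self, pred):
        hits = sum(1 for x in self.items if pred(x))
        frac = hits / len(self.items)
        assert frac >= self.min_frac, (
            f"only {frac:.0%} of items matched, wanted >= {self.min_frac:.0%}"
        )

def at_least(frac):
    return frac

def expect(min_frac, items):
    return _Expectation(min_frac, items)

def include(word):
    return lambda s: word in s

qs = ["what is this ?", "what time is it ?", "where are you ?"]
expect(at_least(0.3), qs).to(include("what"))   # passes: 2 of 3 contain 'what'
```

Because it bottoms out in a plain `assert`, the same check drops straight into pytest unchanged.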
I think a lot of the trouble here would be making this kind of stuff performant, so that people find it nice to do, and not a chore.
I'd be down to spike on this (e.g. do a hack-day kind of session on it) and see how it goes, if you're interested.
I think the optimizer will raise an error if you try to pass it the params of a model whose layers are all frozen (which is what happens before fit gets executed). The one thing I've been thinking would be really cool, though, is some nice, concise way of showing the user which layers are frozen / bn_frozen. I know you can use learn.summary(), but that output is amazingly long.
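A compact report is mostly a matter of grouping parameters and collapsing each group to one word. A sketch in plain Python (the `freeze_summary` helper is hypothetical; with a real PyTorch model you would feed it `((n, p.requires_grad) for n, p in model.named_parameters())`):

```python
from collections import defaultdict

def freeze_summary(named_params):
    """Collapse (param_name, requires_grad) pairs into one status
    per top-level group: 'trainable', 'frozen', or 'mixed'."""
    groups = defaultdict(list)
    for name, trainable in named_params:
        groups[name.split(".")[0]].append(trainable)
    return {
        g: "trainable" if all(flags)
           else "frozen" if not any(flags)
           else "mixed"
        for g, flags in groups.items()
    }

params = [("0.conv.weight", False), ("0.conv.bias", False),
          ("1.fc.weight", True),   ("1.fc.bias", True)]
assert freeze_summary(params) == {"0": "frozen", "1": "trainable"}
```

One line per layer group instead of the full summary() dump.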
Or if we had some way of showing layer groups and what the settings for each of them are.