How to do reproducible models and "unit" testing?

The workflow for neural nets, and machine learning in general, is quite different from conventional programming. One such difference is the approach to testing. When you are making a database, a compiler or another similar project, you usually write unit or integration tests to verify that things are working as expected. Of course, not everything can be easily tested (it’s harder for GUIs or device drivers, for example), but people still try to do it, especially for serious projects.

ML and NNs are different because you don’t have a simple and reliable pass/fail test for whether the model is doing the right thing. As John Carmack said in a blog post about his first experience with NNs:

> It is interesting that things still train even when various parts are pretty wrong — as long as the sign is right most of the time, progress is often made.

This issue bites hardest when someone is trying to implement a model from a paper, port a model to another environment, or figure out why the model’s performance dropped after a platform update, etc.

Essentially, there are 2 main questions:

  1. How to implement models so that they can be easily reproduced by a third party?
  2. How to write “unit” tests for models?

Right now, all I can think of is a complete specification of everything you are using, right down to PRNG seeds and algorithms, so that the output can be made deterministic. In theory this would allow you to write input-output pairs, maybe together with a loss value, that the model should produce after, say, 10 epochs on a very small dataset. However, I haven’t seen anyone doing that.
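To make the idea concrete, here is a minimal sketch of that kind of seeded input-output test on a toy pure-Python "model" (a real PyTorch setup would additionally need things like `torch.manual_seed` and `torch.use_deterministic_algorithms(True)`; the model and names here are made up for illustration):

```python
import random

def train_tiny_model(seed, epochs=10):
    """Toy SGD fit of y = 2x with every source of randomness seeded;
    returns the final mean squared loss."""
    rng = random.Random(seed)           # isolated, seeded PRNG
    w = rng.uniform(-1.0, 1.0)          # "random init" controlled by the seed
    data = [(x, 2.0 * x) for x in range(1, 6)]
    lr = 0.01
    for _ in range(epochs):
        rng.shuffle(data)               # data shuffling is also seeded
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Because all randomness is specified, the loss after 10 epochs on this
# tiny dataset is a constant you could pin down in a unit test.
assert train_tiny_model(seed=42) == train_tiny_model(seed=42)
```

The point is just that once seeds, algorithms and data order are all pinned, "loss after N epochs equals X" becomes an ordinary assertion.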

And the last question: is it really a problem, or does it just not matter much, as long as the model does something sensible? I’m just so uncomfortable writing such fragile code without any automated testing at all…


It’s important, hard, and poorly understood. There are a few ideas in this article, and there’s a paper on Unit Tests for Stochastic Optimization.

It would be a great contribution for us to try to enumerate a more extensive list of the types of tests we could create. E.g. checking the 99th percentile of activations and weights after an epoch, checking gradients at every layer (e.g. % of zeros, max gradient, 90th percentile gradient), checking the range of outputs and inputs, etc.
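A framework-agnostic sketch of those stats, assuming you can get activations or gradients out as a flat list of floats (in PyTorch you would collect these with forward/backward hooks; `tensor_stats` is a hypothetical helper name):

```python
import statistics

def tensor_stats(values):
    """Summary stats worth asserting on: % of zeros, max, 90th/99th percentile."""
    n = len(values)
    qs = statistics.quantiles(values, n=100)   # qs[k-1] ~ k-th percentile
    return {
        "pct_zeros": sum(1 for v in values if v == 0.0) / n,
        "max": max(values),
        "p90": qs[89],
        "p99": qs[98],
    }

# e.g. gradients collected from one layer after an epoch
grads = [0.0, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
s = tensor_stats(grads)
assert s["pct_zeros"] == 0.2    # dead-gradient check
assert s["max"] <= 1.0          # exploding-gradient check
```

Each stat maps directly onto one of the failure modes above: lots of zeros suggests dead units, a huge max or 99th percentile suggests exploding gradients.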


I’ve been thinking about this too. As a professional developer, almost all the ML/DL code I look at generally looks awful and makes me scared. Also, the idea that people call Jupyter Notebooks “reproducible” is kinda hilarious. I think there are two broad areas that could be really useful:
1.) Easier Diagnostics - These would not be tests, but tools to provide simple graphs, stats, and warnings that can be generated automatically from running PyTorch models. Stuff like what Jeremy mentioned about checking activations and weights. Or “which ones did I get most wrong?”, “most right?”, “most unsure?”. If those were dead simple, just `model.diagnose()`, that could be killer.
2.) Fuzzy Testing - As you said, ML/DL doesn’t have simple/reliable pass/fail tests. Or it does, but it’s a royal pain to wrangle things down to that. I feel like you want something fuzzier, to be used with real data, and ideally something that could be written right in a notebook when you’re prototyping. Like… for the translate notebook this week, you might have tests that look like this…

```python
## Do the questions roughly look right?
# At least 30% of questions should include the word 'what'
expect(at_least(0.3), qs[:, 0]).to(include('what'))

# Did we actually set the word vectors from fastText correctly?
# At least 20% of my word vectors should equal a word vector in fastText
expect(at_least(0.2), word_vec).to(equal(ft_word_vect))

# Did we set them on the model correctly?
# If the weights aren't random (and aren't all zeros or ones), I probably did this correctly.
```

Like, simple stuff like that, that you would feel comfortable doing in a notebook, and could be super useful as you’re pre-processing data or debugging your model. But powerful enough that if you want to use them in a regular python file for a serious test suite, you could.
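The `expect(at_least(...)).to(...)` API in those examples is hypothetical, but it wouldn’t take much to build. A minimal sketch of one way it could work:

```python
class at_least:
    """Quantifier: at least this fraction of items must satisfy the predicate."""
    def __init__(self, frac):
        self.frac = frac

def include(word):
    return lambda item: word in item

def equal(other):
    return lambda item: item == other

class expect:
    def __init__(self, quantifier, items):
        self.quantifier, self.items = quantifier, list(items)

    def to(self, predicate):
        hits = sum(1 for item in self.items if predicate(item))
        frac = hits / len(self.items)
        assert frac >= self.quantifier.frac, (
            f"only {frac:.0%} matched, expected at least {self.quantifier.frac:.0%}")

# Usage, mirroring the notebook-style test above:
questions = ["what is x ?", "what time is it ?", "where is y ?"]
expect(at_least(0.3), questions).to(include("what"))   # passes: 2/3 match
```

Because it’s just plain classes and assertions, the same calls would work in a notebook cell or inside a pytest file.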

I think a lot of the trouble here would be making this kinda stuff performant, so that people find it nice to do this, and not a chore.

I’d be down to spike on this (eg. do like a hack-day kinda session on it) and see how it goes, if you’re interested.


I’d be interested to work on this as well, but not until the course is over. :-/

yeah totally. sounds good.

Yes these examples are great. They’re integration tests, and it’s OK if they’re not that fast - maybe you just run them at the end of the day, and only for the module you’ve been working on.


Yes, I’d love to do a kind of mini remote hackathon/brainstorm on the topic, but possibly after the course. In the meantime, we could read the paper from Jeremy’s answer, or something similar.

I’m thinking that one very simple thing we could do is add something to each fit() call to check which layers changed during the call, and report the ones which didn’t change.

Or maybe refuse to fit if all the layers are frozen?
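Both ideas above could be sketched framework-agnostically, assuming you can snapshot parameters as a dict of name → list of floats (in PyTorch you would copy tensors out of `model.state_dict()` before and after `fit()`; `unchanged_layers` is a made-up helper name):

```python
def unchanged_layers(params_before, params_after):
    """Return the names of layers whose parameters did not move at all."""
    return [name for name, before in params_before.items()
            if before == params_after[name]]

# Snapshots taken before and after a hypothetical fit() call:
before = {"conv1": [0.1, 0.2], "fc": [0.5, 0.5]}
after  = {"conv1": [0.1, 0.2], "fc": [0.4, 0.6]}   # only "fc" was trained

frozen = unchanged_layers(before, after)
assert frozen == ["conv1"]   # conv1 never updated -- worth a warning

# The stricter variant: refuse to fit if *everything* is frozen.
if len(frozen) == len(before):
    raise RuntimeError("refusing to fit: all layers are frozen")
```

The report-unchanged version is probably friendlier than the hard failure, since some layers are frozen on purpose.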


I think the optimizer will raise an error if you try to pass it the params of a model that is entirely frozen (which is what happens before fit gets executed). The one thing I’ve been thinking would be really cool, though, is some nice, concise way of showing the user which layers are frozen / bn_frozen. I know you can do learn.summary(), but that output is amazingly long.

Or some way of showing the layer groups and what the settings for each of them are.
