Nicely said. That’s certainly how my process works - I don’t think I’m smart enough to do it any other way…
Oh lord that’s a question that’s so damn hard to answer and we’re all struggling with it!
We’ll definitely need a new thread for this pretty soon. I used to do TDD for everything I wrote, but nowadays with Jupyter Notebook my process is very different - much more based on interactive visual testing as I go. But tests will be required to ensure that future PRs, refactorings, etc. don’t break things (which has often happened in v0).
One thing that @Sylvain and I are finding helpful at the moment is to have a very strong CIFAR-10 benchmark for both accuracy and speed. It runs in <10 mins, and gives us very strong confidence that we haven’t screwed anything up if we see our accuracy and speed maintained. And if we do something that we expect to make things faster or more accurate, and we don’t see that in our CIFAR-10 results, it indicates that we may have stuffed something up.
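To make that concrete, here is a rough sketch of what such a check could look like in code. It is only an illustration, not our actual benchmark script: the `BenchmarkResult` class, the `find_regressions` function, the tolerances, and the numbers in the usage lines are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    """Outcome of one benchmark run (hypothetical container for this sketch)."""
    accuracy: float  # validation accuracy in [0, 1]
    seconds: float   # wall-clock training time


def find_regressions(baseline, current, acc_tol=0.005, time_tol=0.10):
    """Compare a new run against a stored baseline.

    Returns a list of messages, one per regression found; an empty list
    means accuracy and speed were both maintained within tolerance.
    The default tolerances are arbitrary placeholders.
    """
    problems = []
    # Accuracy may wobble a little between runs, so allow an absolute margin.
    if current.accuracy < baseline.accuracy - acc_tol:
        problems.append(
            f"accuracy dropped: {baseline.accuracy:.4f} -> {current.accuracy:.4f}"
        )
    # Timing is noisier, so allow a relative margin.
    if current.seconds > baseline.seconds * (1 + time_tol):
        problems.append(
            f"training slowed: {baseline.seconds:.0f}s -> {current.seconds:.0f}s"
        )
    return problems


# Illustrative numbers only, not real CIFAR-10 results.
baseline = BenchmarkResult(accuracy=0.94, seconds=500.0)
new_run = BenchmarkResult(accuracy=0.92, seconds=510.0)
for message in find_regressions(baseline, new_run):
    print(message)
```

The same comparison works in the other direction: if a change was meant to speed things up or improve accuracy and the new run merely matches the baseline, that is also worth a second look.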
Obviously this doesn’t tell us where we stuffed something up - but the biggest issue by far in my experience in ML is not being aware at all that you have a problem; instead, you just get less good models, but you don’t even realize they’re not as good as they could be. A couple of examples from this recent post from Smerity: