Automated testing and deep learning?


(Julian Ramirez) #1

I’m curious whether anyone has experience using automated tests in their ML projects. During last night’s lecture for Lesson 8, Jeremy mentioned how challenging it can be to work on a deep learning project because of the lack of feedback. It seems like this could be an opportunity for tests to provide at least some feedback that you’re on the right path… or at least to double check that you don’t have any obvious or subtle bugs in your implementation that are chipping away at your accuracy.

It’s likely unreasonable to have an all-encompassing test like “my model should find the couch and draw a square around it”, but it could be super helpful to double check that a helper function you define in the notebook does the right thing for the expected input and returns the correct shape in the output. That said, Jupyter is pretty awesome and lessens the need for tests since it makes it so easy to just poke a function and see what it does!
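
A minimal sketch of that kind of helper check, in plain Python so it can sit in a notebook cell. The helper `corners_to_xywh` and its box layout are made up for illustration; they are not from the lesson or from any library.

```python
def corners_to_xywh(box):
    """Hypothetical helper: [x_min, y_min, x_max, y_max] -> [x_min, y_min, width, height]."""
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]


def test_corners_to_xywh():
    out = corners_to_xywh([10, 20, 50, 80])
    # Shape check: four numbers in, four numbers out.
    assert len(out) == 4
    # Value check on an input whose answer is easy to work out by hand.
    assert out == [10, 20, 40, 60]
    # Edge case: a zero-area box should give zero width and height.
    assert corners_to_xywh([5, 5, 5, 5]) == [5, 5, 0, 0]


test_corners_to_xywh()
```

Calling the test at the bottom of the cell means it reruns every time the helper is edited, which is roughly the feedback loop described above.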

This blog post does a nice job of explaining why you may want to test your ML code, and its follow-up library for TensorFlow here.

I’m happy to explore this idea myself (hopefully I can turn it into a blog article later!) but I also would welcome anyone else’s experience or suggestions!


(Jeremy Howard (Admin)) #2

For computer vision, I always have tests in my notebook that things “look right”. For other data types, it takes a bit more creativity…


(Jason McGhee) #3

And here I thought this thread was going to be about using a generative model to create automated tests for web apps (for use with Selenium).


(Sven) #4

How about a new set of data? In addition to the training/validation/test sets, you’d have a mandatory “unit test” set. It could be much smaller than the other sets, but passing it is mandatory, and regressions would be reported by a continuous integration system. E.g. in GitLab you could require that all the regression tests pass before you can check in new code (changes to the model).
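
A rough sketch of what such a set could look like as a test that a CI job runs. Everything here is an assumption for illustration: `MUST_PASS_CASES`, `predict`, and `find_regressions` are hypothetical names, and the toy `predict` stands in for loading and calling a real model.

```python
# Small, hand-picked examples whose labels must never regress (hypothetical data).
MUST_PASS_CASES = [
    ({"width": 4.0, "height": 1.0}, "wide"),
    ({"width": 1.0, "height": 3.0}, "tall"),
    ({"width": 2.5, "height": 2.0}, "wide"),
]


def predict(example):
    """Stand-in for the real model; a real project would load trained weights here."""
    return "wide" if example["width"] > example["height"] else "tall"


def find_regressions(predict_fn, cases):
    """Return (inputs, expected, actual) for every case the model gets wrong."""
    failures = []
    for inputs, expected in cases:
        actual = predict_fn(inputs)
        if actual != expected:
            failures.append((inputs, expected, actual))
    return failures


def test_no_regressions():
    failures = find_regressions(predict, MUST_PASS_CASES)
    assert not failures, f"{len(failures)} mandatory case(s) regressed: {failures}"


test_no_regressions()
```

A test runner such as pytest would pick up `test_no_regressions` and fail the pipeline when the assertion fails, which is the point at which the CI system can block the change.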


(nczx98w3.nsaonw3) #5

I have always thought of training/validation as a built-in form of TDD. Does that count in itself?


I was reading the article and thinking that it is not exactly TDD. It is more like a guard system or, stretching it a little, a linter/guard system (because linters steer you away from bad practices).

The objective of TDD is to not let you write more code until you have solved the failing test. General tests like this (checking what was there before and after), though, are more about making sure that the “black boxes” you just executed actually did something. You can’t expect them to guide your work, because you do that by watching and modifying parameters. For real TDD you would need to write a test where you input this and get that, but that is the validation phase, which runs at the end anyway, and it gives a percentage rather than an exact answer… Though for detected/matched things you already know, maybe you can write some tests.
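
For what it’s worth, a “before and after” guard of this kind might look like the sketch below. It is plain Python with a hand-written gradient step on a toy linear model, so `sgd_step` and `mse` are illustrative assumptions and not the article’s code or any library’s API.

```python
def mse(weights, xs, ys):
    """Mean squared error of y = w * x + b on the given points."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sgd_step(weights, xs, ys, lr=0.1):
    """One gradient-descent step on the mean squared error."""
    w, b = weights
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return (w - lr * grad_w, b - lr * grad_b)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1

before = (0.0, 0.0)
after = sgd_step(before, xs, ys)

# Guard 1: the step actually changed every parameter.
assert after[0] != before[0] and after[1] != before[1]
# Guard 2: the loss went down on the data the step was computed from.
assert mse(after, xs, ys) < mse(before, xs, ys)
```

Neither guard says the model is good; they only catch the case where the black box ran and nothing happened.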

I know it is a little strange to say that what is written there is not TDD, but the problem arises from running black boxes.

By the way, the referenced post https://blog.openai.com/openai-baselines-dqn/ is also helpful in understanding why this isn’t exactly TDD.


That said, it would be nice if github.com/fastai/fastai had its own set of linters/guards that we could use 🙂.

And probably we can use something like this https://stackoverflow.com/a/13404866/682603 (but maybe only write to the file at the end?) or just don’t write at all!!!


And finally, apart from nice linters for fastai, it would be nice to integrate https://github.com/lanpa/tensorboard-pytorch. Maybe there we could have a little more of a TDD view, not only best practices (I think TDD can only fit library inputs and outputs that are direct/exact matches, not probabilistic ones).


(blake west) #6

Yeah, for an issue like this, it’s important to really get specific on what you’re trying to test. Testing that data transformations did what you expect would be one category. Testing that models are hooked up correctly is another (that seems to be what the blog post was talking about). Testing that your final model gives correct labels is yet another. Often, the point of testing is to help you narrow down where your error could be occurring. If you have no tests, and the only thing you know is that your model sucks, it could be happening anywhere.

So I think it’s useful to think about each stage of the pipeline, and then design tests (or a library/framework) that focus on each stage. Like, data collection, data munging, data splits/prep before training, model building, model training, model selection, model evaluation. I’d say each one of those stages could be helped with tests. And I think a lot of the problem is that no one has written a “probabilistic” framework for tests yet. Often ML isn’t really suited to “When X goes in, I expect Y to come out”. Usually it’s messier than that. You want to be like… “After this transformation, I expect that around 80% of the values are between 0.2 and 0.8, and 100% are between 0 and 1.” Or whatever. I think a library like that is totally doable. But no one has written it yet. And I think that’s a good place for DL / ML to go.
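
As a rough idea of what one such “probabilistic” assertion could look like, here is a sketch in plain Python. The helper names `fraction_within` and `assert_fraction_within` are hypothetical, the sigmoid is just a stand-in for “this transformation”, and the 0.1 tolerance is an arbitrary choice.

```python
import math
import random


def fraction_within(values, low, high):
    """Fraction of values that fall in the closed interval [low, high]."""
    values = list(values)
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


def assert_fraction_within(values, low, high, expected, tolerance=0.0):
    """Fail unless the fraction inside [low, high] is within tolerance of expected."""
    actual = fraction_within(values, low, high)
    assert abs(actual - expected) <= tolerance, (
        f"expected about {expected:.0%} of values in [{low}, {high}], got {actual:.1%}"
    )


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Seeded so the check is reproducible from run to run.
rng = random.Random(0)
raw = [rng.gauss(0.0, 1.0) for _ in range(5000)]
squashed = [sigmoid(x) for x in raw]

# "100% are between 0 and 1."
assert_fraction_within(squashed, 0.0, 1.0, expected=1.0)
# "Around 80% of the values are between 0.2 and 0.8."
assert_fraction_within(squashed, 0.2, 0.8, expected=0.8, tolerance=0.1)
```

The tolerance is the part that makes this different from an ordinary unit test: it states how far the distribution may drift before the check should fail.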


(Jeremy Howard (Admin)) #7

Well said. This is something I’ll be trying to demonstrate through example in every lesson of part 2. If anyone notices me not checking an intermediate step properly, please let me know so I can fix it!


(Sam) #8

The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction (PDF attachment, 488.3 KB)

Google wrote a very good paper on testing Machine Learning Models!