Well, that’s why it’s in the incoming queue - I haven’t thought it through yet, I’m just feeding the queue based on user reports - someone reported that their training got ~6 times slower. It could be totally unrelated to fastai, but we have no numbers to tell one way or the other.
I have since done a bit of reading on the topic, and I don’t think this is something we could instrument in the test suite. And this is definitely not a problem of how it is implemented - so it’s not about decorators. It’s about how we could do a portable benchmark, given that every execution on the same computer is non-deterministic, and then how to measure across completely different systems. It’s a very hard problem.
Most likely the simplistic approach would be (a rough sketch in code follows the list):
1. checkout a release tag A
2. run the test suite N times, throw out outliers, and average the rest (e.g. using dumbbench)
3. checkout another release tag B
4. repeat (2)
5. compare the results
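Something along these lines is what I have in mind - just a rough sketch, assuming the suite is run via `pytest` and that plain wall-clock time is good enough for a first pass; the tag names and the repetition count are placeholders:

```python
# Rough sketch only: compare total wall-clock time of the test suite between
# two release tags. Assumes we're inside a git checkout and that `pytest tests`
# runs the suite; tag names and N are placeholders.
import statistics
import subprocess
import time

N = 5  # repetitions per tag (placeholder)

def time_test_suite():
    start = time.perf_counter()
    subprocess.run(["pytest", "-q", "tests"], check=True)
    return time.perf_counter() - start

def benchmark_tag(tag):
    subprocess.run(["git", "checkout", tag], check=True)
    times = sorted(time_test_suite() for _ in range(N))
    trimmed = times[1:-1] if len(times) > 2 else times  # crude outlier trimming
    return statistics.mean(trimmed)

mean_a = benchmark_tag("1.0.42")  # release tag A (placeholder)
mean_b = benchmark_tag("1.0.46")  # release tag B (placeholder)
print(f"A: {mean_a:.1f}s  B: {mean_b:.1f}s  change: {(mean_b / mean_a - 1) * 100:+.1f}%")
```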
Now let’s say tag B got slower by 20% - how do you know which of the many tests led to that? And what if some functions got much faster while others got much slower, compensating for each other? It’s a very complex problem.
If that were solved, finding the guilty commit should be easy: bisect the checkouts and repeat the same 5-step process above.
The non-deterministic nature of the machine learning algorithms can be remedied with a fixed seed, so that part is not a problem if done correctly.
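By fixing the seed I mean something like the following - the exact set of calls depends on which libraries are in play, this just covers the usual suspects for a PyTorch-based setup:

```python
# Fix the seeds so that repeated runs do the same amount of work.
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cudnn's auto-tuner and non-deterministic kernels can still introduce
    # run-to-run variation, so pin those down as well:
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Note that forcing deterministic cudnn kernels changes the absolute speed, but that’s fine as long as both tags are measured under the same settings.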
And, of course, all of that has to be done on the same setup, i.e. you can’t compare my results to yours.
Of course, that doesn’t mean this can’t be done to detect obvious regressions, where performance drops by 20+%. We probably need to write a few test-like benchmarks and re-run them occasionally on the same setup against various recent tags. If the CI setup gives instances with identical specs every time, that would be a good way to automate this. Alternatively, all the benchmarks could be run on the same instance.
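A test-like benchmark could look roughly like this with pytest-benchmark (one of the tools listed at the end) - the workload here is a made-up placeholder, a real one would call into fastai’s training loop instead:

```python
# Sketch of a test-like benchmark: the `benchmark` fixture from pytest-benchmark
# times the callable for us and handles repetitions and statistics.
import numpy as np

def expensive_workload():
    x = np.random.rand(1000, 1000)
    return x @ x  # stand-in for a short training/inference step

def test_workload_speed(benchmark):
    benchmark(expensive_workload)
```

The numbers it reports could then be compared across tags, as long as everything runs on the same instance.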
And then Jeremy would probably detect a speed regression just by looking at the speed of the progress bar, so perhaps we should just ask him.
Thoughts and ideas are welcome.
p.s. But then the fastai API is still unstable, so how would one benchmark code when different releases require different APIs? The benchmark or the test suite would have to be different for each release. I hadn’t thought of that right away, and this apples-to-oranges complication probably puts the last nail in the coffin, at least until the API is stable.
Here are some useful materials I have just read/skimmed on this topic:
- pytest-benchmark documentation (pytest-benchmark 4.0.0)
- https://pyperformance.readthedocs.io/ (compares different python versions/implementations)
- How to do performance micro benchmarks in Python - Peterbe.com
- tsee/dumbbench on GitHub (more reliable benchmarking without thinking) and “Your benchmarks suck!” by Steffen Mueller (blogs.perl.org)
- My journey to stable benchmark, part 1, part 2 (deadcode) and part 3 (average)