Benchmarking fastai

stas · January 5, 2019, 5:46pm

This topic is for discussing and implementing a fastai benchmark, so that we have a way to detect regressions in fastai performance. This project is not about comparing fastai to other frameworks.

The current stage of this dev project is discussing ways it could be implemented.

Here are some useful materials on this topic:

Welcome to pytest-benchmark’s documentation! — pytest-benchmark 3.4.1 documentation
https://pyperformance.readthedocs.io/ (compares different python versions/implementations)
How to do performance micro benchmarks in Python - Peterbe.com
GitHub - tsee/dumbbench: More reliable benchmarking without thinking && Your benchmarks suck! | Steffen Mueller [blogs.perl.org]
My journey to stable benchmark, part 1, part 2 (deadcode) and part 3 (average)
Tracemalloc: tracemalloc — Trace memory allocations — Python 3.10.7 documentation
Pyflame: A Ptracing Profiler For Python — Pyflame 1.4.0 documentation

This is a wiki post, so please improve it if you have something to contribute.

At some point later, this effort should get synced with automated testing, see this thread: Improving/Expanding Functional Tests

stas · January 5, 2019, 3:02am

Well, that’s why it’s incoming - I haven’t thought it through, just feeding the queue based on user reports - as someone reported his training got ~6 times slower. It could be totally unrelated to fastai, but we have no numbers to tell one way or the other.

I have since done a bit of reading on the topic, and I don’t think this is something we could instrument in the test suite. This is definitely not a problem of how it is implemented, so it’s not about decorators. It’s about how we could do a portable benchmark, given that every execution on the same computer is non-deterministic, and then how to measure across completely different systems. It’s a very hard problem.

Most likely the simplistic approach could be:

1. checkout a release tag A
2. run the test suite N times, throw out outliers, and average the rest (e.g. using dumbbench)
3. checkout another release tag B
4. repeat (2)
5. compare the results
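A minimal sketch of step 2, using only the Python standard library. The function names `drop_outliers` and `robust_time`, the `n_runs` default, and the median-absolute-deviation rule with cutoff `k` are assumptions made for illustration; dumbbench uses its own, more careful procedure.

```python
import statistics
import time


def drop_outliers(samples, k=3.0):
    """Keep samples within k median-absolute-deviations (MAD) of the median."""
    med = statistics.median(samples)
    mad = statistics.median(abs(s - med) for s in samples)
    if mad == 0:
        # At least half the samples equal the median, so the MAD gives
        # no usable scale; keep everything rather than reject arbitrarily.
        return list(samples)
    return [s for s in samples if abs(s - med) <= k * mad]


def robust_time(func, n_runs=10, k=3.0):
    """Time func n_runs times, drop outliers, return the mean of the rest (seconds)."""
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.mean(drop_outliers(samples, k))
```

Calling `robust_time` on the same workload after checking out tag A and then tag B would give the two numbers to compare in step 5.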

Now let’s say tag B got slower by 20%: how do you know which of the many tests led to that? And what if some functions got much faster while others got much slower, compensating for each other? It’s a very complex problem.
If that were solved, finding the guilty commit should be easy by bisecting the checkouts and repeating the same 5-step process above.
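One way to see the compensation problem is to compare timings per test rather than in total. The sketch below is illustrative only: the function `compare_runs`, the test names, and every timing are invented, and the 20% threshold is simply the figure used in this post.

```python
def compare_runs(times_a, times_b, threshold=0.20):
    """Return per-test relative change (B vs A) and the tests slower than threshold."""
    changes = {}
    for name in sorted(times_a.keys() & times_b.keys()):
        changes[name] = (times_b[name] - times_a[name]) / times_a[name]
    regressions = [name for name, change in changes.items() if change > threshold]
    return changes, regressions


# Hypothetical timings in seconds, invented for illustration only.
tag_a = {"test_fit": 10.0, "test_lr_find": 4.0, "test_predict": 6.0}
tag_b = {"test_fit": 14.0, "test_lr_find": 2.0, "test_predict": 4.0}

changes, regressions = compare_runs(tag_a, tag_b)
total_change = (sum(tag_b.values()) - sum(tag_a.values())) / sum(tag_a.values())

print("total change:", total_change)   # the suite as a whole looks unchanged
print("regressions:", regressions)     # yet one test slowed down noticeably
```

With these made-up numbers the total runtime is identical for both tags, while one test is 40% slower, which is exactly the case a suite-level average would hide.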

The non-deterministic nature of the machine learning algorithms can be remedied with a fixed seed. So this one is not a problem if done correctly.
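A rough sketch of fixing the seed, assuming the benchmark script may or may not have numpy and PyTorch installed. The helper name `set_seed` and the default value are hypothetical; the library calls used (`random.seed`, `numpy.random.seed`, `torch.manual_seed`, `torch.cuda.manual_seed_all`) are the standard seeding functions.

```python
import random


def set_seed(seed=42):
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # torch not installed; nothing to seed
```

A fixed seed makes the sequence of random numbers repeatable, but some GPU kernels can still be non-deterministic, so timings and results may vary slightly from run to run even with identical seeds.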

And, of course, all that has to be done on the same setup, i.e. you can’t compare my results to yours.

Of course, that doesn’t mean this can’t be done to detect obvious regressions, where performance drops by 20+%. We probably need to write a few test-like benchmarks and re-run those occasionally on the same setup against various recent tags. If the CI setup provides instances with identical specs every time, that would be a good way to automate this. Alternatively, all the benchmarks could be run on the same instance.

And then I’m sure Jeremy would probably detect a speed regression by just looking at the speed of the progress bar. So perhaps we should just ask him.

Thoughts and Ideas are welcome.

p.s. But then the fastai API is still unstable, so how would one benchmark code if different releases require a different API? The benchmark or the test suite would be different between releases. I didn’t think of it right away, and this apples-to-oranges complication probably puts the last nail in the coffin, at least until the API is stable.

Benudek · January 5, 2019, 10:29am

@stas I would suggest we open a separate thread ‘Performance GPU Test’ and discuss and design the issue there, while linking these 2 topics here and listing this new task here: Dev Projects Index

It’s probably best to keep this task simple: here we do simple functional tests with asserts and fake data, which should deliver repeatable results on any CPU.

What you describe would require more data and GPU runs, and could produce different results on different machines. It might also require adding some smart debugging code for tracking down memory consumption. And one might want to consider adding some of those performance tests (not all, too slow I suppose) to run on commits to find culprits immediately.

Maybe at some point one would run both layers of tests with a simple flag-switch on some occasions, e.g. before major releases. Often though, these tests would run independently of each other.

Next to the links you shared, I could imagine other Deep Learning projects have best practices one might want to check here. Suggesting we discuss this topic with @sgugger and as you mentioned, probably Jeremy would also have an opinion here.

Kaspar · January 5, 2019, 10:59am

I think it would be great progress just to have some perf measurement running each day. Any severe regression in the measured area would then stick out, like a couple of weeks ago, when the perf of LM training on mac dropped 25% overnight.

Benudek · January 5, 2019, 11:03am

@Kaspar : no doubt it would be super helpful and relevant. I just suggest we separate different levels of testing and ensure they can run independently of each other.

Performance Tests can be notoriously complex, even more so if they are supposed to work as automated regression tests (which actually they should!). It would be great stuff though, hence I suggest we open a separate thread and define requirements and make a design there.

Happy to help there too, e.g. if there is a repeatable pattern and sth like a simple switch ‘perf - test this object’, could add this while touching test classes for the simple tests.

Benudek · January 6, 2019, 10:10am

@stas thinking about it, I wonder if the best approach for this might be to redefine the task as a documentation chapter instructing folks how to profile their fast.ai code.

If one automates this, e.g. with nightly runs, chances are one would get a lot of false positives claiming that sth in the code is slow, which also raises the question of what a proper time benchmark per platform and task would be anyway.

If we had a ‘Profiling’ chapter at https://docs.fast.ai/ and picked an example (e.g. the one that was reported to you) on one cloud platform, e.g. Google, then whoever believes sth is slow could analyze the issue in detail and then report it. We could give example test scripts for fit / fit_one_cycle with a bulkified version of fakes.py for data.

In the long run, such a chapter could also be used to do some performance due diligence before major releases. I would also suggest trying to find a profiling mechanism that doesn’t require code changes and could be used in production, e.g. https://pyflame.readthedocs.io/en/latest/

Just my 10 cents ;-) With more time on my hands, would love to help.

@sgugger @Kaspar

Benudek · January 6, 2019, 10:47am

To elaborate on my point, @stas: following the example in your link above, one can easily time a simple test function:

def test_fit(capsys):
    learn = fake_learner()
    learning_rate = 0.001
    weight_decay = 0.01
    learn.fit(epochs=3, lr=learning_rate, wd=weight_decay)
    assert learn.opt.lr == learn.lr_range(learning_rate)

with e.g.

def test_my_stuff(benchmark):
    # benchmark something
    result = benchmark(test_fit)
    assert result == 123

But what should replace the ‘123’ in the assert? Just taking the median of some runs, as essentially described here and here in your links above, sounds good for ‘normal’ code, but doing that for a fit function would require some extensive runs on different platforms and GPUs and would be highly dependent on the specific setup chosen.

So, maybe in profiling, one would need to differentiate between the learners and ‘normal code’. While one could probably come up with metrics for how long ‘normal code’ should run at most, I doubt there would be too many issues there. Analyzing the fit function might be more sth for a manual task, if one suspects an issue.

Interested in your PoV

stas · January 6, 2019, 5:33pm

@Benudek, please re-read Benchmarking fastai - #3 by stas. Checking absolute numbers measured in time is not a proper benchmark. It simply won’t work because there are too many moving parts: the code never runs on a deterministic, identical system, so you can’t ever get 2 identical runs. This approach would be very unstable and misleading, and would give a ton of false alarms.

Benudek wrote: “@stas thinking about it, I wonder if the best approach for this might be to redefine the task as a documentation chapter instructing folks, how to profile their fast.ai code.”

Again, we aren’t talking about benchmarking users’ code. You’re inviting even more trouble, since this will lead to an even less controlled environment, with users forgetting this and that, and will result in a ton of false alarms. We need to benchmark fastai in a controlled environment.

IMHO, what we need is to start with a few simple training sequences (say, one text, one vision) that use a fixed seed and don’t take hours. They would run on a CI, and the benchmark would run the same sequences multiple times as described here (dumbbench or something like it) for the current code base and a few other tags, say the last 5 releases. This requires that the API of those releases relevant to those sequences doesn’t change, so that we compare apples to apples. Over time more releases will be added so we get better reference points.

Then as the process gets polished and we learn more about it (nuances of fastai, pytorch, and DL code in general), more sequences can be added and hopefully the API stops fluctuating to support that or only the stable API is benchmarked.

Benudek · January 6, 2019, 5:48pm

ok, thx. yes - a controlled environment where we can generate repeatable results is ideal. I actually wonder about a scenario where user A runs this on AWS and user B on GCP and we see differences. Analyzing that might be hard to automate imho, hence the idea of a documentation section to guide a manual analysis. Anyway, hopefully that is more of an edge case to be considered later.

As you said, I would think some tests followed by a small design would be the next step.

stas · January 6, 2019, 5:52pm

You will always see a difference in such a case. You must benchmark apples with apples, and oranges with oranges, AWS is apples, GCP is oranges in this case.
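To keep apples with apples in practice, each stored result could carry a fingerprint of the setup it was measured on, and only results with matching fingerprints would be compared. The sketch below is one possible approach, not an existing fastai facility: the function names, the choice of fields, and the `extra` keys such as `provider` are all assumptions.

```python
import hashlib
import json
import platform


def environment_fingerprint(extra=None):
    """Return (details, short_hash) describing the machine the benchmark ran on."""
    details = {
        "machine": platform.machine(),
        "system": platform.system(),
        "python": platform.python_version(),
    }
    # Caller-supplied facts the standard library cannot detect,
    # e.g. cloud provider, instance type, or GPU model.
    details.update(extra or {})
    digest = hashlib.sha256(json.dumps(details, sort_keys=True).encode()).hexdigest()
    return details, digest[:12]


def comparable(result_a, result_b):
    """Only results recorded under the same fingerprint should be compared."""
    return result_a["env"] == result_b["env"]
```

For example, a run recorded with `extra={"provider": "AWS"}` and one recorded with `extra={"provider": "GCP"}` get different hashes, so `comparable` returns `False` and the pair is skipped.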

Benudek · January 6, 2019, 5:55pm

ah yes, yes pls keep it simple for me GCP AWS

Maybe I get to do some tests, would love to check that. Not sure I’ll find time soon

krishnakalyan3 · March 6, 2019, 6:32pm

Thank you for the detailed discussion on performance testing. I would like to summarise performance testing below (Please let me know if I have missed out something)

Benchmarking Unit Test over N runs for every release
CPU comparison FastAI / PyTorch / TF (training and inference)
GPU comparison FastAI / PyTorch / TF (training and inference)
Integrating performance test suite to nightly builds

If everything looks okay, I would like to begin by working on CPU comparison and eventually get to all of the above points mentioned.

stas · March 6, 2019, 11:26pm

I’m not quite sure what you’re proposing, @krishnakalyan3.

Perhaps, if you use some examples it’d be easier to understand your intention.

But, of course, don’t let my comment stop you from doing what you think will be helpful. I just don’t understand what that is…

And while you’re at it, remember that DL training is done on GPU and inference on CPU/GPU so, ideally, GPU benchmarking should be of the highest priority. While pytorch training runs just as well on CPU, most of the time it’s much much slower and thus you’d be addressing a tiny segment of use.

krishnakalyan3 · March 7, 2019, 12:11pm

For now I am just proposing benchmarking CPU inference. If possible, compare it with other frameworks like TensorFlow.

This inference will be done on pre-trained models like resnet / vgg etc. Do you think there is value in doing this?

stas · March 7, 2019, 4:11pm

I’m a beginner in DL, so hopefully some other more experienced users can answer that.

My main intention behind starting this discussion was to make sure we compare fastai to fastai - to ensure that the code base doesn’t regress over time.

krishnakalyan3 · March 8, 2019, 3:31am

@stas thank you for your amazing contributions to the fast ai library. I am a beginner in DL too. Comparing fast ai to fast ai is the next thing I will take up.

stas · March 8, 2019, 3:35am

Thank you for your kind words, @krishnakalyan3.

Your main challenge will be the ever-changing API. How does one compare an apple to an apple if you can’t trust it’s still an apple tomorrow? Well, it’s still an apple, but it may grow horns. So this is going to be difficult, as you will need to implement the exact task w/o any variations, but constantly adjust to the new API.

Tom2718 · March 11, 2019, 7:46pm

It might be of interest to this thread, but you can get the number of FLOPs and parameters using this (small) library. I made a PR and added all the stuff for fastai models, except the Flatten module which can be added as a custom module without any ops. In particular I wanted to count FLOPs of different models in this Kaggle kernel - which also handles Flatten.

stas · March 11, 2019, 8:42pm

This is interesting, @Tom2718. Thank you!

This will measure only the python side of things, correct? i.e. not CUDA which is where most of the heavy lifting is done. If we don’t measure the whole system, how can we tell that we may have a regression, other than telling that there is more code being run on the python side?

Tom2718 · March 12, 2019, 1:52pm

I’m not so sure I understand what you mean. The number of FLOPs and parameters is not specific to python but rather to the model itself. It’s abstracted away from whether you run on a GPU or CPU. It’s not the entire picture of course but can indicate relative performance.
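To illustrate why these counts depend only on the model, here is a small worked sketch of the usual textbook formulas for a dense layer and a 2D convolution. The function names are made up for this example, and it counts multiply-accumulates (MACs); tools differ on whether they report MACs or count each MAC as two FLOPs, so the linked library may use a different convention.

```python
def linear_layer_cost(in_features, out_features, bias=True):
    """Parameters and MACs for one input vector passed through a dense layer."""
    params = in_features * out_features + (out_features if bias else 0)
    macs = in_features * out_features
    return params, macs


def conv2d_layer_cost(in_ch, out_ch, kernel, out_h, out_w, bias=True):
    """Parameters and MACs for one image passed through a square-kernel 2D conv."""
    params = in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)
    # Every output position of every output channel needs in_ch * kernel^2 MACs.
    macs = in_ch * out_ch * kernel * kernel * out_h * out_w
    return params, macs


# A 512 -> 10 classifier head and a 7x7 conv from 3 to 64 channels
# producing a 112x112 feature map (shapes chosen for illustration).
print(linear_layer_cost(512, 10))
print(conv2d_layer_cost(3, 64, 7, 112, 112, bias=False))
```

Nothing in these formulas refers to the device, which is why the counts are the same on CPU and GPU; they indicate relative cost between models but say nothing about how fast a given system executes them.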