Shortest failing test sequence finder (coupled tests detector)

Project:

Write a tool that can reduce a long sequence of failing tests:

pytest testmodule1::testa testmodule1::testb testmodule1::testc testmodule2::testa testmodule2::testb testmodule2::testc  testmodule3::testa testmodule3::testb testmodule3::testc 

to the minimal sequence that still fails, e.g.:

pytest testmodule1::testa testmodule2::testa testmodule3::testc 

The problem is to find a minimum sequence of tests that leads to failure, starting from a long one.

Say you run make test and one test fails (out of currently 240 tests). You run that test alone and it’s fine. You try to bisect the list of tests and it’s fine too. So it’s a certain combination of the preceding tests that leads to that one test’s failure. One or more tests leave unclean state behind, which then makes that other test fail. So in a way we have a coupled-tests situation.

The brute force approach is to take the list, remove one test at a time and retest, but we are talking about 240 tests at the moment with a total run of ~2min, so it’d be quite slow. Bisecting is not the right approach either, because a group of tests could be involved. So what efficient algorithm would you suggest to reduce 240 tests to, say, a sequence like 5, 149, 200? There is no telling how many tests are involved, but most likely 2-3, which may help in choosing an efficient algorithm.

To top it off, the failure is not consistent and happens once in 5 runs on average, so each candidate sequence will need to be re-run multiple times - but that’s unrelated; it just makes the reduction process even slower.

The current failure I’m dealing with has to do with gpu mem measurements that seem to be affected by previous calls to the gpu, so it probably has to do with some leaking objects and some tests not succeeding at a clean teardown.

I was hoping to find a pytest plugin that did that, but couldn’t find any.

Would anybody be interested to contribute such functionality? That would be very useful to the fastai project and surely to other projects using pytest as well.

This post shows how to create a test suite that reproduces the challenge that this project is trying to solve.

There are several parts to this interesting project and you’re welcome to contribute to any or all parts if you want the challenge and, of course, feel you can figure it out:

  1. Part 1a. find an efficient algorithm, see this post
  2. Part 1b. optimize to eliminate items with the slowest execution time first, see this post
  3. Part 2a. integrate it into pytest as a plugin, see this post
  4. Part 2b. Extend the plugin to generate the initial list itself, see this post

Thank you.


Part 1a. Find an efficient algorithm to solve the problem.

In the first part, don’t worry about the test suite. Let’s start with a pure list-reduction CS algorithm question.

The question is very simple: given a list of length N, find a minimal sub-list according to a certain criterion. In this case the criterion is that the nth item behaves differently (fails) after a specific sub-list has been executed, but behaves normally otherwise.

I wrote a simple demo program to demonstrate the issue:

x = 0
y = 0
def black_box(i):
    global x, y
    if i == 5: x +=1 
    if i == 100: 
        if x: y +=1 # order is important! 
    if i == 230: assert not (x and y), f"{i} failed"

# need to implement the following using black_box as crit_func
def find_min_fail_sequence(items, crit_func):
     min_sequence = []
     # your code here
     return min_sequence

N=240
t = list(range(N))
print(f"Failing sequence: { find_min_fail_sequence(t, black_box) }")
# expecting: 5, 100, 230 for the above black_box implementation

# examples of success/failure
black_box(t[230]) # no problem on its own
for i in t[100],t[5],t[230]: black_box(i) # no problem (wrong order)
for i in t[:100]+t[101:]:   black_box(i) # no problem (100 was skipped)
for i in t:                  black_box(i) # fails since 5, then 100 were run

Output:

Failing sequence: []
[...]
---> 25 for i in t:                  black_box(i) # fails since 5, then 100 were run
----> 8     if i == 230: assert not (x and y), f"{i} failed"
[...]
AssertionError: 230 failed

Problem: find the secret i’s that lead to the failure of black_box(230) in an efficient way - i.e. ideally in many fewer than N iterations.

  • you don’t know how many i’s lead to a problem, but the algorithm should be optimized for a sequence of 2-3 items (i.e. only 2-3 i’s are coupled, which covers 99% of real-life situations); it should still be possible to search for slightly larger coupled groups
  • the order of the i’s is important, so it must not be changed
  • you can’t see the code of black_box
  • you can’t see x and y (or know their relevance); these globals emulate the hidden state that affects the tests
  • the goal is the fewest possible invocations of black_box, whose outputs can’t be cached - the real application is running tests, so we want to find the failing sequence in the least amount of time

Currently, the worst-case brute-force approach would take about 4 hours to complete. And unlike the black_box example, the test failure in the real problem is inconsistent and will need to be re-run 5 times on average - that’s 20 hours of runtime and a lot of wasted electricity. And it’ll only grow as we add more tests.
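For a concrete baseline, here is a minimal sketch of the naive “remove one item at a time and retest” reduction described above - the slow approach, not the efficient algorithm this part is asking for. The names reduce_one_at_a_time and run_sequence are mine; for simplicity the criterion here is a whole-sequence runner rather than the per-item crit_func of the stub, and run_sequence resets the demo’s hidden globals before each trial to stand in for the fresh state a real pytest run starts from:

def reduce_one_at_a_time(items, run_sequence):
    seq = list(items)
    i = 0
    while i < len(seq):
        candidate = seq[:i] + seq[i+1:]  # try the sequence without the i-th item
        if run_sequence(candidate):      # still fails -> the item isn't needed
            seq = candidate
        else:
            i += 1                       # needed to reproduce the failure, keep it
    return seq

def run_sequence(seq):
    global x, y
    x = y = 0                            # emulate a fresh pytest run
    try:
        for i in seq:
            black_box(i)
        return False                     # the whole sequence passed
    except AssertionError:
        return True                      # the sequence still reproduces the failure

print(reduce_one_at_a_time(t, run_sequence))  # -> [5, 100, 230], after N trials, i.e. O(N^2) black_box calls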

Part 1b. Further optimize the algorithm to eliminate the slowest tests first.

So, consider that the definition of the problem changes from just having a list of items t to the same list, except that now we also know the exec time of each item.
So the initial input:

N=240
t = list(range(N))

now changes to:

import random

random.seed(42) # get reproducible results while experimenting
N=240
t = [[i, random.randint(1,100)] for i in range(N)]

so each item is:

[seq_num, exec_time]

Since some tests are much slower than others, we want to eliminate slow tests from the list first, which can significantly reduce the total run-time. Of course, at the end we will have the real exec time of each test, but for figuring out an efficient algorithm random numbers are just fine.

You will most likely want to set a RNG seed while you develop this as I have shown in the code above.
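To make this concrete, here is one possible sketch (the name and approach are my own, not the required algorithm): attempt removals in descending exec_time order, so that slow tests drop out of all subsequent trials as early as possible. run_sequence is assumed to execute the tests identified by the seq_num fields, in order, and return True if the run fails:

def reduce_slowest_first(items, run_sequence):
    seq = list(items)                                             # items are [seq_num, exec_time] pairs
    for item in sorted(seq, key=lambda it: it[1], reverse=True):  # try the slowest items first
        candidate = [it for it in seq if it is not item]
        if run_sequence(candidate):   # still fails without this item,
            seq = candidate           # so it isn't part of the coupled group
    return seq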

Part 2a. Make a pytest-coupled-detector pytest plugin.

So that we could invoke it on a long sequence of failing tests:

pytest -findcoupled testmodule1::testa testmodule1::testb testmodule1::testc testmodule2::testa testmodule2::testb testmodule2::testc testmodule3::testa testmodule3::testb testmodule3::testc

or most likely it’d be just the test modules:

pytest -findcoupled testmodule1 testmodule2 testmodule3

and it’d spit out:

testmodule1::testa testmodule2::testa testmodule3::testc

We don’t need to wait for an efficient algorithm before starting on this part of the project. The implementation of this part can be started right away.

Use the brute-force “remove one item at a time and retry” algorithm to start with. Then, once Part 1a and optionally Part 1b are completed, those can be integrated into Part 2a.

It should probably be relatively easy to create this pytest plugin by mimicking some existing plugins that do somewhat similar things.
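Until such a plugin exists, a tiny external driver gives the flavour of what Part 2a needs to do. This is only a sketch under my own assumptions: the script name and helpers are made up, the reducer is the brute-force one from Part 1a, and a failing run is detected via pytest’s exit code 1 (tests were collected and run, but some failed):

# find_coupled.py (hypothetical name) - reduce a failing sequence of test ids
# by repeatedly re-running pytest on sub-sequences
import subprocess, sys

def run_sequence(test_ids):
    if not test_ids: return False   # nothing to run can't reproduce the failure
    # pytest exit code 1 means tests ran and at least one failed
    return subprocess.run([sys.executable, "-m", "pytest", "-q", *test_ids]).returncode == 1

def reduce_one_at_a_time(items, run_sequence):  # same brute-force reducer as in Part 1a
    seq, i = list(items), 0
    while i < len(seq):
        candidate = seq[:i] + seq[i+1:]
        if run_sequence(candidate): seq = candidate
        else:                       i += 1
    return seq

# usage: python find_coupled.py tests/test_1.py::test_c tests/test_2.py::test_c tests/test_3.py::test_c
if __name__ == "__main__":
    print("minimal failing sequence:", " ".join(reduce_one_at_a_time(sys.argv[1:], run_sequence)))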

Final bonus part:

Part 2b. Extend the plugin to generate the initial list itself

So that all we would do is:

pytest -findcoupled 

and it’d just go and run all the tests several times in different orders with repetitions, and if it finds any failures it’d then reduce them to the shortest failing sequence.

So most likely we can borrow some functionality from the random-order plugins I mentioned in the previous post to seed the initial list, and then, if and when failures are found, apply the reduction algorithm to that long list.
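The seeding step could look something like this sketch (find_failing_order is an illustrative name; in the real plugin the shuffling would most likely come from a random-order plugin, and run_sequence is the same “run this sequence and tell me whether it failed” helper as before):

import random

def find_failing_order(all_test_ids, run_sequence, max_tries=20):
    # run the suite in random orders until one order reproduces the failure;
    # that order then becomes the input to the reduction algorithm
    for _ in range(max_tries):
        order = list(all_test_ids)
        random.shuffle(order)
        if run_sequence(order):
            return order
    return None  # no failing order found within max_tries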

Prepare a test suite

It should be easy to prepare a fake test suite that emulates the problem. For example, this is how I’d quickly create one (on unix):

mkdir tests
echo -e "import os, os.path;\nif os.path.exists('./tmp1'): os.remove('./tmp1')\nif os.path.exists('./tmp2'): os.remove('./tmp2')" > tests/conftest.py
echo -e "def test_a(): assert True\ndef test_b(): assert True\ndef test_c():\n    with open('./tmp1', 'w') as f: f.write('')\n    assert True" > tests/test_1.py
echo -e "def test_a(): assert True\ndef test_b(): assert True\ndef test_c():\n    with open('./tmp2', 'w') as f: f.write('')\n    assert True" > tests/test_2.py
echo -e "from os.path import exists\ndef test_a(): assert True\ndef test_b(): assert True\ndef test_c(): assert not (exists('./tmp1') and exists('./tmp2'))" > tests/test_3.py
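For readability, this is what tests/conftest.py and tests/test_3.py end up containing (tests/test_1.py and tests/test_2.py are identical to each other except that their test_c writes ./tmp1 and ./tmp2 respectively):

# tests/conftest.py - deletes the state files at the start of each pytest run
import os, os.path
if os.path.exists('./tmp1'): os.remove('./tmp1')
if os.path.exists('./tmp2'): os.remove('./tmp2')

# tests/test_3.py - test_c fails only if both state files were left behind by earlier tests
from os.path import exists
def test_a(): assert True
def test_b(): assert True
def test_c(): assert not (exists('./tmp1') and exists('./tmp2'))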

If you look at the resulting files, the sequence: tests/test_1.py::test_c tests/test_2.py::test_c tests/test_3.py::test_c will trigger a failure.

Let’s validate by listing just the test modules, which by default will trigger this problem too.

pytest tests/test_1.py tests/test_2.py tests/test_3.py

and indeed we get the failure:

――――― test_c ―――――――
>   def test_c(): assert not (exists('./tmp1') and exists('./tmp2'))
E   AssertionError: assert not (True and True)
E    +  where True = exists('./tmp1')
E    +  and   True = exists('./tmp2')

tests/test_3.py:4: AssertionError

This was on the test module level, but we want to get to the test level, so that the failing sequence is:

pytest tests/test_1.py::test_c tests/test_2.py::test_c tests/test_3.py::test_c

Indeed, it fails.

But if we swap the first test, so that it runs:

pytest tests/test_1.py::test_a tests/test_2.py::test_c tests/test_3.py::test_c

it works just fine - the test suite doesn’t fail.

We can also run the test suite in a test module order that doesn’t trigger the problem:

pytest tests/test_3.py tests/test_2.py tests/test_1.py

This succeeds.

Note that I created conftest.py, which resets the state before each pytest run.

Hm, that’s an interesting topic. So am I right that the idea here is to find coupled tests with some shared or “global” state that affect each other instead of being isolated? And the ultimate goal is to use this tool to make all tests standalone and without hidden side effects?

Correct.

And the ultimate goal is to use this tool to make all tests standalone and without hidden side effects?

Correct. Once it tells you which tests are coupled, the developer can figure out how the different tests impact each other.

Sometimes it’s a test not doing a clean teardown (i.e. a badly written test). Other times it’s a bug in the software being tested.


This seems to be an active research area: https://docs.pytest.org/en/latest/flaky.html
It does not seem [always] feasible to find a flaky test in one run as you suggested, because it doesn’t scale with the number of tests and their duration. Perhaps accumulating run statistics for some period of time and then analyzing them offline is more scalable (but needs more time to gather).

But I actually wonder why it’s not possible to run each test in a more isolated manner (i.e. starting from clear global state) and nuke everything after each test is done? :slight_smile: This way tests that rely on special global state should consistently fail (or consistently not fail).

Perhaps. I couldn’t find anything existing that would solve this problem, flaky included.

It does not seem [always] feasible to find a flaky test in one run as you suggested, because it doesn’t scale with the number of tests and their duration. Perhaps accumulating run statistics for some period of time and then analyzing them offline is more scalable (but needs more time to gather).

I never suggested that this could be done in one run. That would be impossible. Which part of my proposal gave you the impression that I was saying so?

But I actually wonder why it’s not possible to run each test in a more isolated manner (i.e. starting from clear global state) and nuke everything after each test is done? :slight_smile: This way tests that rely on special global state should consistently fail (or consistently not fail).

Sorry, I don’t understand your statement. If you cut off the state, how would a test fail then?

The whole point is to have state, so that you can detect problems in the software (think leaks), rather than hide them via totally independent tests - you would be shooting yourself in the foot then. Of course, this could be done at the test level, but if your tests interact and help you find such problems, that’s a gift and not a curse - provided you have the right tools to find the minimal failing sequence. If you don’t have the tools, then it can certainly be a curse.

Sorry, I used the wrong word; by “run” I meant a single execution of the tool (pytest plugin) that detects coupled tests, i.e. pytest -findcoupled testmodule1 testmodule2 testmodule3, which itself runs the tests many times. My point is that the tool’s execution time seems to grow exponentially with the number of tests, making such a tool less useful with a big number of tests, especially if the tests involve GPU usage (e.g. running AWS GPU instances).

Hmm, I see. I didn’t think of it that way. In my understanding, tests affecting each other are evil and each test should execute from a sterile (specially pre-defined) state. Otherwise, tests are not deterministic and you can’t rely on them, run them in random order, or run them in parallel (e.g. on separate machines).
For leak detection I would expect the test framework to detect those (e.g. check memory before and after each test). If the leakage occurs only in a specific combination of many operations not covered by a single test… maybe it makes sense to create a separate suite of large end-to-end integration tests and detect leakages there. Perhaps I just don’t understand the specific GPU cases you are talking about.

Thank you for clarifying what you meant, @vova.

I think what you’re trying to say is that it’d be better to have an external tool that invokes pytest as many times as needed with a sub-set of test modules or test names. I thought you could do the same from within a pytest plugin, but perhaps I’m wrong - I haven’t tried implementing it.

If it is not possible to do via a plugin, then an external tool is just as good, that’s a minor implementation detail.

The whole point is to have state, so that you can detect problems in the software (think leaks), rather than hide them via totally independent tests - you would be shooting yourself in the foot then. Of course, this could be done at the test level, but if your tests interact and help you find such problems, that’s a gift and not a curse - provided you have the right tools to find the minimal failing sequence. If you don’t have the tools, then it can certainly be a curse.

Hmm, I see. I didn’t think of it that way. In my understanding, tests affecting each other are evil and each test should execute from a sterile (specially pre-defined) state. Otherwise, tests are not deterministic and you can’t rely on them, run them in random order, or run them in parallel (e.g. on separate machines).
For leak detection I would expect the test framework to detect those (e.g. check memory before and after each test). If the leakage occurs only in a specific combination of many operations not covered by a single test… maybe it makes sense to create a separate suite of large end-to-end integration tests and detect leakages there. Perhaps I just don’t understand the specific GPU cases you are talking about.

I think we are on the same page. Each individual test should set things up at the beginning of its run and clear the slate when it’s done. But you must not fully reset the interpreter, since that’s exactly where you would miss problems such as memory leaks. So when I say “keep the state” I mean “don’t reset the interpreter” - because when you use your code base in production or research you will have several of those “tests” running together and they must not impact each other.

If you however run each test in a pristine environment, you are very likely to have a lacking test suite.

But there is more to it. For example, I have been struggling to write a memory leak test for fastai, because not only do different GPU models behave and allocate memory differently, but a single GPU behaves differently depending on the code run on it prior to the execution of the current test. The peak and usage memory allocation numbers are inconsistent. At this point pytorch doesn’t provide a way to measure the allocation programmatically, so I have to rely on pynvml, which queries the gpu directly. But that’s a different story, unrelated to the leaks at the level of fastai/pytorch.
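For reference, querying the gpu directly with pynvml looks roughly like this (a sketch assuming a single NVIDIA gpu at index 0; the actual checks fastai needs are more involved):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first gpu
info = pynvml.nvmlDeviceGetMemoryInfo(handle)  # device-wide usage, as the driver sees it
print(f"gpu mem used: {info.used // 2**20} MB of {info.total // 2**20} MB")
pynvml.nvmlShutdown()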