Developer chat

Another thing we need to revisit right away is randomness in tests. I don’t think this is going to fly.

@sgugger, one of the tests broke during a CI run because accuracy fluctuates from run to run:

tests/test_vision.py F [100%]

=================================== FAILURES ===================================
___________________________ vision::mnist end to end ___________________________

    def test_mnist_end_to_end():
        untar_data(MNIST_PATH)
        data = image_data_from_folder(MNIST_PATH, ds_tfms=(rand_pad(2, 28), []))
        learn = ConvLearner(data, tvm.resnet18, metrics=accuracy)
        learn.fit(1)
>       assert accuracy(*learn.get_preds()) > 0.98
E       assert tensor(0.9799) > 0.98
E        +  where tensor(0.9799) = accuracy(*[tensor([[ 2.0705, -1.8396],

but it’d be very silly to try to chase that lowest possible accuracy, and it’d be a mistake too, since something could change in the code that actually makes things worse and we wouldn’t know it.

So it’s very likely that we need to run all tests with a fixed seed. And once set, it should never be changed, since the test writers will rely on those numbers.

It’s probably best to set it globally in conftest.py and make it known that tests aren’t random, with a huge warning not to change the seed value. I propose 42 as a well-rounded, though unoriginal, seed value :wink:
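
Something like this in conftest.py is what I have in mind - just a sketch, and the autouse fixture approach and which libraries to seed are open to discussion:

# conftest.py - sketch of a globally fixed seed (do not change the value once set!)
import random

import numpy as np
import pytest
import torch

SEED = 42

@pytest.fixture(autouse=True)
def fixed_seed():
    # re-seed before every test so that test order doesn't affect the numbers
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(SEED)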

Hmm, how do you create a random seed then, if you need to do so after you’ve fixed the seed? I guess through a sub-process - or is there a magical way to produce a non-deterministic random number after the seed is fixed in python/numpy/torch?
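
I suppose random.SystemRandom / os.urandom would still do the trick, since they read from the OS entropy pool and ignore the seeded generators, e.g.:

import os, random

random.seed(42)                                       # the fixed global seed
fresh = random.SystemRandom().randint(0, 2**32 - 1)   # still non-deterministic
# or equivalently:
fresh = int.from_bytes(os.urandom(4), 'little')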

[sidenote: In fact, would it be a good idea to run a min < result < max test, so that when an optimization happens it breaks the test for a good reason? I guess that is not the greatest idea - I just want to make sure we detect both degradations and improvements]
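
For concreteness, the kind of bounded check I mean (both bounds here are made up):

acc = accuracy(*learn.get_preds())
assert 0.975 < acc < 0.995  # lower bound catches regressions, upper bound flags unexpected jumps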

And discussion aside, tests/test_vision.py needs attention.

I’m not sure you can think of these integration tests in quite the same way as traditional tests. They are random, and even fixing a random seed doesn’t really help (other things will change over time that will cause the results to change).

test_vision and similar end to end integration tests should not be part of CI, or part of make test. Figuring out how to best handle this testing is going to be a long term project - I’m not sure I’ve seen anyone do it really well yet.

Pushed a change that redefines tensor() as follows:

def tensor(x:Any)->Tensor:
    return torch.tensor(x) if is_listy(x) else as_tensor(x)

This should avoid unnecessary clone()s. Let me know if it breaks anything for anyone!
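
For context, the difference between the two calls in stock PyTorch (nothing fastai-specific here):

import numpy as np
import torch

a = np.ones(3)
t_copy  = torch.tensor(a)     # torch.tensor() always copies the data
t_share = torch.as_tensor(a)  # torch.as_tensor() shares memory with `a` when dtypes allow
a[0] = 5
print(t_copy[0].item(), t_share[0].item())  # 1.0 5.0 -> only the as_tensor result sees the change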

There’s now a new pytest marker: slow. Use it like this (e.g.):

@pytest.mark.slow
class TestVisionEndToEnd():

Tests marked in that way won’t be run on make test or by CI. To run them, simply say:

pytest --runslow

This marker is designed for end to end training integration tests that probably want a GPU, may need internet access (at least the first time) to download data, and may need manual inspection of failures.
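
The wiring follows the example in the pytest docs; roughly, something like this is in conftest.py (a sketch - the exact code may differ slightly):

# skip @pytest.mark.slow tests unless --runslow is given
import pytest

def pytest_addoption(parser):
    parser.addoption("--runslow", action="store_true", default=False,
                     help="run slow end-to-end training tests")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # --runslow given: run everything
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)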


In this commit I added two functions to the Recorder:

  • add_metrics to add new metric values to be printed.
  • add_met_names to add their names.

In a Callback that runs before the Recorder, we can then pass new metrics. Examples of uses here: printing the two parts that form a loss (useful for SSD, maybe GANs next) or computing a metric that isn’t a simple average over batches.
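
A rough usage sketch (the callback name, the metric, and the exact event signatures here are simplified and partly made up):

from fastai.callback import Callback   # import path assumed

class ExtraMetrics(Callback):
    "Sketch only: compute an extra value ourselves and hand it to the Recorder for printing."
    def __init__(self, learn):
        self.learn = learn
    def on_train_begin(self, **kwargs):
        self.learn.recorder.add_met_names(['my_metric'])   # header of the new column
    def on_epoch_end(self, **kwargs):
        val = compute_my_metric(self.learn)                # hypothetical helper
        self.learn.recorder.add_metrics([val])             # value gets printed with the other metrics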

Be careful: on_batch_begin and on_batch_end are now called during validation as well, and there is a new parameter train that tells you if we’re in training or eval mode. I adjusted all the existing callbacks accordingly, but that doesn’t cover anything you may have done locally.


Well, first, the CI builds just happen to run a lot more tests than we developers do (8 make test runs at the moment for each code commit!). But the CI build’s make test shouldn’t be any different from a developer’s make test.

This is a great discussion we’ve started.

We need to have two different sets of test suites.

  1. The functional tests, which are deterministic in nature. These can’t have a random element, because then it’s not guaranteed that the same code paths will be exercised on every run - that’s not reliable.

  2. The other tests, which are non-deterministic in nature and which I’m not yet sure how to characterize.

I’m not sure I understand how you can write a test that is not really a test and whose output should be interpreted by the user. That’s not a test as it’s defined in software engineering, but something else. I guess you call these integration tests - should that be the name in the fastai domain? So functional tests and integration tests?

In any case, the functional tests too must include complete end-to-end runs, from creating a dataset to analyzing the results of training, except that they have to be deterministic. So the two groups would probably contain relatively similar tests, except that the functional ones will have a fixed RNG seed and be less flexible, while the integration tests will be more like the examples, but then their outputs must not be hidden, as they are in functional test suites, and instead be displayed to the user.

Does what I wrote make sense? Please correct me if I’m wrong - I’m well versed in functional testing, but the other set is completely new to me. So any RTFM would be welcome, if there is an FM to read on that :wink:

Hi,

Is anyone else getting nan loss results running the examples/dogs_cats.ipynb notebook?

After learn.fit_one_cycle(1), printing the loss for each batch:

3.33% [12/360 00:07<03:43 nan]
(tensor(1.1369),) 
(tensor(0.8947),)
(tensor(nan),) 
(tensor(nan),) 
(tensor(nan),)

We need a better name for it. (1) slow doesn’t say anything about the nature of the test, and (2) it’s very confusing to start with - is it for slow machines, or is the test slow, or should it be run slowly? :wink: I’m sure for those who use it all the time it’ll be loud and clear what it is, but for others it won’t be.

Earlier we discussed that we need two sets of tests to support the no-GPU setups. The suggestion was:

/tests/light/ - CPU-friendly tests
/tests/heavy/

The one we are discussing right now is neither light nor heavy/slow; it’s something else.

So at the moment I see we need 3 groups of tests:

/tests/light/ - CPU-friendly tests
/tests/heavy/ - normal tests
/tests/random/ or something

Jeremy, I think what would help me is for you to briefly outline this other type of testing, ideally with a simple example. Otherwise I’m really guessing what you might be thinking. Thanks.

The ‘slow’ name comes from the pytest docs, so I just used it. I don’t feel strongly about what name is used.

I really only want two sets: normal fast automated tests that run fine on CPU, are part of CI, and are run with make test; and slow, possibly-random ML model integration tests that probably want a GPU. An example of the latter is test_vision (which we will greatly expand, adding tests of the various callbacks etc.).

I know the latter option is not what is generally seen in regular software eng. I’m not aware of anyone doing a good job of this kind of testing at the moment, so I’m not sure we’ll find a good role model or best practices. I think we need to create them.

OK, thank you for clarifying the intention, Jeremy. Let’s develop that concept then.

So how do we programmatically test something that is non-deterministic in nature? How can we test accuracy, for example, if it’s unpredictable, without setting the constraint margin so wide that the test can no longer tell us when a change in the code base has degraded the outcome?

I’ve found that setting a fixed RNG seed while developing a model is very helpful, since I can tell the impact of each change I make. Leaving the randomness on was often very misleading. Would you say that this is a good method or not?

And if you’re too busy right now with other obligations, that’s no problem - it can wait. Just say so if that’s the case and we will revisit it later.

I think it’d be much simpler to group those different tests in dedicated folders, rather than needing any markers. Why repeat something that can be avoided by just grouping all tests of the same kind in the same folder?

Are you open to experimenting with that suggestion and dropping the slow marker?

make test will simply change to run py.test tests/light/

Using a mark is what the pytest docs recommend. It means that just running pytest in the root of the repo works for everyone as you’d expect. It also makes it easy to have some regular tests and some slow tests within a single file, which is very commonly what I’ll want to do. So this seems like a good approach.


I think we set it to something which passes most of the time, and rely on human judgement, at least for now. It’s not a programmatic test, it’s a human-interpretable test.

OK, I guess it’s simple then - once you write some of those tests I’ll be able to see what you mean.

Until then I will stick to functional tests.

Added a simple developer docs site:

http://docs-dev.fast.ai/

No new info there - just the existing markdown docs in a little Jekyll template. It’s rendered automatically from whatever is in the docs/ folder in the fastai repo.


Following the bug fix a few weeks back, where the fastai core library suddenly became dependent on jupyter_contrib_nbextensions, I wrote this test to catch such dependencies in the future:

# tests/test_jupyter_independency.py
# this test checks that the core fastai library doesn't depend on unnecessary modules.

import sys

# packages that shouldn't be required to be installed for fastai to work:
jupyter_deps = ['jupyter_contrib_nbextensions']

class JupyterFreeImporter(object):

    def find_spec(self, fullname, path, target=None):
        # catch if any of the undesirable dependencies gets triggered
        for mod in jupyter_deps:
            if mod == fullname:
                print(f"detected undesirable dependency on {mod}")
                sys.exit(1)
        return None

sys.meta_path.insert(0, JupyterFreeImporter())

def test_undesirable_dependencies():
    # unload any candidates we want to test, including fastai and its
    # submodules, so that the import below actually re-executes them
    for mod in jupyter_deps + ['fastai']:
        for loaded in [m for m in sys.modules if m == mod or m.startswith(mod + '.')]:
            del sys.modules[loaded]
    # test
    import fastai
    #print(f"got: {sys.modules['fastai']}")

2 questions:

  1. Which modules should I add to test against, besides jupyter_contrib_nbextensions?
  2. What fastai core modules should it test, besides just loading fastai? Or is that sufficient?

Thanks.

p.s. I guess I should rename the class/test to something more generic, since perhaps there are other non-jupyter modules that we won’t want fastai to depend on.


A few useful little updates:

  • Added a new DataBatch.normalize(stats) method to add a normalize transform to any of the DataLoaders in the DataBunch. If you don’t pass stats, it’ll calculate them from one random batch. stats is in the same format that normalize_funcs() takes (a usage sketch follows after this list).
  • Added a bn param to simple_cnn to add batchnorm.
  • Changed default_device and default_cpus to default.device and default.cpus, since they weren’t actually working before if you set them.
  • Added a bunch of tests, so hopefully most of the things you might want to do now have examples; for instance, here are the tests of DataBatch.normalize: https://github.com/fastai/fastai/blob/master/tests/test_vision.py
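
A usage sketch for normalize, assuming the names mentioned above (the stats values here are made up):

data = image_data_from_folder(MNIST_PATH)   # any call that produces a DataBunch
data.normalize()                            # no stats passed: computed from one random batch
# or pass stats explicitly, in the format normalize_funcs() takes:
data.normalize(([0.5], [0.5]))              # made-up (mean, std)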

Update: if you install pytest-random-order you can shuffle the test files randomly, but at the moment we want a deterministic test module order.

For more information see this.

Since pytest-random-order gets activated automatically upon install (which is very odd), there is no pytest option needed to turn it on. The side-effect is that any test suite anywhere in your python environment will now be randomized (at the file/module level). That may or may not be a good thing.

The random test order already uncovered a bug in the test I added this morning: I wasn’t aware that pytest doesn’t reset the interpreter for each test, so we can’t get sloppy.

So we should probably keep it uninstalled most of the time. Perhaps we can find another python package that doesn’t impose itself like that.

In the above, @stas meant fixtures, not tests, are in random order. But we’re going to revert that change.

I edited the post to reflect the state of randomness.

While looking for a friendlier randomizer I found this package: https://github.com/ftobia/pytest-ordering - so if we have test modules that we want to run in a particular order, this plugin might be useful. I haven’t checked its quality, but I’m posting it here in case we need it later.