Developer chat

I know that the library part for the bounding boxes is still in development, but I was playing around today with the ObjectDetectDataset dataclass and wanted to add a dataset with the label ‘background’ (i.e., no bounding box).

In this case a second ‘background’ label gets added on top of the ‘background’ label already present in the dataset, in the init of the ObjectDetectDataset dataclass (data.py, line 160). In order to have just one background label, I adapted this line to check whether a ‘background’ label is already present in the data (while preserving entry 0 for ‘background’):

self.classes = ['background'] + list(self.classes) if 'background' not in self.classes else ['background'] + list(self.classes - {'background'})

(Removing the ‘background’ entry from the classes set was the best way to still have ‘background’ as label number 0. If that is not necessary, the else part can be changed to list(self.classes).)
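
To illustrate the effect (just a sketch, not the actual data.py code; the example class set below is made up):

# Example label set as it might come from the annotations, already containing ‘background’.
classes = {'person', 'car', 'background'}
# Proposed logic: make sure 'background' appears exactly once, at index 0.
classes = ['background'] + list(classes - {'background'})
assert classes[0] == 'background' and classes.count('background') == 1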

If this makes sense and is useful I am happy to make a PR.

Best regards
Michael

We use this at work for versioning: https://msztolcman.github.io/versionner/. The usage is easy (ver up --level, e.g. ver up --patch) and it uses semver. I would like to get more accustomed to the library, but have no idea how things work yet, so maybe you need tests for some areas that would be writable for someone new to the library?

I still need a sanity check on the above (I removed “delete the branch” from the quote, as it’s not relevant to this discussion).

Thank you.

I was assuming we’d tag the release branch, not master.

Ah, yes, then there is no conflict.

and I have just tested that a tag on a branch still shows up in the releases.

So all is good. Let’s follow this path, then.

Thank you for clarifying, Jeremy.

Just pushed a change to normalize_funcs that delays moving the mean/std to the gpu until you call it. It doesn’t seem to slow it down. The benefits are that it avoids any cuda stuff happening at import time (which some folks found really slow), and it should work better with multi-gpu (not tested).
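
The gist is something like this (a rough sketch of the pattern, not the actual fastai code; it assumes NCHW image batches):

import torch

def normalize_funcs(mean, std):
    # Keep mean/std as plain CPU tensors here; nothing touches CUDA at import time.
    mean, std = torch.tensor(mean), torch.tensor(std)
    def norm(x):
        # Move the stats to whatever device the batch is on, only when actually called.
        m = mean.to(x.device)[None, :, None, None]
        s = std.to(x.device)[None, :, None, None]
        return (x - m) / s
    return norm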

We can now have a Table of Contents auto-generated in select .md files; I added it to 3 long files.

e.g. see: release.md

But if you want to add one to other .md files, just add:

<!--ts-->
<!--te-->

into the .md file and run:

tools/gh-md-toc-update-all

Sorry, it won’t work on Windows without bash, since the script I found is written in bash.

Should I add one to README.md? I wasn’t sure, as there aren’t many entries. Or to any other files?

The script relies on the GitHub API, so without auth you only get 60 queries an hour…

Thanks for the TOC. I don’t think README needs one, personally.

Testing, testing, testing!

PRs like this are music to my ears.

Ideally we should require/encourage tests whenever someone submits either a bug report or a PR that’s fixing something. I guess it’ll take a while before we have enough good tests to model from, so for now it’s the brave souls like Fred who just figure it out.

Two questions wrt the new fastai test suite:

  1. Test suite datasets.

    a. It’d be foolish to use full datasets in the test suite. It’ll take forever to run and nobody will be running it.

    So how do we go about designing custom datasets that are compact, yet big enough to validate a given feature?

    b. They most likely should reside in the test suite folder and not be downloaded (extra time, and a potential problem if someone is offline and wants to run the test suite).

  2. We need a protocol for how to merge these test PRs.

    a. We can’t merge them until the bug is fixed, since they’ll break the CI build. So either we need a separate branch that such PRs get merged into, though that will probably make things too complicated; or perhaps we just ack the submitted PR after validating it separately, and then merge it once the bug is fixed.

    b. Except by that time the dev in charge of the issue will probably have built upon the initial test (improved/expanded it), so the initial PR will no longer be relevant. How do we credit the PR submitter for their hard work?

Thoughts?

In this particular case, the build fails for another reason (the ‘en’ language model is not installed, which means python -m spacy download en wasn’t executed), as the bug was corrected yesterday.

I don’t think there’s an easy answer, other than just “carefully”. :slight_smile: I’ve already created an MNIST subset that works well for integration testing on a GPU (<5 secs for an epoch, which is enough for us to test probably most parts of the library, although I haven’t added many of those tests yet). I’m sure we can do something even smaller for CPU users. The approach I suggest is that we have just one integration test per application (image classification, text classification, etc.): the setup for that test runs an epoch with lots of callbacks included, and then the various tests check different parts of that (e.g. learning rates, accuracy, etc.).
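
A minimal sketch of that shape (plain pytest, no fastai API; the fixture contents are just a stand-in for “run one epoch with callbacks”):

import pytest

@pytest.fixture(scope="module")
def training_run():
    # One expensive setup per application, shared by several cheap checks below.
    # Stand-in for: train one epoch on the small MNIST subset with callbacks
    # attached, then return whatever the callbacks recorded.
    return {"accuracy": 0.95, "lrs": [0.01, 0.009, 0.008]}

def test_accuracy(training_run):
    assert training_run["accuracy"] > 0.9

def test_lrs_recorded(training_run):
    assert len(training_run["lrs"]) > 0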

We should mark them xfail, I guess.

https://docs.pytest.org/en/latest/skipping.html
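
For example, something along these lines (a minimal sketch; the function and reason string are made up):

import pytest

def buggy_add(a, b):
    # Stand-in for the behaviour a bug report is about; deliberately wrong
    # to mimic the still-open bug.
    return a - b

# The test is merged with an xfail mark so CI stays green; the mark is removed
# once the bug is actually fixed.
@pytest.mark.xfail(reason="demonstrates a reported bug; remove once fixed")
def test_buggy_add():
    assert buggy_add(2, 3) == 5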

Are you referring to this one?
https://github.com/fastai/fastai_docs/blob/master/data/mnist_sample.tgz

If so, and we are already including it in the test suite, should we put it under $(fastai_repo)/data/? Or should it rather go under tests/data/?

That’s for sure, but I meant we won’t be able to do the merge via the GitHub interface, as it will require an extra post-merge commit on behalf of the person doing the merge to add xfail, so there will always be a CI build failure in between. It’s not the end of the world, but it’s just not ideal.

The main issue with letting the CI build fail, even temporarily, is that any PRs submitted during that period will be marked as failed too, which would be very misleading to both the contributor and the reviewer. That’s why I’m raising this concern.

So a manual merge, which includes an extra change to add xfail, will probably be required for such PRs.

The test will download the dataset automatically the first time, if it’s not already there. I think that’s the best approach for a larger dataset like this, which won’t be part of CI.
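
Roughly this pattern (a sketch only; the URL and paths below are placeholders, not the library’s actual download helpers):

from pathlib import Path
import urllib.request

DATA_URL = "https://example.com/mnist_sample.tgz"  # placeholder URL
DEST = Path("data") / "mnist_sample.tgz"           # placeholder location

def ensure_dataset():
    # Only hit the network on the first run; later runs work offline.
    if not DEST.exists():
        DEST.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(DATA_URL, DEST)
    return DEST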

IMHO, ‘make test’ should require no Internet access. I know this is unheard of for most people nowadays, but when traveling it’s very easy to find oneself without internet. At least in the places I tend to go, no internet is normal.

I’m talking about the small datasets that can be made part of the repo.

I don’t expect ‘make test’ to run the gpu integration tests by default. Also, note that you only have to run it once and then you don’t need internet any more.

I don’t think the MNIST dataset is quite in that category - we need much smaller ones which run quickly without a GPU.

Perhaps we are talking about different datasets? This one is:

3.1M mnist_sample.tgz

You think this one still won’t be suitable for CPU (i.e. too slow)?

Sorry I didn’t realize it’s so small! But didn’t you say the vision test took 30 mins on your CPU?

That’s correct. I have a 5-year-old i7-4770K @ 4300MHz, so perhaps it’s not a good indicator. I will be replacing it with a modern one soon.

But that’s a good start and we can refactor down the road.

I was mainly concerned with the size of the dataset, so that it won’t blow up the repo size more than it already is.

Yeah I still don’t think that particular dataset should be added to the repo.

BTW, on CI builds the whole current test suite takes about 2-5min, so its CPU is at least 6-10 times faster than mine. I’m not sure what virtual hardware we get allocated for each host.