How to contribute to fastai [Discussion]

Case study: Helping out Sylvain with PRs and Issues

I’ll be honest with you: as of today, watching Sylvain’s bug-fixing commits, most of the time I have no idea what he’s doing. It’s all voodoo stuff which one day I hope to be able to follow and code myself.

But I thought that perhaps there are some things he does in his handling of PRs and Issues that I could do in his place to free up his time, so that he has more time to create an even more amazing fastai library.

It didn’t take long to find one. Perhaps you don’t know this, but all contributions to the fastai projects require contributors to sign a Contributor License Agreement, so often a new contributor, unaware of this requirement, submits a PR which we can’t accept until the CLA is signed. Usually you’d see the following at the bottom of the PR page:

And then Sylvain would comment:

Please sign this CLA agreement https://www.clahub.com/agreements/fastai/fastai as explained here before we can proceed. Thank you.

And I said: I can do that too! And so can you. So unless someone beats me to it, I just do it. It doesn’t save Sylvain a great amount of time, but it all adds up, and it also speeds up the submission-to-merge process, since the contributor might take hours and sometimes days to comply (often they have to ask permission from their employers).

Of course, typing that reply every time would be a waste of time, so GitHub has reply templates. Unfortunately, they have to be configured manually by each GitHub user and there is no way to pre-set them per project. If you’d like to use our templates, it’s all documented here: https://github.com/fastai/fastai/blob/master/.github/issue_reply_templates.md

And once you’ve configured them, here is how you use them. You just click in the upper right corner of the reply box and you get a dropdown with the pre-made replies. Then just pick the one you need, hit [Comment] and you’re done:

Even a total beginner in fastai can do that.

Among the reply templates I linked to above, you will find a few other templates that you can also use to help the maintainers:

  • We don’t deal with install issues in the GitHub Issues, but have dedicated forum threads for that (most issues have already been resolved and discussed in those threads, so the solution is most likely already there). So when someone posts an install question, we just reply with one of the following two replies and close the Issue.

    • fastai: install issues [v1]

      fastai 1.0.x installation issues should be reported/discussed here instead. Thank you.

    • fastai: install issues [v0]

      fastai 0.7.x installation issues should be reported/discussed here instead. Thank you.

  • And then we have PRs from contributors who either haven’t read the steps for setting up the fastai repo or forgot to do them, so when that happens the CI will report a [fastai.fastai (nbstripout_config)] failure. In that case I reply with this template (and so can you):

    • fastai: unstripped notebook

      If your PR involves jupyter notebooks ( .ipynb ) you must instrument your git to nbstripout the notebooks, as explained here. PRs with unstripped out notebooks cannot be accepted.

So there you go: you now have at least 3 ways to help the maintainers, PR contributors and Issue submitters without knowing much about fastai.

If you observe the maintainers at work you will notice other little things that you could help with. Just watch the process and see if you can save them time by taking over some activities that you understand and feel comfortable running with. Don’t be afraid to make a mistake; it’ll all get sorted out if one happens.

BTW, to save yourself time and not need to click around GitHub a lot, you might want to sign up for email notifications for PRs and Issues in the GitHub fastai projects, so that you get notified when new entries are submitted and can also see previews of PRs/Issues in the notification emails. We also have a commit-diff mailing list if you prefer to watch diff emails instead of using GitHub: https://docs.fast.ai/dev/develop.html#full-diffs-mailing-list


Case study: Writing a new unit test and a doc entry for image resize functionality

Recently, I was doing some training setup that involved variable-sized images and got stuck with it not working. I was only able to find examples here and there, and even the forums weren’t helpful. Since I needed this problem solved, I decided to first write a few simple tests so that I could report the bug and have it resolved. I had submitted a similar bug report earlier, but @sgugger couldn’t find what the problem was without me giving him some reproducible code to work with, which I initially failed to provide.

Part 1: Writing the test

footnote: In case you don’t know, in the fastai test suite we use small subsets of real datasets, so that the test execution completes within seconds and not hours. These are the datasets that have _TINY in their name, so as of this writing in fastai/datasets.py you will find: COCO_TINY, MNIST_TINY, MNIST_VAR_SIZE_TINY, PLANET_TINY, CAMVID_TINY - these are the ones you want to use for testing.

Apparently everything worked just fine as long as transforms were involved, but without transforms it’d just break. And I still wasn’t very clear on why some examples used the data block API while others used factory methods, even though they worked with the same dataset. It was quite confusing.

So I started with a simple test running with a fixed-size dataset that I knew would work, since I pretty much copied an existing working test and added some extra verifications that weren’t there originally.

    from fastai.vision import *

    path = untar_data(URLs.MNIST_TINY)  # 28x28 images
    fnames = get_files(path/'train', recurse=True)
    pat = r'/([^/]+)\/\d+.png$'         # the label is the parent folder name
    size = 14
    data = ImageDataBunch.from_name_re(path, fnames, pat, size=size)

    # grab the first item of the train dataset and check that it got resized
    x,_ = data.train_ds[0]
    size_want = size
    size_real = x.size
    assert size_want == size_real, f"size mismatch after resize {size} expected {size_want}, got {size_real}"

and it worked.

In this test, I set up the data object just like it’s done in the first lessons of the fastai course, and then I take the first object of the train dataset and check that it indeed got resized. I hope you’re with me so far.

The assert does the checking, and the last part of the assert is set up to give me meaningful debug information in case of failure. You will see later how it becomes useful.

So this was my baseline, and then I could start experimenting with it by changing things around.

Next, I pretty much did the same thing, but with a variable image size dataset:

    path = untar_data(URLs.MNIST_VAR_SIZE_TINY)

and it worked too.

Then I replaced the factory method from_name_re:

    data = ImageDataBunch.from_name_re(path, fnames, pat, size=size)

with the data block API:

    data = (ImageItemList.from_folder(path)
            .no_split()
            .label_from_folder()
            .transform(size=size)
            .databunch(bs=2)
            )

and it worked with the fixed images dataset, but it failed with the variable size images dataset.

So I submitted a bug report, and someone else submitted a similar one with a great test case that reproduced the problem. Meanwhile, I decided to expand the test to cover all the various sizes - int, square and non-square tuples - as well as the resize methods and the types of datasets. First I did it separately for each way of doing it, and then started to slowly refactor to avoid duplicated code (duplicated code often leads to bugs).

After many iterations (many of which were just broken), those many tests morphed into a complete unit test that covers 18 different configuration permutations (2 datasets × 3 sizes × 3 resize methods) and exercises both possible ways of performing a resize: (1) the factory method and (2) the data block API. Here it is:

# this is a segment of tests/test_vision_data.py
from fastai.vision import *
from utils.text import *

rms = ['PAD', 'CROP', 'SQUISH']

def check_resized(data, size, args):
    x,_ = data.train_ds[0]
    size_want = (size, size) if isinstance(size, int) else size
    size_real = x.size
    assert size_want == size_real, f"[{args}]: size mismatch after resize {size} expected {size_want}, got {size_real}"

def test_image_resize(path, path_var_size):
    # in this test the 2 datasets are:
    # (1) 28x28,
    # (2) var-size but larger than 28x28,
    # and the resizes are always less than 28x28, so it always tests a real resize
    for p in [path, path_var_size]: # identical + var sized inputs
        fnames = get_files(p/'train', recurse=True)
        pat = r'/([^/]+)\/\d+.png$'
        for size in [14, (14,14), (14,20)]:
            for rm_name in rms:
                rm = getattr(ResizeMethod, rm_name)
                args = f"path={p}, size={size}, resize_method={rm_name}"

                # resize the factory method way
                with CaptureStderr() as cs:
                    data = ImageDataBunch.from_name_re(p, fnames, pat, ds_tfms=None, size=size, resize_method=rm)
                assert len(cs.err)==0, f"[{args}]: got collate_fn warning {cs.err}"
                check_resized(data, size, args)

                # resize the data block way
                with CaptureStderr() as cs:
                    data = (ImageItemList.from_folder(p)
                            .no_split()
                            .label_from_folder()
                            .transform(size=size, resize_method=rm)
                            .databunch(bs=2)
                            )
                assert len(cs.err)==0, f"[{args}]: got collate_fn warning {cs.err}"
                check_resized(data, size, args)

It may look complicated, but it’s very, very simple - it does exactly the same simple things I described at the beginning of this post, just tested in 18 different ways via 3 loops! Remember, it was written in stages and slowly improved upon.

The only new thing I haven’t covered so far is the CaptureStderr context manager that we have in our test utils. It helps us check whether fastai emitted any warnings, which most of the time indicates a problem waiting to happen. fastai performs this check on the data in a function called sanity_check(), and its warnings go to stderr, so the test needs to make sure our data is set up correctly and doesn’t emit any warnings. This assert validates that nothing was sent to stderr:

  assert len(cs.err)==0, f"[{args}]: got collate_fn warning {cs.err}"

You can do the same using pytest’s capsys fixture, but ours is a better fit here because it’s a context manager, and as such it’s more of a “scalpel”, whereas capsys is a bit of a “hammer” when it comes to localizing the stderr capture.
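In case you’re curious what such a context manager looks like, here is a minimal sketch of the idea. This is not the actual CaptureStderr from utils/text.py; the class name and the details are just illustrative:

    import io, sys

    class CaptureStderrSketch:
        "Capture anything written to sys.stderr inside the `with` block (illustrative sketch only)."
        def __enter__(self):
            self._old, self._buf = sys.stderr, io.StringIO()
            sys.stderr = self._buf          # writes now land in the buffer
            return self
        def __exit__(self, *exc):
            sys.stderr = self._old          # always restore the real stderr
            self.err = self._buf.getvalue() # whatever was captured, if anything
            return False                    # don't swallow exceptions

    # usage: only the code inside the `with` block is monitored - that's the "scalpel"
    with CaptureStderrSketch() as cs:
        print("some warning", file=sys.stderr)
    assert "warning" in cs.err

Because the capturing starts and stops exactly at the with block’s boundaries, you can pinpoint which call produced the warning, which is harder to do with a function-wide capture.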

The test I submitted also included:

@pytest.mark.skip(reason="needs fixing")
def test_image_resize(path, path_var_size):
...

because it was failing. That way we know that this test needs fixing, and it doesn’t affect our CI (Continuous Integration) checks.

The next morning, Sylvain fixed the bug, removed the skip directive, and voila - we now have resize without transforms covered 100%, and it will never break in future released versions, because the test suite will defend against it.

You can see this test as it was submitted here.

If you want to run this test, you’d just do:

pytest -sv -k test_image_resize tests/test_vision_data.py

And in case you didn’t know - we have a testing guide, which is full of useful notes.

Part 2: Writing the resize documentation

Now I will tell you a big secret. The main reason I write documentation is self-serving. I’m a lazy person and I don’t like figuring things out all the time. I enjoy the process of figuring something out once, but repeating the same figuring out is just exhausting. Therefore I tend to write down everything I think I might use again in the future. That’s why I write a lot of docs. I’m happy to share them with anybody who wants them, but their main use is for myself. Making them public also ensures that if I lose my copy, I can restore it later from the public one.

The same goes for tests: I write tests so that I don’t need to figure out why my code stopped working when a new release of the fastai library broke previously working functionality. By writing tests I ensure future peace of mind for myself. Others benefiting from them is a nice side effect. And my ego is happy!

So back to our case study: now that I had written this test, I knew everything I needed to know about the resize functionality in the fastai library (the user side of it). And since there is so much complex, ever-changing tech stuff I need to cope with on a daily basis, I know I will forget this hard-earned knowledge, so I decided it would pay off to invest a bit more time in writing a summary of what I had learned.

And so I did, and now there is a new entry that documents all the possible ways you could resize images in fastai: https://docs.fast.ai/vision.transform.html#resize

It’s literally the same as the test I wrote, except it’s done in words and organized for easier understanding.

Then I realized that resizing images on the fly is very inefficient if you have a lot of them and they are large. Therefore I expanded that section to explain how to resize images before doing the training, stealing one example from Jeremy’s class notebook so there was a Python example, and sharing the command-line imagemagick method I normally use. (That part of the doc entry could use more examples and more ways of doing it, including the pros and cons of the different approaches. Hint, hint.)
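For example, the Python route could look something like the sketch below. This is my own illustration using PIL, not the example from Jeremy’s notebook, and the folder names and target size are made up:

    from pathlib import Path
    from PIL import Image

    src, dest, size = Path('data/train'), Path('data/train_sm'), (224, 224)
    dest.mkdir(parents=True, exist_ok=True)

    # resize every image once, up front, so the dataloader doesn't redo it on every epoch
    for fn in src.glob('*.jpg'):
        img = Image.open(fn).resize(size, resample=Image.BILINEAR)
        img.save(dest/fn.name, quality=90)

In practice you may want to preserve the aspect ratio rather than force a square, or use the imagemagick command-line route, which is usually faster for large datasets.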

Conclusion

So, as you can probably tell by now, before you write documentation you need to understand how the API or the use case you are about to document works, and since the only way to really understand something is by using it, you will have to write some code. And if you’re going to write some code anyway, why not write a unit test for the fastai library?

If it’s a use case involving some specific actions, then a small or large tutorial covering the steps to repeat your process is called for.


Both of these top posts are up-to-date, i.e. they get updated as things get completed or new needs arise.

The idea is that you don’t need to read the thread, just the top post. The thread is long because it includes the discussions of all those topics that are then summarized in the top post.

Perhaps we should apply this split method to all threads: one thread with just the up-to-date summary and another for the discussion, because most users don’t realize that in those few threads the first post is special and is not just a starter for an 800-post thread.

Great! I’ll use them accordingly going forward.

Is there a way to pin a post to the top of a thread? I know that if the first post is a wiki we can update it with relevant content, but it might be useful, for example, to pin your case studies to the top of this thread, so the first five posts always contain all the information anyone needs. The upvoting-to-summarize route doesn’t work unless enough people like posts.

I think I posted enough by now and it’s your turn, so that next we can put all the case studies into the summary thread. I won’t put mine there until you post yours! :slight_smile:

I don’t think having one person posting all the case studies would make for an inspiring case, since a reader can always find an excuse that they are not wired that way and there is no way they can do something similar - I know it can be intimidating. I know I’m special. And each one of you is too. But not all of you choose to believe that this is the case.

This is the intention behind these sharings: to show others that each person has something great to contribute. Someone writes notes, someone helps new users, another person loves trying outrageous things, yet another has 20 GPU cards and wants to finish training cats vs. dogs in 5 msecs, etc. Which one is you?

Just discovered this inspiring sharing by Sylvain:
https://www.fast.ai/2019/01/02/one-year-of-deep-learning/


I had to research this, I’m new to discourse. There might be better ways.

From reading through the Discourse forums, it looks like there are two ways to re-order posts:

  1. move all the posts you don’t want at the top to a new temporary topic, which will push the remaining posts up, and then move the moved posts back. And it says you have to add a new post first… very hackish, but doable.

  2. some kind of posts:reorder_posts rake task https://meta.discourse.org/t/a-way-to-reorder-posts/31532/14 - I have no idea what it means.

FYI, added to low-hanging fruit:

If you have any questions or issues, please post them in the corresponding thread: Documentation improvements. Thank you.



Case study: Helping with test coverage

I’m writing up a little case study, following @stas’ example, to motivate others to contribute, especially by helping with tests.

My background: I currently spend around half of my time coding for a living, and I consider myself a good, not fantastic, coder. I started DL a year ago and my Python knowledge was mostly non-existent, even though I knew all the concepts from my education and other languages. Python is a lot of fun for me - so many cool libraries, and that alone is already a gain for me. I am also co-organising a fast.ai meetup in The Hague, Netherlands.

I chose testing for a few reasons. One, we get this MOOC for free - crazy, right? So why not try to give back a little. Of course, I also do it for myself. It’s cool to have commits in a project like fast.ai, and testing is a safe bet for finding easy tasks that are useful to the project. I consider myself nowhere near able to make a substantial PR, though in the meantime I might fix some bugs and write tests for them. Tests are simply useful for projects to guarantee stable code; they can sometimes be a bit of a drag, but they give you a good learning curve and hopefully one day you can move on to other areas.

There are a couple of gotchas when you contribute:

  • Contributions might require more time than you expected. So if you work full time and have a private life, you might feel a little under pressure for a while. It’s not nice to start a project with others and then leave forum posts unanswered, so if you start, try to pull through. And be realistic about what you can do and be open about it.

  • It is also easy to come up with all kinds of great feature ideas while not having finished the simpler tasks, nor having thought through whether such PRs are possible and whether you will find the time to do them. So it’s probably good to stick to one really simple task and take it step by step.

  • Finally, this is meant to be a contribution, not a training camp for you. While of course you will and should learn, I am personally quite concerned not to cause maintainers like Stas and Sylvain more work than the value they get back. So I do my best, and in the future will probably be even more careful about which tasks I contribute to. I need to be confident I can deliver reasonably fast and smoothly.

Overall, this is fantastic stuff, and here are the benefits for you:

  • For example, I wrote a little test for fit and fit_one_cycle. It really helps me understand what’s going on now - and I’m now the guy who tests this cool stuff from Sylvain and Leslie? Cool… put it on my CV :slight_smile: Isn’t that much cooler than getting another DL cert from some MOOC?

  • I also learnt quite a bit about automating the whole git process with scripts and Python, something I can directly apply in my professional life.

  • Stas and Sylvain are really great coders, and Jeremy gave the code a lot of thought and has all kinds of ideas about why, e.g., functions are written in certain ways. You can learn from the masters (I get a lot out of listening to Stas’s reviews!) and hopefully be useful to a great project. Standards are high and you can grow with them.

Go for it - I’m actually sure most people reading this have more time and Python skills than I have, so it should be totally possible to contribute!

And here are 2 concrete ways to contribute:

  • You can find ways to contribute tests here: https://forums.fast.ai/t/improving-expanding-functional-tests. Just pick an easy part; some tests are easy, some might be more difficult.
  • Meanwhile, we have a little spin-off to automate docs based on tests: the Doc_test project. For that part, we will soon need to go through all the tests and register them correctly with a function called this_tests. The idea is that we show the tests to the notebook user and in the automated docs. Just pick a test class and place the this_tests call in there when we are ready (see the sketch below for a rough idea). As part of this spin-off, we need to register the existing tests with this_tests to integrate them into the documentation.
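To give a rough idea of what that registration might look like, here is a hypothetical sketch (the import path, the this_tests signature and the test body are my assumptions - check the Doc_test thread and the testing guide for the real details):

    # hypothetical sketch of registering which API a test exercises
    from fastai.gen_doc.doctest import this_tests  # import path is an assumption

    def test_fit(learn):
        this_tests(learn.fit)  # declare that this test covers Learner.fit
        learn.fit(1)
        assert learn.recorder.losses[-1] < learn.recorder.losses[0]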

Great write up, @Benudek. Thank you for sharing your story and insights.

May I just suggest that the intention behind inviting these stories is not to motivate anybody. We are not a corporation that needs to motivate its employees. I believe people who come to this thread are already motivated to give back after receiving so much; they just don’t know how, and/or are uncertain that they can. Therefore, I feel the problem is not a lack of motivation, but a need to demonstrate that contributing can be done at different levels and by users with a very wide range of skills and experience, leading to a wide range of contributions - from submitting a broken link or a typo fix to a complicated algorithmic PR - all of them equally important, because it’s the whole that makes a great product, not just its parts.

And to clarify, I’m not asking you to edit your story, just wanting to stress why we are inviting users to share these stories.


Here is an inspiring personal sharing on figuring out how to contribute to an open-source project by Vishwak Srinivasan of the pytorch dev community.



Hi Stas,

I’ve been watching the Deep Learning course 2019 and was inspired by the example Jeremy gave of the gentleman who contributed code. Especially because I’ve been working on something one of the students asked for, which was to automatically find the learning rate. I developed an AutoML hyperparameter algorithm that has been beating random search and skopt’s Bayesian Optimization. I’m using Pattern Search, which is MATLAB’s method for finding global optima (my code is in Python 3.6 and I have adapted it to work in the SKLearn framework, so you can call the code just like you call RandomizedSearchCV). Is there anyone I could show the notebook to, to see if it is something interesting to incorporate into Fast.ai?

Best regards,

Rodrigo.

I am a PhD student working on remote monitoring of aquaculture sea cages, and I want to use the fastai library to count fish in a cage. Can you guide me, please?

I can’t ask a question on the forums. Any idea?

@stas I would like to contribute to test cases. Can you please let me know the location of the fastai v2 test cases for test_core.py? Thanks.

Sorry, I’m no longer involved in fastai. Please ask someone else, thank you.

Hi all, any advice on where to find the test case files to contribute to in fastai 2? Thanks.

Hello everyone,
My goal is to submit a jupyter notebook I made that breaks down the steps for obtaining the keys for the Bing Search API. I am trying to follow the contributing guidelines, but I am having trouble installing the git hooks with: nbdev_install_git_hooks

One method I have tried is through a command line (I am a noob):

I found “pull requests made easy”, which brought me to downloading GitHub Desktop. I know enough to not want to PR to master:

However, I feel that I should be executing the jupyter notebook from “pull requests made easy”, but I sadly can’t figure out how to download it. I was able to get as far as installing the bash kernel:

I am honestly not sure what direction to head in. Any advice is greatly appreciated.