How to contribute to fastai [Discussion]

Ideally, this should be the responsibility of the documentation, not the forums. So if there is a set of docs that caters to different levels of users, then you don’t need to worry about the forums at all.

For example, now we have:

So if you were to ignore forums and start with docs, and only use forums when the docs don’t cover something (or you’re following lessons), then you have all your bases covered.

Of course, we need a lot more tutorials and better api docs, but the structure is there. It just needs to be filled out more.

The problem with forums is that when you have thousands of users saying anything they would like to say it’s very difficult to create something that is of value to a general user. And it’s very overwhelming when you have threads with 1000 posts - ouch!

That’s why curating content makes a huge difference to the quality of the user experience. Forums should be considered a sort of playground for discussion, and the important outcomes should be summarized and placed in intuitive places in the documentation.

If you keep that in mind and focus on better docs, then forums stop being an issue. Does that make sense?

3 Likes

I would be glad to work with you on both of these topics :slight_smile: I must admit I never really got my hands dirty in NLP until now (I’ve focused on vision so far), but I’ve been meaning to do so.

Do you find the post linked by Stas in the first message of this post to be useful, or is it still too long and intimidating? If so, do you have any suggestions?

I have a feeling that what would be helpful is to start collecting various case studies of how we have been contributing to the fastai project, in particular from the earlier days of our involvement, when we were in the same shoes as the newcomers and were often green newbies ourselves.

This is not a competition, but I suggest that readable, instructive and inspiring case studies be moved to the summary thread: How to contribute to fastai, so that users won’t need to hunt for them. Remember that the intention is not to toot your own horn, but to provide inspiration bites for others to follow, so they realize that they are much more capable of contributing than they believe themselves to be.

I will post a few case studies of my own, and please share a few of yours each. Thank you.

1 Like

The “How to contribute” wiki is great. Thanks for writing it! It’s long but the highlighted portions help navigate it.

It’s missing an easy way to see “what’s currently tbd/open”. You may want to link the Dev Projects Index post. Though I’m not sure if that’s up to date either!

Likewise with the documentation post, it’s not clear from the first post how many of those topics still need work and how many are done - I had to go through the entire post and still wasn’t sure, so I started looking at the documentation, comparing it with the codebase, and got terribly lost somewhere in that process.

Oh and yes, I’d love to collaborate - will DM.

Case study: Bringing outside domain expertise

I will share a few examples of how I was able to contribute to the fastai project at the very beginning of my involvement, when I couldn’t help directly with either code or docs, since I didn’t have any relevant expertise in either.

A magical release process

When I decided I wanted to contribute to the fastai project in the fall of 2018, I had just started learning python and knew almost nothing about ML/DL, but I did have many years of experience in other domains, and in particular full-time involvement with open source projects. So I asked what I could help with, and it was suggested that the fastai project needed a process for making releases.

I said, “great, I know a little bit of bash and make”, and so, little by little, I started creating various make targets, each doing a single task, and documenting in what order they needed to be run. Being a bit of a perfectionist, I didn’t like the process of copy-n-pasting different stages and typing extra commands, as it was very error prone and would make a release manager think hard before agreeing to make a new release. So I had to work much harder to make fool-proof make targets that depend on each other and validate each other, and after a few weeks of thinking and experimenting the magic release process was born. Now, when the release manager needs to make a new release, all they need to do is type:

make release

and the system will do many checks, install the required components, verify that everything is committed, create a release branch, update the CHANGES file, bump the version, switch back and forth between the release and master branches, commit the right things to the right branches, build the conda and pip packages, upload them to the pypi and anaconda servers, wait till the servers make the new version available, test that the new version installs correctly, and finally switch back to the master branch.

And of course, each step is documented in https://docs.fast.ai/dev/release.html

Out of the box solution to version number maintenance

While building the release process, I had to solve version number maintenance. I looked at the solutions used by other projects and they looked clunky, required complex dependencies, and overall were not satisfactory for the extremely trivial problem of moving from 1.0.42 to 1.0.43, plus supporting dev versions like 1.0.43.dev0.

I first thought that it’d be good for me to learn how to do that in python, but I quickly discarded that idea, since python is not a one-liner-friendly language and Makefiles work best with unix tools like bash, awk and sed. And since I had been programming in Perl for many years and perl is installed pretty much everywhere on unix systems, I said, hey, why not do a quick solution in perl. So I did:

perl -pi -e 's|((\d+)\.(\d+).(\d+)(\.\w+\d+)?)|$o=$1; $n=$5 ? join(".", $2, $3, $4) :join(".", $2, $3, $4+1); print STDERR "Changing version: $o => $n\n"; $n |e' version.py

plus another variation for adding .dev0, and the problem was solved. Now you just need to say make bump, and the above command does it all and even tells you what it did.

(And yes, it can be further simplified, but it’s written that way because since then I have added bump-minor, bump-major, etc. targets, so it’s easier when it’s broken down into elements - see https://docs.fast.ai/dev/release.html#version-bumping)

Our version number is just a string that we bump, but it can also be changed manually if need be. And most importantly, there is only one place where it’s set; the rest of the system takes its cue from that file, no matter in what form and shape the version is then used - and it’s used in quite a few places.

While it’s possible that down the road this solution will be replaced with something else (say, if perl gets removed from common unices), it has worked perfectly fine so far.
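
And if perl were ever dropped from the picture, a pure-Python equivalent would not be hard to write. Here is a rough sketch of the same bump logic (purely illustrative - the function name and file layout are my assumptions, and fastai actually uses the perl one-liner above), assuming version.py contains a version string like 1.0.42 or 1.0.43.dev0:

    # hypothetical Python sketch of "make bump"; fastai uses the perl one-liner above
    import re, sys

    def bump_patch(path="version.py"):
        text  = open(path).read()
        match = re.search(r'(\d+)\.(\d+)\.(\d+)(\.\w+\d+)?', text)
        major, minor, patch, dev = match.groups()
        # a dev version (e.g. 1.0.43.dev0) just loses its suffix; otherwise bump the patch level
        new = f"{major}.{minor}.{patch}" if dev else f"{major}.{minor}.{int(patch)+1}"
        print(f"Changing version: {match.group(0)} => {new}", file=sys.stderr)
        open(path, "w").write(text.replace(match.group(0), new))

    bump_patch()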

Simplifying the making of a PR branch

git is a very hard tool to figure out. One can learn to pull/commit/push relatively quickly, but anything beyond that can be quite a journey for most people new to it. So understanding git is always a big hurdle in making the PR process easier. I tend to create cheat sheet files about everything I learn, and one of these files was about making a PR, with lots of copy-n-paste instructions and comments on what they do.

Being a total git newbie, most of the time I was just copying the instructions and finding it very frustrating, especially when something wasn’t going right (e.g. when the master of my fork wasn’t synced with the main master).

So I said, let’s automate it and let a program figure out whether we need to fork and/or update the forked master, how to set up the upstream, and a whole bunch of other boring technical details that nobody really wants to think about when all they want to do is contribute a one-character typo-fix PR… or perhaps something more serious.

So I spent a lot of time reading stackoverflow posts and was able to turn my cheat sheet into a simple bash script. Now, if you want to make a PR branch, no matter how big or small it is and no matter whether you’re a git expert or someone totally new to it, you just need to type:

fastai-make-pr-branch ssh stas00 fastai fastai typo-fix

and that’s it, you now just need to apply your fix, commit and push and you’re done. It’s documented here.

Moreover, @devforfu has been working on porting this bash script to python so that windows users without bash can benefit from it too. And here you can see delegation at work - Ilia is much more experienced in python and will do a better and faster job of the porting.
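
To give a rough idea of what such a script automates, here is an illustrative Python sketch of the core flow (this is not the actual script - the helper and the exact git invocations are my simplification): clone your fork, wire up the upstream remote, sync your master with upstream, and create a branch to work on.

    # illustrative sketch only - not the real fastai-make-pr-branch implementation
    import subprocess

    def run(*cmd):
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)

    def make_pr_branch(user, org, repo, branch):
        run("git", "clone", f"git@github.com:{user}/{repo}.git", repo)  # assumes the fork exists
        run("git", "-C", repo, "remote", "add", "upstream", f"https://github.com/{org}/{repo}.git")
        run("git", "-C", repo, "fetch", "upstream")                     # sync the fork's master
        run("git", "-C", repo, "checkout", "master")
        run("git", "-C", repo, "merge", "upstream/master")
        run("git", "-C", repo, "push")
        run("git", "-C", repo, "checkout", "-b", branch)                # branch to commit your fix to

    # e.g.: make_pr_branch("stas00", "fastai", "fastai", "typo-fix")

The real script does quite a bit more - it figures out whether a fork is needed, updates the forked master, and handles the other details mentioned above - which is exactly why having it automated is so valuable.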

Also, sharing all kinds of related notes often helps. This file was primarily my cheat sheet: https://docs.fast.ai/dev/git.html, and then someone said, hey dude, you should share it, so I did. So if you have things you find useful for yourself, share them, and you’ll be surprised how many other people benefit from them.

Conclusion

Often, when you are just starting with a new project in a new language or domain, you most likely still have good know-how from your previous projects that could be of great use. Just think outside the box and remember that it’s ok if your solution is not perfect, doesn’t follow the accepted norms, or is in the wrong language - that’s a start; you can always improve things afterwards. The important thing is to get things done, so that you feel good and you helped someone save time.

7 Likes

Case study: Helping out Sylvain with PRs and Issues

I’ll be honest with you: as of today, watching Sylvain’s bug-fixing commits, most of the time I have no idea what he’s doing. It’s all voodoo stuff which one day I hope to be able to follow and code myself.

But I thought, perhaps there are some things that he does in his handling of PRs and Issues that I could do instead of him to free up his time, so that he has more time to create an even more amazing fastai library.

It didn’t take long to find one. Perhaps you don’t know this, but all contributions to the fastai project require contributors to sign a Contributor License Agreement, so, often, a new contributor, unaware of this requirement, submits a PR which we can’t accept until the CLA is signed. When that happens, a failed CLA check usually shows up at the bottom of the PR page.

And then Sylvain would comment:

Please sign this CLA agreement https://www.clahub.com/agreements/fastai/fastai as explained here before we can proceed. Thank you.

And I said, I can do that too! And so can you. So unless someone beats me to it, I just do it. It doesn’t save Sylvain a great amount of time, but it all adds up, and it also speeds up the submission-to-merge process, since the contributor might take hours and sometimes days to comply (as they often have to ask permission from their employers).

Of course, typing that reply every time would be a waste of time, so github has reply templates. Unfortunately, they have to be configured manually by each github user, and there is no way to have them pre-set per project. If you’d like to use our templates, it’s all documented here: https://github.com/fastai/fastai/blob/master/.github/issue_reply_templates.md

And once you have configured them, here is how you use them. You just click in the upper right corner of the reply box and you get a dropdown with the pre-made replies. Then just pick the one you need, hit [Comment], and you’re done.

Even a total beginner in fastai can do that.

Among the reply templates I linked to above, you will find a few other templates that you can also use to help the maintainers:

  • We don’t deal with install issues in the github Issues, but have dedicated forum threads for that (most issues have already been resolved and discussed in those threads, so the solution is most likely already there). So when someone posts an install question, we just reply with one of the following two replies and close the Issue.

    • fastai: install issues [v1]

      fastai 1.0.x installation issues should be reported/discussed here instead. Thank you.

    • fastai: install issues [v0]

      fastai 0.7.x installation issues should be reported/discussed here instead. Thank you.

  • And then we have PRs from contributors who either haven’t read the steps for setting up the fastai repo or forgot to do them, so when that happens the CI will report a [fastai.fastai (nbstripout_config)] failure. In which case I reply with this template (and so can you):

    • fastai: unstripped notebook

      If your PR involves jupyter notebooks ( .ipynb ) you must instrument your git to nbstripout the notebooks, as explained here. PRs with unstripped out notebooks cannot be accepted.

So here you go, you now have at least 3 ways you can help the maintainers, PR contributors and Issue submitters w/o knowing much about fastai.

If you observe maintainers at work, you will notice other little things that you could help with. Just watch the process and see if you can save them time by taking over some activities that you understand and feel comfortable running with. Do not be afraid to make a mistake; it’ll all get sorted out if a mistake happens.

BTW, to save yourself time and avoid clicking around github a lot, you might want to sign up for email notifications for PRs and Issues in the github fastai projects, so that you get notified when new entries are submitted; you can also see previews of PRs/Issues in the notification emails. We also have a commit diff mailing list if you prefer to watch diff emails instead of using github: https://docs.fast.ai/dev/develop.html#full-diffs-mailing-list

3 Likes

Case study: Writing a new unit test and a doc entry for image resize functionality

Recently, I was doing some training setup that involved variable-sized images and got stuck because it wasn’t working. I was only able to find examples here and there, and even the forums weren’t helpful. Since I needed this problem solved, I decided to first write a few simple tests so that I could report the bug and have it resolved. I had submitted a similar bug report earlier, but @sgugger couldn’t find what the problem was without me giving him some reproducible code to work with, which I initially failed to provide.

Part 1: Writing the test

footnote: In case you don’t know, in the fastai test suite we use small subsets of real datasets, so that the test execution completes within seconds and not hours. These are the datasets that have _TINY in their name, so as of this writing in fastai/datasets.py you will find: COCO_TINY, MNIST_TINY, MNIST_VAR_SIZE_TINY, PLANET_TINY, CAMVID_TINY - these are the ones you want to use for testing.

Apparently everything worked just fine as long as transforms were involved, but without transforms it’d just break. And I still wasn’t very clear on why some examples used the data block API whereas others used factory methods, even though they worked with the same dataset. It was quite confusing.

So I started with a simple test running on a fixed-size dataset that I knew would work, since I pretty much copied an existing working test, and added some extra verifications that weren’t there originally.

    from fastai.vision import *
    path = untar_data(URLs.MNIST_TINY) # 28x28 images
    fnames = get_files(path/'train', recurse=True)
    pat = r'/([^/]+)\/\d+.png$'
    size = 14
    data = ImageDataBunch.from_name_re(path, fnames, pat, size=size)

    x,_ = data.train_ds[0]
    size_want = (size, size) if isinstance(size, int) else size # x.size is a (h, w) tuple
    size_real = x.size
    assert size_want == size_real, f"size mismatch after resize {size}: expected {size_want}, got {size_real}"

and it worked.

In this test, I set up the data object just like it’s done in the first lessons of the fastai course, and then I take the first object of the train dataset and check that it indeed got resized. I hope you’re with me so far.

The assert does the checking, and the last part of the assert is set up to give me meaningful debug information in case of failure. You will see later how that becomes useful.

So this was my baseline and then I could start doing experiments with it by changing things around.

Next, I pretty much did the same thing, but with a variable image size dataset:

    path = untar_data(URLs.MNIST_VAR_SIZE_TINY)

and it worked too.

Then I replaced the factory method from_name_re:

    data = ImageDataBunch.from_name_re(path, fnames, pat, size=size)

with the data block API:

    data = (ImageItemList.from_folder(path)
            .no_split()
            .label_from_folder()
            .transform(size=size)
            .databunch(bs=2)
            )

and it worked with the fixed images dataset, but it failed with the variable size images dataset.

So I submitted a bug report, and someone else did a similar one with a great test case that reproduced the problem. Meanwhile, I decided to expand the test to cover all the various sizes - int, square and non-square tuples - as well as resize methods and types of datasets. First I did it separately for each way of doing it, and then started to slowly refactor to avoid duplicated code. (Duplicated code often leads to bugs.)

After many iterations (many of which were just broken), the many tests morphed into a complete unit test that covers 18 different configuration permutations, each exercised in both possible ways of performing a resize - (1) with the factory method and (2) with the data block API. Here it is:

# this is a segment of tests/test_vision_data.py
from fastai.vision import *
from utils.text import *

rms = ['PAD', 'CROP', 'SQUISH']

def check_resized(data, size, args):
    x,_ = data.train_ds[0]
    size_want = (size, size) if isinstance(size, int) else size
    size_real = x.size
    assert size_want == size_real, f"[{args}]: size mismatch after resize {size} expected {size_want}, got {size_real}"

def test_image_resize(path, path_var_size):
    # in this test the 2 datasets are:
    # (1) 28x28,
    # (2) var-size but larger than 28x28,
    # and the resizes are always less than 28x28, so it always tests a real resize
    for p in [path, path_var_size]: # identical + var sized inputs
        fnames = get_files(p/'train', recurse=True)
        pat = r'/([^/]+)\/\d+.png$'
        for size in [14, (14,14), (14,20)]:
            for rm_name in rms:
                rm = getattr(ResizeMethod, rm_name)
                args = f"path={p}, size={size}, resize_method={rm_name}"

                # resize the factory method way
                with CaptureStderr() as cs:
                    data = ImageDataBunch.from_name_re(p, fnames, pat, ds_tfms=None, size=size, resize_method=rm)
                assert len(cs.err)==0, f"[{args}]: got collate_fn warning {cs.err}"
                check_resized(data, size, args)

                # resize the data block way
                with CaptureStderr() as cs:
                    data = (ImageItemList.from_folder(p)
                            .no_split()
                            .label_from_folder()
                            .transform(size=size, resize_method=rm)
                            .databunch(bs=2)
                            )
                assert len(cs.err)==0, f"[{args}]: got collate_fn warning {cs.err}"
                check_resized(data, size, args)

It may look complicated, but it’s very very simple - it does exactly the same simple things I described at the beginning of this post, just tests them in 18 different ways, via 3 loops! Remember, it was written in stages and slowly improved upon.

The only new thing that I haven’t covered so far is the CaptureStderr context manager, which we have in our test utils. It helps us test whether fastai emitted any warnings, which most of the time indicates that there is a problem waiting to happen. Therefore, the test needs to make sure our data is set up correctly and doesn’t emit any warnings (this check is done by a function called sanity_check()). So this line validates that nothing was sent to stderr:

  assert len(cs.err)==0, f"[{args}]: got collate_fn warning {cs.err}"

You can do the same using pytest’s capsys fixture, but ours works better here because it’s a context manager: it’s more of a “scalpel”, whereas capsys is a bit of a “hammer”, when it comes to localizing the stderr capturing.
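
In case you’re curious how such a context manager can be built, here is a minimal sketch along those lines (the real CaptureStderr in the fastai test utils differs in its details):

    # minimal sketch of a stderr-capturing context manager;
    # the real CaptureStderr in fastai's test utils differs in the details
    import io, sys

    class CaptureStderrSketch:
        def __enter__(self):
            self._saved, self._buf = sys.stderr, io.StringIO()
            sys.stderr = self._buf      # redirect stderr only inside the `with` block
            return self
        def __exit__(self, *exc):
            sys.stderr = self._saved    # restore as soon as the block ends
            self.err = self._buf.getvalue()

    with CaptureStderrSketch() as cs:
        print("warning!", file=sys.stderr)
    assert "warning" in cs.err

Because the capturing starts and stops with the with block, only the statements inside it are inspected - which is what makes it a “scalpel”.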

The test I submitted also included:

@pytest.mark.skip(reason="needs fixing")
def test_image_resize(path, path_var_size):
...

because it was failing. That way we know that this test needs fixing, and it doesn’t affect our CI (Continuous Integration) checks.

The next morning, Sylvain fixed the bug, removed the test skip directive and voila - we now have the resize without transforms covered 100% and it will never break in future released versions, because the test suite will defend against it.

You can see this test as it was submitted here.

If you want to run this test, you’d just do:

pytest -sv -k test_image_resize tests/test_vision_data.py

And in case you didn’t know - we have a testing guide, which is full of useful notes.

Part 2: Writing the resize documentation

Now I will tell you a big secret. The main reason I write documentation is self-serving. I’m a lazy person and I don’t like figuring things out all the time. I enjoy the process of figuring something out once, but repeating the same figuring out is just exhausting. Therefore I tend to write down everything I think I might use again in the future. That’s why I write a lot of docs. I am happy to share them with anybody who wants them, but their main use is for myself. Making them public also ensures that if I lose my copy, I can restore it later from the public one.

The same goes for tests. I write tests so that I don’t need to figure out why my code stopped working when a new release of the fastai library broke previously working functionality. By writing tests I ensure future peace of mind for myself. Others benefiting from them is a nice side effect. And my ego is happy!

So back to our case study: now that I had written this test, I knew everything I needed to know about the resize functionality in the fastai library (the user side of it). And since there is so much complex, ever-changing tech stuff I need to cope with on a daily basis, I know I will forget this hard-earned knowledge, so I decided it would pay off to invest a bit more time to write a summary of what I had learned.

And so I did, and now there is a new entry that documents all the possible ways you could resize images in fastai: https://docs.fast.ai/vision.transform.html#resize

It’s literally the same as the test that I wrote, except it’s done in words and organized for an easier understanding.

Then I realized that resizing images on the fly is very inefficient if you have a lot of them and they are large. Therefore I expanded that section to explain how to resize images before doing the training, stealing one example from Jeremy’s class notebook so there was an example in python, and sharing the command line way using the imagemagick method I normally use. (And that part of the doc entry could use more examples and more ways of doing it, including the pros and cons of those different ways. Hint, hint.)
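
For instance, here is a rough sketch of the offline-resizing idea in Python with PIL (the paths and target size here are made up for illustration - see the doc entry above for the actual recipes):

    # rough sketch: resize a dataset once, before training, instead of on the fly;
    # paths and target size are illustrative only
    from pathlib import Path
    from PIL import Image

    src, dst, size = Path("data/train"), Path("data/train_sm"), 224

    for img_path in src.rglob("*.jpg"):
        out_path = dst/img_path.relative_to(src)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        img = Image.open(img_path)
        img.thumbnail((size, size))   # resizes in place, preserving the aspect ratio
        img.save(out_path)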

Conclusion

So you can now tell that, most likely, before you write documentation you need to understand how the API or the use case you are about to document works, and since the only way to really understand something is by using it, you will have to write some code. And if you’re going to write some code anyway, why not write a unit test for the fastai library?

If it’s a use case involving some specific actions, then a small or large tutorial on the steps to reproduce your process is called for.

5 Likes

Both of these top posts are up-to-date, i.e. they get updated as things get completed or new needs arise.

That is the idea: you don’t need to read the thread, just the top post. The thread is long because it includes the discussions on all those topics, which are then summarized in the top post.

Perhaps we should apply this split method to all threads: one thread with just the up-to-date summary and another for discussion, because most users don’t realize that in those few threads the first post is special and is not just a starter for the 800-post thread.

Great! I’ll use them accordingly going forward.

Is there a way to pin a post to the top of a thread? I know if the first post is a wiki, we can update it with relevant content, but it might be useful to, for example, pin your case studies to the top of this thread, so the first five posts always contain all the information anyone needs. The upvoting-to-summarize route doesn’t work unless there are enough people liking posts.

I think I’ve posted enough by now and it’s your turn, so that next we can put all the case studies into the summary thread. I won’t put mine there until you post yours! :slight_smile:

I don’t think having one person post all the case studies would make for an inspiring case, since a reader can always find an excuse that they are not wired that way and there is no way they could do something similar - I know it can be intimidating. I know I’m special. And each one of you is too. But not all of you choose to believe this is the case.

This is the intention behind sharing these stories: to show others that each person has something great to contribute. Someone writes notes, someone helps new users, another person loves trying outrageous things, yet another has 20 GPU cards and wants to finish training cats vs. dogs in 5 msecs, etc. Which one are you?

Just discovered this inspiring sharing by Sylvain:
https://www.fast.ai/2019/01/02/one-year-of-deep-learning/

2 Likes

I had to research this, I’m new to discourse. There might be better ways.

From reading through the discourse forums, it looks like there are two ways to re-order posts:

  1. move all the posts you don’t want at the top to a new temporary topic, which will push the remaining posts up, and then move the moved posts back. And it says you have to add a new post first… very hackish but doable.

  2. some kind of posts:reorder_posts rake task https://meta.discourse.org/t/a-way-to-reorder-posts/31532/14 - I have no idea what it means.

FYI Added to low hanging fruit:

If you have any questions or issues please post them in the corresponding thread: Documentation improvements. Thank you.

1 Like

A post was merged into an existing topic: Documentation improvements

Case Study: helping with Test coverage

I’m writing up a little case study following @stas’ example to motivate others to contribute, especially by helping with tests.

My background: I currently spend around half of my time coding for a living, and I consider myself a good, not fantastic, coder. I started DL a year ago and my python knowledge was mostly non-existent, even though I know all the concepts from my education and other languages. Python is a lot of fun for me - so many cool libraries, and that alone is already a gain for me. I am also co-organising a fast.ai meetup in The Hague, Netherlands.

I chose testing for a few reasons. One, we get this MOOC for free - crazy, right? So why not try to give back a little. Of course, I also do it for myself. It’s cool to have commits in a project like fast.ai, and testing is a safe bet for finding easy tasks that are useful for the project. I consider myself nowhere near being able to make a substantial PR, though maybe I could fix some bugs meanwhile and write tests for them. Tests are just useful for projects to guarantee stable code; they can sometimes be a little drag, but they give you a good learning curve, and hopefully one day you can get to other areas.

There are a couple of gotchas when you contribute:

  • Contributions might require more time than you expected. So if you work full time and have a private life, you might be a little under pressure for a while. It’s not nice starting a project with others and then leaving forum posts unanswered, so if you start, try to pull through. And be realistic about what you can do and be open about it.

  • It is also easy to come up with all kinds of great ideas for features while not having finished the simpler tasks, nor having thought through whether such PRs are possible and whether you will find the time to do them. So it’s probably good to stick to one really simple task and take it step by step.

  • Then, this is meant to be a contribution, not a training camp for you. While of course you will and should learn, I am personally quite concerned not to cause maintainers like Stas and Sylvain more work than the value they get back. So I do my best, and in the future will probably be even more careful about which tasks I contribute to. I need to be confident I can deliver reasonably fast and smoothly.

Overall, this is fantastic stuff and here are the benefits for you.

  • For example, I wrote a little test for fit and fit_one_cycle. It helps me really understand what’s going on now - and I am now the guy who tests this cool stuff from Sylvain and Leslie? Cool … put it on my CV :slight_smile: Isn’t that much cooler than getting another DL cert from some MOOC?

  • I also learned quite a bit about how to automate the whole git process with scripts and python, something I can directly apply in my professional life.

  • Stas and Sylvain are really great coders, and Jeremy gave the code a lot of thought and has all kinds of ideas about why, e.g., functions are written in certain ways. You can just learn from the masters (I get a lot out of listening to Stas’ reviews!) and hopefully be useful to a great project. Standards are high and you can grow with them.

Go for it - I’m actually sure most people reading this have more time and python skills than I have, so it should be totally possible to contribute!

And here are two concrete ways to contribute:

  • You can find ways to contribute tests here: https://forums.fast.ai/t/improving-expanding-functional-tests. Just pick an easy part; there are easy tests, and some that might be more difficult.
  • Meanwhile, we have a little spin-off to automate docs based on tests: the Doc_test project. For that part, soon we will need to go through all the tests and register them correctly with a function called this_tests. The idea is that we show the tests to the notebook user and in the automated docs. As part of this spin-off, just pick a test class and place the this_tests call in there, when we are ready.
6 Likes

Great write up, @Benudek. Thank you for sharing your story and insights.

May I just suggest that the intention behind inviting the stories is not to motivate anybody. We are not a corporation that needs to motivate its employees. I believe people who come to this thread are already motivated to give back after receiving so much. They just don’t know how and/or are uncertain that they can. Therefore, I feel, the problem is not a lack of motivation, but a need to demonstrate that contributing can be done on different levels and by users with a very wide range of skills and experience, leading to a wide range of contributions, from submitting a broken link or a typo fix to complicated algorithm PRs - all of them equally important, because it’s the whole that makes a great product, not just its parts.

And to clarify, I’m not asking you to edit your story, I just want to stress why we are inviting users to share these stories.

4 Likes

Here is an inspiring personal sharing on figuring out how to contribute to an open-source project by Vishwak Srinivasan of the pytorch dev community.

4 Likes

A post was merged into an existing topic: Fastai-nbstripout: stripping notebook outputs and metadata for git storage

Hi Stas,

I’ve been watching the Deep Learning course 2019 and was inspired by the example Jeremy gave of the gentleman who contributed with code. Especially because I’ve been working on something one of the students asked for, which was to automatically find the learning rate. I developed an AutoML hyperparameter algo that has been beating random search and skopt’s Bayesian Optimization. I’m using Pattern Search, which is MATLAB’s method for finding global optima (my code is in Python 3.6 and I have adapted it to work in the SKLearn framework, so you can call the code just like you call RandomizedSearchCV). Is there anyone I could show the notebook to, to see if it is something interesting to incorporate into Fast.ai?

Best regards,

Rodrigo.