How to contribute to fastai [Discussion]

This topic is for discussing How to contribute to fastai.

We felt that the main how-to thread should have almost no posts, so that it won’t look intimidating: some people look at the number of posts and skip long topics. So there will be only a few posts there, but you can have as many posts as desired in this topic.

Your questions and answers will then make their way back into the post above as FAQ entries or notes.

Do note that this post and the summary post are both wikis, which means you can all edit them to improve their quality!

and thank you!

11 Likes

I started working on the PR submission guidelines here. If you have other suggestions, especially based on your experience of submitting PRs, please contribute: https://github.com/fastai/fastai/blob/master/CONTRIBUTING.md#pr-submission-guidelines

You can just make comments/suggestions here and I will add them to the document.
Thank you.

2 Likes

I don’t know if my inputs help at this point, but Pierre’s post resonated, so I’m sharing my experience for what it’s worth. For the past month or so I have repeatedly visited the dev projects pages intending to get involved, spent time exploring some of the tasks listed, gotten intimidated by the long posts and the razor speed at which threads evolve out here, and gone away feeling overwhelmed at not making any progress.

It seems like there is a certain threshold past which you’re in: you have the necessary background and skills to contribute, and the momentum and time required to do so.

If there is a way for beginners to get involved in part-time mode, I’d love to jump in.

I’ve specifically looked at (but not yet contributed to):

  1. documentation updates, specifically for basic_data and datasets
  2. doing some sort of extraction tool (perhaps even an NLP model) to suck out the hidden gems of information in the forum, inspired by this post by @stas.

PS: I work well in teams, struggle to go it alone. So if anyone wants to pair up, and is ok with a part-timer, please ping me! Thanks!

Meanwhile, I will continue to look out for small things I can do. Maybe the low-hanging fruit.

2 Likes

that sounds like me :wink: It’s totally possible - just be careful, the Python stuff is fun, so you might stop watching netflix and need to discipline yourself to not drop your day job :slight_smile:

1 Like

I moved your post to this topic where it fits the best, hope you don’t mind :wink:

welcome, and I trust you will find a way to contribute and help others to do so, @nbharatula.

It seems like there is a certain threshold past which you’re in: you have the necessary background and skills to contribute, and the momentum and time required to do so.

this is what this topic is about - we are sorting it out and your help is crucial.

  1. doing some sort of extraction tool (perhaps even an NLP model) to suck out the hidden gems of information in the forum, inspired by this post by @stas.

thank you for reminding me about this one - it’s an excellent one - it’s already on the list at How to contribute to fastai wiki.

PS: I work well in teams, struggle to go it alone. So if anyone wants to pair up, and is ok with a part-timer, please ping me!

excellent call!

You can also use https://forums.fast.ai/c/fastai-users/dev-projects to self-organize. Just like folks organize themselves into study groups, you could do the same for contributing groups.

If you need any tools, sub-forums or similar please ask.

1 Like

Here are some more items, of simple to medium complexity, to consider:

  • Another way to create documentation contributions is to first compile them on the forums via collaboration in a wiki post. Once you as a group are happy with the outcome, so that it flows well, is more or less readable, and is to the best of your knowledge correct, only then does one of you submit a PR. And we would be happy to give credit to all who participated in the commit comment/CHANGES - you just need to provide such a list.

  • Yet another way is to read the docs, find something that’s vague/unclear/not easily understood, ask about it on the forums, hopefully get an answer, and then, with the new understanding, send a PR improving that doc.

  • Also, many docs are dry and lack examples, especially in the API sections. So once you understand a topic, consider contributing an example or a little tutorial.

  • Watch issue reports on github and the discussion around them, and document what’s being said and suggested. For example, very often people submit an Issue only to receive “not a bug, you should be doing this instead”. That usually means that either the user didn’t read the doc, couldn’t find the doc, or the doc is unclear or doesn’t exist. This is your low-hanging fruit for contributing back: put yourself in the shoes of the person who filed the issue, and now that you know the answer, find a way to document it in the right intuitive place so that future users won’t encounter that issue.

  • Often adding cross-links between different parts of the documentation is very helpful, e.g. “for details on how to do foo, see bar”.

3 Likes

Thanks for moving me over. This seems the right thread for my musings! :slight_smile:

Maybe I’m old school, but I feel the forum isn’t organised very well. By that I don’t mean the topics/categories (though that too), but mostly its limited ability to break up information based on your stage of learning. You have to parse a lot before you can determine what is useful and applicable to you. And as a newcomer you don’t even have the skills to determine what is useful or applicable, so you’re easily lost.

That’s why I think an NLP model to parse the entire forum and categorise all threads into beginner, better, best, bored levels would be super useful! I just don’t know how to go about doing this!

2 Likes

Ideally, this should be the responsibility of the documentation, and not the forums. So if there is a set of docs that caters to different levels of users, then you don’t need to worry about the forums at all.

For example, we now have docs structured for different levels: tutorials to get you started, and API docs for the details.

So if you were to ignore forums and start with docs, and only use forums when the docs don’t cover something (or you’re following lessons), then you have all your bases covered.

Of course, we need a lot more tutorials and better api docs, but the structure is there. It just needs to be filled out more.

The problem with forums is that when you have thousands of users saying anything they would like to say, it’s very difficult to create something that is of value to a general user. And it’s very overwhelming when you have threads with 1000 posts - ouch!

That’s why curating content makes a huge difference to the quality of the user experience. Forums should be considered a sort of discussion playground, and the important outcomes should be summarized and placed in intuitive places in the documentation.

If you keep that in mind and focus on better docs, then forums stop being an issue. Does it make sense?

3 Likes

I would be glad to work with you on both of these topics :slight_smile: I must admit I never really got my hands dirty with NLP until now (I have focused on vision so far), but I’ve been meaning to do so.

Do you find the post linked by Stas in the first message of this topic to be useful, or is it still too long and intimidating? If so, do you have any suggestions?

I have a feeling it would be helpful to start collecting various case studies of how we have been contributing to the fastai project, in particular from the earlier days of our involvement, when we were in the same shoes as the newcomers and were often green newbies.

This is not a competition, but I suggest that readable, instructive, and inspiring case studies be moved to the summary thread, How to contribute to fastai, so that users won’t need to hunt for them. Remember that the intention is not to toot your own horn, but to provide bites of inspiration for others to follow, helping them realize that they are much more capable of contributing than they believe themselves to be.

I will post a few case studies of my own, and please share a few of yours each. Thank you.

1 Like

The “How to contribute” wiki is great. Thanks for writing it! It’s long but the highlighted portions help navigate it.

It’s missing an easy way to see “what’s currently tbd/open”. You may want to link the Dev Projects Index post. Though I’m not sure if that’s up to date either!

Likewise with the documentation post: it’s not clear from the first post how many of those topics still need work and how many are done. I had to go through the entire post and still wasn’t sure, so I started looking at the documentation and comparing it with the codebase, and got terribly lost somewhere in that process.

Oh and yes, I’d love to collaborate - will DM.

Case study: Bringing outside domain expertise

I will share a few examples of how I was able to contribute to the fastai project at the very beginning of my involvement, when I couldn’t help directly with either code or docs, since I didn’t have any relevant expertise in either.

A magical release process

When I decided I wanted to contribute to the fastai project in the fall of 2018, I had just started learning Python and knew almost nothing about ML/DL, but I did have many years of experience in other domains, and in particular full-time involvement with open source projects. So I asked what I could help with, and it was suggested that the fastai project needed a process for making releases.

I said, “great, I know a little bit of bash and make”, and so, slowly, I started creating various make targets, each doing a single task, and documenting in what order they needed to be run. Being a bit of a perfectionist, I didn’t like the process of copy-n-pasting different stages and typing extra commands, as it was very error-prone and would make a release manager think hard before agreeing to make a new release. So I had to work much harder to make fool-proof make targets that depend on each other and validate each other, and after a few weeks of thinking and experimenting, the magic release process was born. Now, when the release manager needs to make a new release, all they need to do is type:

make release

and the system will do many checks, install required components, test that everything is committed, create a release branch, update the CHANGES file, bump the version, switch back and forth between the release and master branches, commit the right things to the right branches, build conda and pip packages, upload them to the pypi and anaconda servers, wait till the servers make the new version available, test that the new version installs, and finally switch back to the master branch.
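
To give a flavor of the automation, here is a rough Python sketch of the fail-fast idea behind it. The step commands are illustrative placeholders, not the actual fastai make targets:

    import subprocess

    # each step is a shell command; check=True aborts the whole release on the
    # first failure, so nothing half-released ever reaches the servers
    STEPS = [
        "git diff --quiet",                   # verify everything is committed
        "git checkout -b release-1.0.x",      # create the release branch
        "make bump",                          # bump the version
        "python setup.py sdist bdist_wheel",  # build the packages
        "twine upload dist/*",                # upload them to pypi
    ]

    for cmd in STEPS:
        print(f"==> {cmd}")
        subprocess.run(cmd, shell=True, check=True)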

And of course, each step is documented in https://docs.fast.ai/dev/release.html

Out of the box solution to version number maintenance

While building the release process, I had to solve version number maintenance. I looked at the solutions used by other projects, and they looked clunky, required complex dependencies, and were overall not satisfactory for the extremely trivial problem of moving from 1.0.42 to 1.0.43, plus supporting dev versions like 1.0.43.dev0.

I first thought that it’d be good for me to learn how to do that in Python, but I quickly discarded that idea, since Python is not a one-liner-friendly language and a Makefile works best with unix tools like bash, awk, and sed. And since I had been programming in Perl for many years, and perl is installed pretty much everywhere on unix systems, I said, hey, why not do a quick solution in perl. So I did:

perl -pi -e 's|((\d+)\.(\d+)\.(\d+)(\.\w+\d+)?)|$o=$1; $n=$5 ? join(".", $2, $3, $4) :join(".", $2, $3, $4+1); print STDERR "Changing version: $o => $n\n"; $n |e' version.py

plus another variation for adding .dev0, and the problem was solved. Now you just need to say make bump, and the above command does it all and even tells you what it did.

(And yes, it can be further simplified, but it’s written that way because since then I have added bump-minor, bump-major, etc. targets, and it’s easier when it’s broken down into elements - see https://docs.fast.ai/dev/release.html#version-bumping)
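
For the curious, here is roughly what the same micro-version bump could look like in Python. This is a hypothetical sketch mirroring the perl one-liner above, not the actual fastai tooling:

    import re
    from pathlib import Path

    def bump_micro(path="version.py"):
        p = Path(path)

        def repl(m):
            major, minor, micro, dev = m.groups()
            # 1.0.43.dev0 -> 1.0.43, otherwise 1.0.42 -> 1.0.43
            new = f"{major}.{minor}.{micro}" if dev else f"{major}.{minor}.{int(micro)+1}"
            print(f"Changing version: {m.group(0)} => {new}")
            return new

        p.write_text(re.sub(r"(\d+)\.(\d+)\.(\d+)(\.\w+\d+)?", repl, p.read_text(), count=1))

    bump_micro()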

Our version number is just a string that we bump, but it can also be changed manually if need be. Most importantly, there is only one place where it is set; the rest of the system takes its cue from that file, no matter in what form and shape the version is then used, and it’s used in quite a few places.

While it’s possible that down the road this solution will be replaced with something else (say, if perl is removed from common unices), it has worked perfectly fine so far.

Simplifying PR branch making

git is a very hard tool to figure out. One can learn to pull/commit/push relatively quickly, but anything beyond that can be quite a journey for most people new to it, so understanding git is always a big hurdle in making the PR process easier. I tend to create cheat-sheet files about everything I learn, and one of these files was about making a PR, with lots of copy-n-paste instructions and comments on what they do.

Being a total git newbie, most of the time I was just copying the instructions, and I found it very frustrating, especially when something wasn’t going right (e.g. when my forked master wasn’t synced with the main master).

So I said, let’s automate it and let a program figure out whether we need to fork and/or update the forked master, how to set up the upstream, and a whole bunch of other boring technical details that nobody really wants to think about when all they want is to contribute a one-character typo-fixing PR… or perhaps something more serious.

So I spent a lot of time reading stackoverflow posts and was able to turn my cheat sheet into a simple bash script. Now, if you want to make a PR branch, no matter how big or small the change is, and no matter whether you’re a git expert or totally new to it, you just need to type:

fastai-make-pr-branch ssh stas00 fastai fastai typo-fix

and that’s it: now you just need to apply your fix, commit and push, and you’re done. It’s documented here.
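
If you’re curious what such a script has to juggle, here is a rough, simplified Python sketch of the kind of steps it automates. This is hypothetical code, not the actual script; the real one also handles forking and many more edge cases:

    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    def make_pr_branch(upstream_url, branch):
        # make sure the 'upstream' remote points at the original repo
        remotes = subprocess.run(["git", "remote"], check=True,
                                 capture_output=True, text=True).stdout.split()
        if "upstream" not in remotes:
            run("git", "remote", "add", "upstream", upstream_url)
        # sync the fork's master with upstream before branching off it
        run("git", "fetch", "upstream")
        run("git", "checkout", "master")
        run("git", "merge", "upstream/master")
        run("git", "push")
        # create and publish the PR work branch
        run("git", "checkout", "-b", branch)
        run("git", "push", "--set-upstream", "origin", branch)

    make_pr_branch("https://github.com/fastai/fastai.git", "typo-fix")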

Moreover, @devforfu has been working on porting this bash script to Python, so that windows users without bash can benefit from it too. And here you can see delegation at work: Ilia is much more experienced in Python, and he can do a better and faster job of the porting.

Also, sharing all kinds of related notes often helps. This file was primarily my cheat sheet: https://docs.fast.ai/dev/git.html - then someone said, hey dude, you should share it, so I did. So if you have things you find useful for yourself, share them, and you’d be surprised how many other people benefit from them.

Conclusion

Often, when you are just starting with a new project in a new language or domain, you most likely still have good know-how from your previous projects that could be of great use. Just think out of the box, and remember that it’s ok if your solution is not perfect, doesn’t follow the accepted norms, or is in the wrong language. That’s a start; you can always improve things afterwards. The important thing is to get things done, so that you feel good and you help someone save time.

7 Likes

Case study: Helping out Sylvain with PRs and Issues

I’ll be honest with you: as of today, watching Sylvain’s bug-fixing commits, most of the time I have no idea what he’s doing. It’s all voodoo stuff which one day I hope to be able to follow and code myself.

But I thought that perhaps there are some things he does in his handling of PRs and Issues that I could do instead of him, to free up his time, so that he has more time for creating an even more amazing fastai library.

It didn’t take long to find one. Perhaps you don’t know this, but all contributions to the fastai project require contributors to sign a Contributor License Agreement, and so, often, a new contributor, unaware of this requirement, submits a PR which we can’t accept until the CLA is signed. Usually you’d then see a failed CLA check at the bottom of the PR page.

And then Sylvain would comment:

Please sign this CLA agreement as explained here before we can proceed. Thank you.

And I said, I can do that too! And so can you. So unless someone beats me to it, I just do it. It doesn’t save Sylvain a great amount of time, but it all adds up, and it also speeds up the submission-to-merge process, since the contributor might take some hours and sometimes days to comply (often they have to ask permission from their employers).

Of course, typing that reply every time would be a waste of time, so github has reply templates. Unfortunately, they have to be configured manually by each github user, and there is no way to have them pre-set per project. If you’d like to use our templates, it’s all documented here: https://github.com/fastai/fastai/blob/master/.github/issue_reply_templates.md

And once you’ve configured them, here is how you use them: just click on the upper right corner of the reply box and you get a dropdown with the pre-made replies. Then pick the one you need, hit [Comment], and you’re done.

Even a total beginner in fastai can do that.

Among the reply templates I linked to above, you will find a few other templates that you can also use to help the maintainers:

  • We don’t deal with install issues in the github Issues, but have dedicated forum threads for that (most issues have already been resolved and discussed in those threads, so the solution is most likely already there). So when someone posts an install question, we just reply with one of the following two replies and close the Issue.

    • fastai: install issues [v1]

      fastai 1.0.x installation issues should be reported/discussed here instead. Thank you.

    • fastai: install issues [v0]

      fastai 0.7.x installation issues should be reported/discussed here instead. Thank you.

  • And then we have PRs from contributors who either haven’t read the steps for setting up the fastai repo or forgot to follow them; when that happens, the CI will report a [fastai.fastai (nbstripout_config)] failure. In that case I reply with this template (and so can you):

    • fastai: unstripped notebook

      If your PR involves jupyter notebooks ( .ipynb ) you must instrument your git to nbstripout the notebooks, as explained here. PRs with unstripped out notebooks cannot be accepted.

So here you go: you now have at least 3 ways you can help the maintainers, PR contributors, and Issue submitters w/o knowing much about fastai.

If you observe the maintainers at work, you will notice other little things you could help with. Just watch the process and see if you can save them time by taking over some activities that you understand and feel comfortable running with. Don’t be afraid to make a mistake; it’ll all get sorted out if one happens.

BTW, to save yourself time and avoid clicking around github a lot, you might want to sign up for email notifications for PRs and Issues in the github fastai projects, so that you get notified when new entries are submitted, and you can also see previews of PRs/Issues in the notification emails. We also have a commit diff mailing list if you prefer to watch diff emails instead of using github: Notes For Developers – fastai

3 Likes

Case study: Writing a new unit test and a doc entry for image resize functionality

Recently, I was doing some training setup that involved variable-sized images and stumbled when it didn’t work. I was only able to find examples here and there, and even the forums weren’t helpful. Since I needed this problem solved, I decided to first write a few simple tests so that I could report the bug and have it resolved. I had submitted a similar bug report earlier, but @sgugger couldn’t find what the problem was without me giving him some reproducible code to work with, which I had initially failed to provide.

Part 1: Writing the test

footnote: In case you don’t know, in the fastai test suite we use small subsets of real datasets, so that the test execution completes within seconds and not hours. These are the datasets that have _TINY in their name; as of this writing, in fastai/datasets.py you will find: COCO_TINY, MNIST_TINY, MNIST_VAR_SIZE_TINY, PLANET_TINY, CAMVID_TINY. These are the ones you want to use for testing.

Apparently everything worked just fine as long as transforms were involved, but without transforms it’d just break. I also wasn’t very clear on why some examples used the data block API whereas others used factory methods, while working with the same dataset. It was quite confusing.

So I started with a simple test running on a fixed-size dataset that I knew would work, since I pretty much copied an existing working test, and added some extra verifications that weren’t there originally.

    from fastai.vision import *
    path = untar_data(URLs.MNIST_TINY) # 28x28 images
    fnames = get_files(path/'train', recurse=True)
    pat = r'/([^/]+)\/\d+.png$'
    size=14
    data = ImageDataBunch.from_name_re(path, fnames, pat, size=size)

    x,_ = data.train_ds[0]
    size_want = size
    size_real = x.size
    assert size_want == size_real, f"size mismatch after resize {size} expected {size_want}, got {size_real}"

and it worked.

In this test, I set up the data object just like it’s done in the first lessons of the fastai course, and then I take the first object of the train dataset and check that it indeed got resized. I hope you’re with me so far.

The assert does the checking, and the last part of the assert is set up to give me meaningful debug information in case of failure. You will see later how it becomes useful.

So this was my baseline and then I could start doing experiments with it by changing things around.

Next, I pretty much did the same thing, but with a variable image size dataset:

    path = untar_data(URLs.MNIST_VAR_SIZE_TINY)

and it worked too.

Then I replaced the factory method from_name_re:

    data = ImageDataBunch.from_name_re(path, fnames, pat, size=size)

with the data block API:

    data = (ImageItemList.from_folder(path)
            .no_split()
            .label_from_folder()
            .transform(size=size)
            .databunch(bs=2)
            )

and it worked with the fixed images dataset, but it failed with the variable size images dataset.

So I submitted a bug report, and someone else filed a similar one with a great test case that reproduced the problem. Meanwhile, I decided to expand the test to cover all the various sizes (int, square and non-square tuples), resize methods, and types of datasets. First I did it separately for each way of doing it, and then I started to slowly refactor to avoid duplicated code. (Duplicated code often leads to bugs.)

After many iterations (many of which were just broken), the many tests morphed into a complete unit test that covers 18 different configuration permutations, in both possible ways of performing a resize: (1) the factory method and (2) the data block API. Here it is:

# this is a segment of tests/test_vision_data.py
from fastai.vision import *
from utils.text import *

rms = ['PAD', 'CROP', 'SQUISH']

def check_resized(data, size, args):
    x,_ = data.train_ds[0]
    size_want = (size, size) if isinstance(size, int) else size
    size_real = x.size
    assert size_want == size_real, f"[{args}]: size mismatch after resize {size} expected {size_want}, got {size_real}"

def test_image_resize(path, path_var_size):
    # in this test the 2 datasets are:
    # (1) 28x28,
    # (2) var-size but larger than 28x28,
    # and the resizes are always less than 28x28, so it always tests a real resize
    for p in [path, path_var_size]: # identical + var sized inputs
        fnames = get_files(p/'train', recurse=True)
        pat = r'/([^/]+)\/\d+.png$'
        for size in [14, (14,14), (14,20)]:
            for rm_name in rms:
                rm = getattr(ResizeMethod, rm_name)
                args = f"path={p}, size={size}, resize_method={rm_name}"

                # resize the factory method way
                with CaptureStderr() as cs:
                    data = ImageDataBunch.from_name_re(p, fnames, pat, ds_tfms=None, size=size, resize_method=rm)
                assert len(cs.err)==0, f"[{args}]: got collate_fn warning {cs.err}"
                check_resized(data, size, args)

                # resize the data block way
                with CaptureStderr() as cs:
                    data = (ImageItemList.from_folder(p)
                            .no_split()
                            .label_from_folder()
                            .transform(size=size, resize_method=rm)
                            .databunch(bs=2)
                            )
                assert len(cs.err)==0, f"[{args}]: got collate_fn warning {cs.err}"
                check_resized(data, size, args)

It may look complicated, but it’s very, very simple - it does exactly the same simple things I described at the beginning of this post, just tests them in 18 different ways (2 datasets × 3 sizes × 3 resize methods), via 3 loops! Remember, it was written in stages and slowly improved upon.

The only new thing I haven’t covered so far is the CaptureStderr context manager that we have in our test utils. It helps us test whether fastai emitted any warnings, which most of the time indicates a problem waiting to happen. The test needs to make sure our data is set up correctly and doesn’t emit any warnings (in the library this check is done by a function called sanity_check()). So this assert validates that nothing was sent to stderr:

  assert len(cs.err)==0, f"[{args}]: got collate_fn warning {cs.err}"

You can do the same using pytest’s capsys fixture, but ours is better because, being a context manager, it’s more of a “scalpel”, whereas capsys is a bit of a “hammer”, when it comes to localizing the stderr capturing.
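
If you are curious, the core of such a context manager fits in a dozen lines. Here is a minimal sketch, not the actual fastai utils.text implementation:

    import io, sys

    class CaptureStderr:
        # anything written to stderr inside the `with` block ends up in .err
        def __enter__(self):
            self.saved, self.buf = sys.stderr, io.StringIO()
            sys.stderr = self.buf   # redirect stderr into our buffer
            return self
        def __exit__(self, *exc):
            sys.stderr = self.saved # restore the real stderr
            self.err = self.buf.getvalue()
            return False            # don't swallow exceptions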

I submitted the test, which also included:

@pytest.mark.skip(reason="needs fixing")
def test_image_resize(path, path_var_size):
...

because it was failing. That way we know that this test needs fixing, and it doesn’t affect our CI (Continuous Integration) checks.

The next morning, Sylvain fixed the bug and removed the test’s skip directive, and voila: we now have resize-without-transforms covered 100%, and it will never break in future released versions, because the test suite will defend against it.

You can see this test as it was submitted here.

If you want to run this test, you’d just do:

pytest -sv -k test_image_resize tests/test_vision_data.py

And in case you didn’t know - we have a testing guide, which is full of useful notes.

Part 2: Writing the resize documentation

Now I will tell you a big secret: I write documentation primarily for self-serving reasons. I’m a lazy person and I don’t like figuring things out all the time. I enjoy the process of figuring out something once, but repeating the same figuring out is just exhausting. Therefore I tend to write down everything I think I might use again in the future. That’s why I write a lot of docs. I am happy to share them with anybody who wants them, but their main use is for myself. Making them public also ensures that if I lose my copy, I can restore it later from the public one.

The same goes for tests: I write tests so that I don’t need to figure out why my code stopped working when a new release of the fastai library broke previously working functionality. By writing tests I ensure future peace of mind for myself. Others benefiting from them is a nice side effect. And my ego is happy!

So back to our case study: now that I had written this test, I knew everything I needed to know about the resize functionality in the fastai library (the user side of it). And since there is so much complex, ever-changing tech stuff I need to cope with on a daily basis, I knew I would forget this hard-earned knowledge, so I decided it would pay off to invest a bit more time to write a summary of what I had learned.

And so I did, and now there is a new entry that documents all the possible ways you could resize images in fastai: https://docs.fast.ai/vision.transform.html#resize

It’s literally the same as the test that I wrote, except it’s done in words and organized for easier understanding.

Then I realized that resizing images on the fly is very inefficient if you have a lot of them and they are large. Therefore I expanded that section to explain how to resize images before doing the training, stealing one example from Jeremy’s class notebook so there was an example in Python, and sharing the command-line way using the imagemagick method I normally use. (And that part of the doc entry could use more examples and more ways of doing it, including the pros and cons of those different ways. Hint, hint.)
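
To make the idea concrete, pre-resizing a folder of images in Python can be as simple as this. A minimal sketch, not the notebook example mentioned above; the paths and target size are made up:

    from pathlib import Path
    from PIL import Image

    src, dst, size = Path("data/train"), Path("data/train_small"), 224

    # resize every image once, up front, instead of on every training epoch
    for f in src.rglob("*.jpg"):
        out = dst / f.relative_to(src)
        out.parent.mkdir(parents=True, exist_ok=True)
        Image.open(f).resize((size, size), Image.BILINEAR).save(out)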

Conclusion

So you can now tell that, most likely, before you write documentation you need to understand how the API or use case you are about to document works. And since the only way to really understand something is by using it, you will have to write some code. And if you’re going to write some code anyway, why not write a unit test for the fastai library?

If it’s a use case involving some specific actions, then a small or large tutorial on the steps to reproduce your process is called for.

5 Likes

Both of these top posts are up-to-date, i.e. they get updated as things get completed or new needs arise.

This is the idea: you don’t need to read the thread, just the top post. The thread is long because it includes the discussions of all those topics, which are then summarized in the top post.

Perhaps we should apply this split method to all threads: one thread with just the up-to-date summary and another for discussion, because most users don’t realize that in those few threads the first post is special, and not just a starter for the 800-post thread.

Great! I’ll use them accordingly going forward.

Is there a way to pin a post to the top of a thread? I know that if the first post is a wiki we can update it with relevant content, but it might be useful to, for example, pin your case studies to the top of this thread, so that the first five posts always contain all the information anyone needs. The upvoting-to-summarize route doesn’t work unless there are enough people liking posts.

I think I posted enough by now and it’s your turn, so that next we can put all the case studies into the summary thread. I won’t put mine there until you post yours! :slight_smile:

I don’t think having one person post all the case studies would make for an inspiring case, since a reader can always find an excuse that they are not wired that way and there is no way they can do something similar - I know it can be intimidating. I know I’m special. And each one of you is too. But not all of you choose to believe this is the case.

This is the intention behind these sharings: to show others that each person has something great to contribute. Someone writes notes, someone helps new users, another person loves trying outrageous things, yet another has 20 GPU cards and wants to finish training cats vs. dogs in 5 msecs, etc. Which one is you?

Just discovered this inspiring sharing by Sylvain:
https://www.fast.ai/2019/01/02/one-year-of-deep-learning/

2 Likes

I had to research this; I’m new to Discourse, so there might be better ways.

From reading through the Discourse forums, it looks like there are two ways to re-order posts:

  1. Move all the posts you don’t want at the top to a new temporary topic, which will push the remaining posts up, and then move the moved posts back. And it says you have to add a new post first… very hackish, but doable.

  2. Some kind of posts:reorder_posts rake task: A way to reorder posts? - #14 by marcozambi - feature - Discourse Meta. I have no idea what it means.