Documentation effort

I guess most people would agree that fast.ai deserves better documentation, and a lot of people on this forum would benefit from it.

I would like to contribute back to fast.ai by helping to build this documentation, and I suspect other members of this forum may want to join in (documenting is a great way to learn more about the library).

Is there already an organised documentation effort in place?

Since we are in the process of redoing everything (a bit) differently, documentation of the current library is more or less on hold. For fastai_v1, however, we will definitely want something better than what is available right now, with notebooks in Colab and/or SageMaker so that users can immediately play around with the parameters.

Nothing has really been written yet, as we are in the early stages of development, but stay tuned! We’ll ask for help with testing and documenting as soon as the core is ready.

I don’t have much experience with Colab / SageMaker, but that sounds quite cool for people who don’t have their machine set up / are only starting their journey :+1:

I was planning on doing Docker setups for the new functionality: you run a Docker container, it pulls the data for you, then you open a Jupyter notebook, hit “run all cells”, and it works. Anyhow, maybe I will go with this idea at least to some extent.

It’s a bit of a pain that getting the data is so involved. For instance, the ImageNet data is on Kaggle, but I believe you need to log in to pull it, or authenticate via the API… For Colab / SageMaker the data could probably be preloaded? It would be nice to have some sort of canonical repository of the datasets used for the fastai lectures: CIFAR-10, Dogs vs. Cats, IMDb, Pascal VOC, maybe even COCO, though I guess the legality of that comes into play… I have been downloading things from pjreddie.com - this gentleman is kind enough to host CIFAR-10 and Pascal, so at least there’s that :slight_smile:
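For what it’s worth, the Kaggle hoop looks roughly like this with the official kaggle package - a sketch only, assuming you have already saved an API token to ~/.kaggle/kaggle.json and accepted the competition rules on the site (the competition slug is just an example):

```python
# Sketch, not a recipe: requires `pip install kaggle`, an API token at
# ~/.kaggle/kaggle.json, and having accepted the competition rules on the
# website. The competition slug below is only an example.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads the token from ~/.kaggle/kaggle.json
api.competition_download_files("dogs-vs-cats", path="data")  # saves a zip into ./data
```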

I suspect that Kaggle Kernels might be better than Colab for this reason - we can have the datasets there already.

As well as the notebooks, there will be complete documentation in regular web pages - and help will certainly be appreciated on this once we’ve got some functionality finalized so it can be documented!

Putting documentation on hold makes sense if the API is going to change. Please let us know when would be a good time to get involved.

Why not put the datasets on http://archive.org/? It’s free and requires no authentication to access. There are hundreds of TBs of data dumps there, e.g.: https://archive.org/details/archiveteam

We could create a nice little fastai account and put all kinds of data goodies there.

Can anyone just upload a dataset there? When I click “upload” it says: “Please contribute books, audio, and video files that you have the right to share.” It doesn’t mention adding datasets.

A benefit of putting them on kaggle datasets is that kaggle kernels can then access them directly, which really helps accessibility.

I looked a bit more; it’s loaded with datasets already: https://archive.org/search.php?query=dataset
They even have a dedicated collection for datasets: https://archive.org/details/datasets - so if you tag a newly uploaded dataset with that collection name, it’ll end up there.

For example, go to the dataset’s entry page, click on “Show All” in the right column (on a wide screen), and you get a direct link to the dataset:

https://archive.org/download/imdb-dataset-2017-10-11/imdb-dataset-2017-10-11.tar.gz

Done.
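Pulling that archive then needs nothing more than the direct link and the Python standard library - no login, no API key:

```python
import tarfile
import urllib.request
from pathlib import Path

url = "https://archive.org/download/imdb-dataset-2017-10-11/imdb-dataset-2017-10-11.tar.gz"
dest = Path("data")
dest.mkdir(exist_ok=True)
archive = dest / "imdb-dataset-2017-10-11.tar.gz"

if not archive.exists():
    urllib.request.urlretrieve(url, archive)  # plain HTTPS GET, no authentication
with tarfile.open(archive) as tar:
    tar.extractall(dest)  # unpacks the dataset under ./data
```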

As long as the dataset is in the public domain, or has a license that supports sharing, it should fit just right on archive.org.

As I said earlier, I wouldn’t worry about the size of the datasets - they host some data that is many, many TBs big.

http://academictorrents.com/ is another place, with 26 TB of datasets as of this moment. But I’d trust archive.org much more in the long term; it’ll be around long after academictorrents is gone. Still, it might be good as a backup.

Edit: please scratch that suggestion - it won’t work because it provides no direct links and requires torrent/magnet downloads.

Interesting - why is that? I have known their “Wayback Machine” for a long time, but I wouldn’t bet they are willing to host many large datasets (storage and bandwidth can be quite costly).

I meant that between a small private endeavor and a huge established archive.org, the latter has a better chance to survive.

Plus, you can see that it already supports datasets as shown in my earlier post here.

According to its wiki: “As of October 2016, its collection topped 15 petabytes.” So what’s another few TBs? :wink:

That’s why I thought it would be worth trying.

Yes, that’s a good point, although their primary mission is the archival of the web through continued crawling. If people start using it as a means of sharing large datasets, my intuition is that when they need to reduce costs, this is the first thing they will cut.

How about using unit tests (with pytest, for example) as a way to gently start the documentation effort? When refactoring a function such as compose from a notebook into a module, it may take only a small effort to write a unit test that gives users a sense of what compose does.
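Something along these lines, for example - the compose below is a guess at what that refactored function looks like rather than the actual fastai code, but the tests double as a usage example:

```python
# Hypothetical compose(), roughly what gets refactored out of the notebooks;
# the real fastai definition may differ. Run the tests with `pytest`.
def compose(x, funcs):
    "Apply each function in `funcs` to `x` in order and return the result."
    for f in funcs:
        x = f(x)
    return x

def test_compose_applies_functions_in_order():
    add_one = lambda n: n + 1
    double = lambda n: n * 2
    assert compose(3, [add_one, double]) == 8  # (3 + 1) * 2

def test_compose_with_no_functions_returns_input():
    assert compose("unchanged", []) == "unchanged"
```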

We’ve decided today that we’re going to have a go at using Jupyter notebooks as the source of our docs, plus some custom conversion tools to turn them into proper hyperlinked documentation. More info coming soon!
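For a rough idea of the notebook-to-web-page step (not our actual tooling - just a sketch using the off-the-shelf nbconvert library, with placeholder directory names):

```python
# Illustration only, using nbconvert (pip install nbconvert); the real fastai
# conversion tools are custom. `docs_src/` and `docs/` are placeholder names.
from pathlib import Path

import nbformat
from nbconvert import HTMLExporter

exporter = HTMLExporter()
out_dir = Path("docs")
out_dir.mkdir(exist_ok=True)

for nb_path in Path("docs_src").glob("*.ipynb"):
    nb = nbformat.read(str(nb_path), as_version=4)       # load the notebook
    html, _resources = exporter.from_notebook_node(nb)   # render it to HTML
    (out_dir / f"{nb_path.stem}.html").write_text(html, encoding="utf-8")
```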

Update: as of Aug 12, many of the datasets we use are already on archive.org, e.g. CIFAR-10, whose entry page is here. And many other datasets are here.

When you go to an entry page like the CIFAR-10 one, click on “Show All” to get the complete file listing. That’s where you get the direct link to the dataset archive, which should stay the same for all of them.

IMHO, documentation by example is very useful, but at the end of the day there is no substitute for thorough documentation of the API itself. Especially when an API relies so much on indirection (e.g., most of the magic in method X is done by these four public methods, which receive their arguments through additional arguments passed to method X), not having actual documentation can make using the library very difficult.

As a case in point, I was working a couple of weeks ago with the data block API. The narrative introduction to that section of the docs reads:

The data block API is called as such because you can mix and match each one of those blocks with the others, allowing for a total flexibility to create your customized DataBunch for training, validation and testing.

This (justifiably, I believe) led me to believe that the methods in this API were chainable, as a number of the examples demonstrate. The problem is, some of those methods are chainable and some aren’t. The only way to understand what can be chained with what is by knowing which class a given method is defined on, and by knowing what a given method returns. A good number of the methods on that page give no hint of where they are defined (short of going to the source), and no indication of the return value.
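To make that concrete, here is the sort of chain the examples show, annotated with what each step returns - exactly the information that is hard to dig up. (Class and method names are taken from the fastai v1 docs as I understand them and have shifted between releases - e.g. ImageList used to be ImageItemList - so treat this as a sketch rather than gospel.)

```python
# A typical data block chain, annotated with the class each step returns.
from fastai.vision import ImageList, get_transforms

path = "data/my_images"  # placeholder: train/ and valid/ folders, one subfolder per class

data = (ImageList.from_folder(path)             # -> ImageList
        .split_by_folder()                      # -> ItemLists (train/valid split)
        .label_from_folder()                    # -> LabelLists (labels from folder names)
        .transform(get_transforms(), size=224)  # -> LabelLists (with transforms attached)
        .databunch(bs=64))                      # -> DataBunch, ready for a Learner
```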

So, I agree the fast.ai library desperately needs better documentation, but not just in example form—actual documentation. I’d be more than happy to help contribute to this. Has anyone spearheaded this effort yet?

As an aside, I’ve been bitten more than once by fast.ai’s non-adherence to semantic versioning. (And, I do understand that this is likely a conscious choice the maintainers have made.) However, do the maintainers expect to reach a point where a patch-level revision really is a patch-level revision in the semver sense (i.e., anything introduced in a minor or patch-level revision is completely backward-compatible, and backward-compatibility is only ever broken in major revisions)?

Check this, @Brennon: Documentation improvements

I think backwards compatibility becomes relevant once organisations have the code in production.

What do you mean by that?

You are more than welcome to submit a PR to add more documentation; I don’t believe the choice of doing it in notebooks (which was the topic here) will limit you in any way.

We are getting there and are almost at that point. By the end of the second part of the course there will be a stable release.

I mean that in many “chainable” APIs, you can take the return value of whichever method and call some other method of your choice (think jQuery, etc.). Sure, there may be some restrictions, but it’s awfully hard to know what those are if the return values aren’t well documented. In other words, in the data block API you need to know whether something returns an ItemList, a DataBunch, or something else before knowing what you can call. Python’s dynamic typing complicates this even further.

With respect to notebooks, I think that kind of documentation is useful, but in my opinion it is no replacement for documentation that lives with / is generated by the code itself (built off of type hints, etc.).
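Even a small amount of that goes a long way. Purely as an illustration (hypothetical code, not the actual fastai API): with return types spelled out, the signature alone - via a doc generator or plain help() - tells you what you can chain next:

```python
# Hypothetical sketch, not fastai code: return-type annotations plus one-line
# docstrings are enough for generated docs to answer "what can I call next?".
from __future__ import annotations

class ItemList:
    def split_by_folder(self, train: str = "train", valid: str = "valid") -> ItemLists:
        "Split the items into training and validation sets by parent folder name."
        return ItemLists()

class ItemLists:
    def label_from_folder(self) -> LabelLists:
        "Label each item with the name of the folder it lives in."
        return LabelLists()

class LabelLists:
    def databunch(self, bs: int = 64) -> DataBunch:
        "Bundle the labelled sets into a `DataBunch` ready for training."
        return DataBunch()

class DataBunch:
    "Holds the training and validation data."
```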
