I guess most people would agree that fast.ai deserves better documentation, and a lot of people on this forum would benefit from it.
I would like to contribute back to fast.ai by helping to build this documentation, and I guess other members of this forum may also want to join in (documenting is a great way to learn more about the library).
Is there already an organised documentation effort in place?
Since we are in the process of redoing everything (a bit) differently, documentation of the current library is more or less on hold. For fastai_v1, however, we will definitely want something better than what is available right now, with notebooks on Colab and/or SageMaker so that the user can immediately play around with the parameters.
Nothing has been really written yet, as we are in the first stages of development, but stay tuned! We’ll ask for help with testing and documenting as soon as the core is ready.
I don’t have much experience with Colab / SageMaker, but that sounds quite cool for people who don’t have their own machine set up / are only starting their journey.
I was planning on doing Docker setups for the new functionality: you run a Docker container, it pulls the data for you, you open a Jupyter notebook, hit run-all-cells, and it works. Anyhow, maybe I will go with this idea at least to some extent.
It sucks a bit that getting data is so involved. For instance, the ImageNet data is on Kaggle, but I believe you need to log in to pull it, or authenticate via the API… Probably for Colab / SageMaker the data can be preloaded? It would be nice to have some sort of canonical repository of the datasets used for the fast.ai lectures: CIFAR-10, Dogs vs. Cats, IMDb, Pascal VOC, maybe even COCO, though I guess legality comes into play for some of these. I have been downloading things from pjreddie.com - this gentleman is kind enough to host CIFAR-10 and Pascal, so at least there’s that.
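For what it’s worth, the client side of such a canonical repository wouldn’t need much; here is a minimal sketch of a download-and-extract helper (the URL in the usage comment is a placeholder, not a real mirror):

```python
import tarfile
import urllib.request
from pathlib import Path

def download_and_untar(url, dest='data'):
    """Download a .tgz dataset archive into `dest` and extract it, skipping the download if cached."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / Path(url).name
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive) as tar:
        tar.extractall(path=dest)
    return dest

# download_and_untar('https://example.com/cifar10.tgz')  # placeholder URL, not a real mirror
```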
I suspect that kaggle kernels might be better than colab for this reason - we can have the datasets there already.
As well as the notebooks, there will be complete documentation in regular web pages - and help will certainly be appreciated on this once we’ve got some functionality finalized so it can be documented!
Can anyone just upload a dataset there? When I click “upload” it says: “Please contribute books, audio, and video files that you have the right to share.” It doesn’t mention adding datasets.
A benefit of putting them on kaggle datasets is that kaggle kernels can then access them directly, which really helps accessibility.
http://academictorrents.com/ is another place, with 26 TB of datasets at the moment. But I’d trust archive.org much more long term. It’ll be around long after academictorrents is gone - though academictorrents might be good as a backup.
edit: please scratch that suggestion - it won’t work because it provides no direct links, and requires torrent/magnet downloads.
Interesting, why is that? I have known their “wayback machine” for a long time, but wouldn’t bet they are willing to host many large datasets (storage and bandwidth can be quite costly).
Yes that’s a good point, although their primary mission is the archival of the web through continued crawling. If people start using it as a means of sharing large datasets, my intuition is that when they need to reduce costs, this is the first thing they will cut.
How about using unit tests (with pytest, for example) as a way to gently start the documentation effort? When refactoring a function such as compose from a notebook into a module, it may only be a small extra effort to write a unit test that gives users a sense of what compose does. See the sketch below.
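A couple of pytest tests could read almost like usage documentation. A rough sketch (the import path and the signature of compose are assumptions here, since the refactor hasn’t happened yet):

```python
# test_compose.py -- run with `pytest`
# Both the import path and the signature of `compose` are assumptions; adjust them
# to wherever (and however) the function ends up after the refactor.
from fastai.core import compose

def test_compose_applies_functions_left_to_right():
    add_one = lambda x: x + 1
    double = lambda x: x * 2
    # compose([f, g])(3) -> double(add_one(3)) -> 8
    assert compose([add_one, double])(3) == 8

def test_compose_of_nothing_is_identity():
    assert compose([])(5) == 5
```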
We’ve decided today we’re going to have a go at using jupyter notebooks as the source of our docs, and some custom conversion tools to turn them into proper hyperlinked documentation. More info coming soon!
update: as of Aug 12, many of the datasets we use are already on archive.org, e.g.: cifar-10, whose entry page is here. And many other datasets are here.
When you go to an entry page like cifar10, click on Show All to get the complete file listing. That’s where you get the direct link to the dataset archive, which should stay the same for all of them.
IMHO, documentation by example is very useful, but at the end of the day, there is no substitute for thorough API documentation itself. Especially when an API relies so much on indirection (e.g., most of the magic in method X is done by these four public methods that receive their arguments through additional arguments passed to method X), not having actual documentation can make using the library very difficult.
As a case in point, I was working a couple of weeks ago with the data block API. The narrative introduction to that section of the docs reads:
The data block API is called as such because you can mix and match each one of those blocks with the others, allowing for a total flexibility to create your customized DataBunch for training, validation and testing.
This (justifiably, I believe) led me to believe that the methods in this API were chainable, as a number of the examples demonstrate. The problem is, some of those methods are chainable and some aren’t. The only way to understand what can be chained with what is by knowing which class a given method is defined on, and what that method returns. A good number of the methods on that page give no hint of where they are defined (short of reading the source), and no indication of the return value.
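To illustrate, here is roughly the kind of chain the examples show (fastai v1 data block API as I understand it, so treat the intermediate types as illustrative); the comments are exactly the information the docs leave out:

```python
from fastai.vision import ImageList, get_transforms, imagenet_stats

path = 'data/my_images'                    # a folder of images organised by class (hypothetical)

data = (ImageList.from_folder(path)        # -> ImageList
        .split_by_rand_pct(0.2)            # -> ItemLists (train/valid split)
        .label_from_folder()               # -> LabelLists
        .transform(get_transforms(), size=224)
        .databunch(bs=64)                  # -> ImageDataBunch
        .normalize(imagenet_stats))
```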
So, I agree the fast.ai library desperately needs better documentation, but not just in example form—actual documentation. I’d be more than happy to help contribute to this. Has anyone spearheaded this effort yet?
As an aside, I’ve been bitten more than once by fast.ai’s non-adherence to semantic versioning. (And, I do understand that this is likely a conscious choice the maintainers have made.) However, do the maintainers expect to reach a point where a patch-level revision really is a patch-level revision in the semver sense (i.e., anything introduced in a minor or patch-level revision is completely backward-compatible, and backward-compatibility is only ever broken in major revisions)?
You are more than welcome to submit a PR to add more documentation; I don’t believe the choice of doing it in notebooks (which was the topic here) will limit you in any way.
We are getting there and are almost at that point. By the end of the second part of the course there will be a stable release.
I mean that in many “chainable” APIs, you can take the return value of any method and call some other method of your choice (think jQuery, etc.). Sure, there may be some restrictions, but it’s awfully hard to know what those are if the return values aren’t well documented. In other words, in the data block API you need to know whether something returns an ItemList, a DataBunch, or something else before knowing what you can call. Python’s dynamic typing complicates this even further.
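Until those return values are documented, about the only recourse is to inspect them at runtime, e.g. (again, fastai v1 names):

```python
from fastai.vision import ImageList

src = ImageList.from_folder('data/my_images').split_by_rand_pct(0.2)
print(type(src))                        # which class did we actually get back?
print(type(src.label_from_folder()))    # and what does the next step return?
```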
With respect to notebooks, I think that kind of documentation is useful, but in my opinion it is no replacement for documentation that lives with / is generated from the code itself (built off of type hints, etc.).
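For instance, something as simple as a type-hinted signature plus a docstring lets a doc generator (or plain help()) surface the return type without the reader chasing the source. The compose below is just a made-up illustration, not the library’s implementation:

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar('T')

def compose(funcs: Sequence[Callable[[T], T]]) -> Callable[[T], T]:
    """Return a function that applies each function in `funcs` left to right.

    `compose([f, g])(x)` is equivalent to `g(f(x))`; an empty sequence gives the identity.
    """
    def composed(x: T) -> T:
        for f in funcs:
            x = f(x)
        return x
    return composed
```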