Documentation effort


(Fred Guth) #1

I guess most people would agree that fast.ai deserves better documentation, and a lot of people on this forum would benefit from it.

I would like to contribute back to fast.ai by helping to build this documentation, and I guess other members of this forum may want to join in as well (documenting is a great way to learn more about the library).

Is there already an organised documentation effort in place?


#2

Since we are in the process of redoing everything (a bit) differently, documentation of the current library is more or less on hold. For fastai_v1, however, we will definitely want something better than what is available right now, with notebooks on Colab and/or SageMaker so that users can immediately play around with the parameters.

Nothing has really been written yet, as we are in the first stages of development, but stay tuned! We’ll ask for help with testing and documenting as soon as the core is ready.


#3

I don’t have much experience with Colab / SageMaker, but that sounds quite cool for people who don’t have their machine set up / are only starting their journey :+1:

I was planning on doing Docker setups for the new functionality: you run a Docker container, it pulls the data for you, and then you can open a Jupyter notebook, hit “run all cells”, and it works. Anyhow, I may still go with this idea at least to some extent.

It’s a bit of a pain that getting the data is so involved. For instance, the ImageNet data is on Kaggle, but I believe you need to log in to pull it, or authenticate via the API… For Colab / SageMaker the data could probably be preloaded? It would be nice to have some sort of canonical repository of the datasets used in the fastai lectures: CIFAR10, Dogs vs. Cats, IMDb, Pascal VOC, maybe even COCO, though I guess the legality of that comes into play… I have been downloading things from pjreddie.com - this gentleman is kind enough to host CIFAR10 and Pascal, so at least there’s that :slight_smile:
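
Something along these lines is what I had in mind for the “pulls the data for you” part - just a sketch, where the helper name and URL are placeholders rather than anything that exists in fastai:

```python
# Sketch of a "pull the data for you" helper (not part of fastai);
# the URL passed in is a placeholder for wherever a dataset is hosted.
import tarfile
import urllib.request
from pathlib import Path

def download_and_untar(url, dest="data"):
    """Download a .tgz/.tar.gz archive if not already cached, then extract it."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]
    if not archive.exists():                     # simple caching: skip re-download
        urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive) as tar:
        tar.extractall(path=dest)
    return dest

# e.g. download_and_untar("https://example.com/cifar10.tgz")  # placeholder URL
```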


(Jeremy Howard) #4

I suspect that Kaggle Kernels might be better than Colab for this reason - we can have the datasets there already.
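
(For context: inside a Kaggle kernel an attached dataset shows up as read-only files, historically under ../input, so a docs notebook can use it without any download step. The paths below are illustrative, not an actual fastai layout.)

```python
# Inside a Kaggle kernel, attached datasets appear as read-only files,
# historically under ../input (illustrative paths, not a fastai layout).
import os

print(os.listdir("../input"))                  # list the attached datasets
# import pandas as pd
# df = pd.read_csv("../input/some-dataset/train.csv")   # hypothetical file
```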

As well as the notebooks, there will be complete documentation in regular web pages - and help will certainly be appreciated on this once we’ve got some functionality finalized so it can be documented!


(Fred Guth) #5

Putting documentation on hold makes sense if the API is going to change. Please let us know when would be a good time to get involved.


(Stas Bekman) #6

Why not put the datasets on http://archive.org/? It’s free and requires no auth to access. There are hundreds of TBs of data dumps there, e.g.: https://archive.org/details/archiveteam

We could create a nice little fastai account and put all kinds of data goodies there.


(Jeremy Howard) #7

Can anyone just upload a dataset there? When I click “upload” it says: “Please contribute books, audio, and video files that you have the right to share.” It doesn’t mention adding datasets.

A benefit of putting them on Kaggle Datasets is that Kaggle kernels can then access them directly, which really helps accessibility.


(Stas Bekman) #8

I looked a bit more; it’s loaded with datasets already: https://archive.org/search.php?query=dataset
They even have a dedicated collection for datasets: https://archive.org/details/datasets - so if you tag a newly uploaded dataset with that collection name, it’ll end up there.
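
archive.org also has an official internetarchive Python package, so the tagging is just a metadata field. A rough sketch (the identifier, file name and credentials are placeholders, and I haven’t verified whether the datasets collection accepts uploads from any account):

```python
# Rough sketch using the `internetarchive` package (pip install internetarchive).
# Identifier, file name and credentials are placeholders; landing in the
# "datasets" collection may depend on account permissions.
from internetarchive import upload

upload(
    "fastai-cifar10-example",              # unique item identifier (placeholder)
    files=["cifar10.tgz"],                 # local file(s) to upload
    metadata={
        "title": "CIFAR-10 (example upload)",
        "mediatype": "data",
        "collection": "datasets",          # the collection mentioned above
    },
    access_key="YOUR_IA_ACCESS_KEY",       # placeholder credentials
    secret_key="YOUR_IA_SECRET_KEY",
)
```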

For example, go to a dataset’s entry page and click on “Show All” in the right column (on a wide screen), and you get a direct link to the dataset:

https://archive.org/download/imdb-dataset-2017-10-11/imdb-dataset-2017-10-11.tar.gz

Done.
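
That direct link can then be fetched and unpacked with nothing but the Python standard library - just a sketch, with an arbitrary destination path:

```python
# Fetch and unpack the direct archive.org link using only the standard library
# (the destination path is arbitrary).
import tarfile
import urllib.request

url = "https://archive.org/download/imdb-dataset-2017-10-11/imdb-dataset-2017-10-11.tar.gz"
archive, _ = urllib.request.urlretrieve(url)   # downloads to a temporary file
with tarfile.open(archive) as tar:
    tar.extractall(path="data/imdb")
```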

As long as the dataset is in public domain, or has a license that supports sharing, it should fit just right on archive.org.

As I said earlier, I wouldn’t worry about the size of the datasets - they host some data that’s many, many TBs big.


(Stas Bekman) #9

http://academictorrents.com/ is another place, with 26TB of datasets as of this moment. But I’d trust archive.org much more long term - it’ll be around long after academictorrents is gone. Still, academictorrents might be good as a backup.

edit: please scratch that suggestion - it won’t work because it provides no direct links, and requires torrent/magnet downloads.


#10

Interesting - why is that? I have known about their “Wayback Machine” for a long time, but I wouldn’t bet they are willing to host many large datasets (storage and bandwidth can be quite costly).


(Stas Bekman) #11

I meant that between a small private endeavor and the huge, established archive.org, the latter has a better chance of surviving.

Plus, you can see that it already supports datasets as shown in my earlier post here.

According to its wiki: “As of October 2016, its collection topped 15 petabytes.” So what’s another few TBs :wink:

That’s why I thought it would be worth trying.


#12

Yes, that’s a good point, although their primary mission is archiving the web through continued crawling. If people start using it as a means of sharing large datasets, my intuition is that when they need to reduce costs, this will be the first thing they cut.


(Vu Ha) #13

How about using unit tests (with pytest, for example) as a way to gently start the documentation effort? When refactoring a function such as compose from a notebook into a module, it may take only a small effort to write a unit test that gives users a sense of what compose does.
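
For example, something like this (just a sketch - the compose here is a hypothetical left-to-right function chainer, not necessarily the exact one in the notebooks) both checks the behaviour and shows a reader how it is meant to be used:

```python
# Sketch of pytest-style tests documenting a hypothetical `compose`
# (assumed here to chain single-argument functions left to right).
from functools import reduce

def compose(*funcs):
    """Chain single-argument functions: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)

def test_compose_applies_functions_in_order():
    add_one = lambda x: x + 1
    double = lambda x: x * 2
    assert compose(add_one, double)(3) == 8    # (3 + 1) * 2
    assert compose(double, add_one)(3) == 7    # (3 * 2) + 1

def test_compose_with_no_functions_is_identity():
    assert compose()(42) == 42
```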


(Jeremy Howard) #14

We’ve decided today that we’re going to have a go at using Jupyter notebooks as the source of our docs, plus some custom conversion tools to turn them into proper hyperlinked documentation. More info coming soon!
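
The custom tools aren’t described here, but as a rough illustration of the general idea, Jupyter’s existing nbconvert can already turn a notebook into a web page, which custom tooling could then enrich with hyperlinks (the paths below are placeholders):

```python
# Illustration only: convert a notebook to HTML with Jupyter's nbconvert API.
# The actual fastai doc tools are custom and not shown here; paths are placeholders.
import nbformat
from nbconvert import HTMLExporter

nb = nbformat.read("docs_source/example.ipynb", as_version=4)
body, resources = HTMLExporter().from_notebook_node(nb)

with open("docs/example.html", "w", encoding="utf-8") as f:
    f.write(body)
```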


(Stas Bekman) #15

Update: as of Aug 12, many of the datasets we use are already on archive.org (e.g. CIFAR-10 has its own entry page), along with many other datasets.

When you go to an entry page like the one for CIFAR-10, click on “Show All” to get the complete file listing - that’s where you find the direct link to the dataset archive, which should stay the same for all datasets.