Fastai-nbstripout: stripping notebook outputs and metadata for git storage

When I was looking at this problem a couple of years ago, https://gab41.lab41.org/commit-and-push-to-github-from-jupyter-notebooks-579f5743a50b looked pretty interesting.

I don’t remember if we got it working or not though.

Good thinking.

tools

Nope.

I think for the docs we’ll simply create a separate folder for build artifacts, and have a script that puts built versions of them in there. We can have a separate gitignore in that directory to turn off the filter. Sound OK?

@stas I tried to use the tool but found no nbstripout.py. Was this file defined by yourself or should it come integrated in the package?

See earlier in thread for the pip install.

Saw the pip install, ran the commands but not sure how to run @stas 's test.

Used:
cat 002_images.ipynb | /home/user/anaconda3/envs/fastai/bin/nbstripout > OUT.ipynb

but OUT.ipynb and 002_images.ipynb are identical (which means nbstripout did not run properly).

How can I make nbstripout actually run over 002_images.ipynb? I did not find any Python file after the install.

Just a suggestion: save notebooks in git

  1. as html with outputs

  2. as notebooks without outputs

Once you install nbstripout, the installed command drops the .py extension. I was just adjusting the tool, so I invoked its original .py version directly.

The cat is there to avoid overwriting, as when you call nbstripout file.ipynb it overwrites the original.
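To illustrate why the pipe form is safe: a filter that reads the notebook from stdin and writes the stripped copy to stdout never touches the original file. This is only a minimal sketch of that pattern, not nbstripout's actual code. The names strip_outputs and run_filter are made up for the example, and it assumes nothing beyond the standard nbformat 4 JSON layout.

```python
import json
import sys


def strip_outputs(nb):
    """Clear outputs and execution counters of every code cell (hypothetical helper)."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb


def run_filter(infile=sys.stdin, outfile=sys.stdout):
    """Read a notebook from infile and write the stripped copy to outfile.

    The input is only read, so the original .ipynb is never overwritten.
    """
    nb = json.load(infile)
    json.dump(strip_outputs(nb), outfile, indent=1, ensure_ascii=False)
    outfile.write("\n")
```

If this were saved as a script that calls run_filter(), it would be used the same way as in the command above: cat the notebook into it and redirect stdout to a new file.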

Give me a little bit of time, I’m going to make the modified version part of the fastai_v1 repo and then you can try again.

Well, if they are all autogenerated by a function, it should be trivial to insert hidden html markup at the beginning of each output cell, say <fastaidoc />, and then instrument nbstripout to not delete any output cells starting with this tag.

from IPython.display import Markdown, display
def show_doc_from_name(*args, **kwargs):
    # mark the cell's output as a doc output:
    display(Markdown("<fastaidoc />"))
    ...  # the rest of the code
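On the stripping side, the tag check could look roughly like the sketch below. It is not what nbstripout does today. The helper names are hypothetical, and it assumes the marker ends up in the "text/markdown" entry of the cell's first output, which is how a Markdown display is normally stored in the notebook JSON.

```python
MARKER = "<fastaidoc />"  # tag from the snippet above


def _as_text(value):
    # nbformat stores multi-line strings either as one str or as a list of str
    return "".join(value) if isinstance(value, list) else value


def is_doc_output(outputs):
    """True if the first output of a cell is the markdown marker (assumed layout)."""
    if not outputs:
        return False
    data = outputs[0].get("data", {})
    return _as_text(data.get("text/markdown", "")).strip().startswith(MARKER)


def strip_unmarked_outputs(nb):
    """Clear the outputs of every code cell that does not start with the marker."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        if not is_doc_output(cell.get("outputs", [])):
            cell["outputs"] = []
        cell["execution_count"] = None
    return nb
```

Marked cells keep all of their outputs, not just the marker, so anything displayed after the tag survives the strip.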

An alternative approach would be to tap into the ipython API and use the cell’s metadata entry to put a special flag that it’s a docstring, and then make nbstripout not strip out such outputs (and keep the metadata entry for such cells as well). I haven’t yet looked at how to approach that from ipython API, but it should be doable as ‘Collapse Headers’ plugin does that. So we would have under git:

  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": { "docstring": "true" },
   "source": [ "show_doc_from_name(...)" ],
   "outputs": ["docstrings"]
  }

edit: I checked how to do this, need to add metadata arg to display, e.g.:

display("great document string here", metadata={"docstring":"true"})
print("some more docs")

now we can change fastai-nbstripout to keep those output cells in, by looking for the special metadata.

The metadata will only be set in the first part of the output cell, but that should be enough to keep the rest.
Alternatively, this metadata can be set in every display call whose output we want under git, so that only the outputs that have that metadata set are kept. I.e. it’d require:

display("great document string here", metadata={"docstring":"true"})
display("some more docs", metadata={"docstring":"true"})

note: all display_(png|svg|html|etc) support metadata arg.
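For the per-output variant, the filter could be as small as the sketch below. The function name and its parameters are hypothetical, and it assumes the metadata passed to display shows up in the "metadata" dict of the corresponding output entry in the notebook JSON.

```python
def keep_flagged_outputs(cell, key="docstring", value="true"):
    """Drop every output of a code cell whose metadata lacks the flag.

    Non-code cells are returned unchanged.
    """
    if cell.get("cell_type") != "code":
        return cell
    cell["outputs"] = [
        out for out in cell.get("outputs", [])
        if out.get("metadata", {}).get(key) == value
    ]
    cell["execution_count"] = None
    return cell
```

Note that stream outputs produced by print carry no metadata at all, so with this variant they are always dropped. Only outputs from display calls that pass the flag would stay under git.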

If you want them in a separate dir, that’s OK too, but you’ll still have collision issues from a bunch of committed entries like execution counts and metadata, which are useless. So it’d still be very beneficial to strip the unnecessary parts.

The way I see it now, either

  1. we have two stripped-out versions in two separate folders
  • strip out all but source cells for code notebooks
  • keep source and output cells for docs notebooks
  2. or we have one nbstripout setup that is content-aware and does not strip out docs output cells (as long as it has some pattern to match to tell docs from non-deterministic outputs, or uses a metadata entry).

Please ‘git pull’ your repository and, if you had nbstripout configured already, please remove the nbstripout hooks by running from within the repo:

nbstripout --uninstall

now we will be using our own version of the tool.

Changes added:

  • add a custom version of nbstripout as tools/fastai-nbstripout which strips out what nbstripout does, plus other noisy (nb and cell-level) metadata.
  • instrument git to use tools/fastai-nbstripout at diff/commit (added .gitconfig and .gitattributes).
  • add a new docs/dev.md for developer notes, with instructions on what developers need to do to make tools/fastai-nbstripout work behind the scenes.

Important! Please see: https://github.com/fastai/fastai_v1/blob/master/docs/dev.md
Unfortunately, git’s security prevents us from having this process fully automated and requires one extra command run on checkout to tell git to trust the local .gitconfig brought from the git repo. Full details are in the link above. tl;dr version, run once per checkout:

cd myrepo
git config --local include.path '../.gitconfig'

Please let me know if this causes any problems to anyone, or if you can think of an even more automated way.

And a request to all who merge PRs to point future contributors to the link above so that they submit clean diffs. Thank you!

It’d have been nice to be able to add a note above https://github.com/fastai/fastai_v1/pulls with instructions for creating PRs. But it doesn’t seem to be possible. I think we will eventually have easy to follow dev docs.

@lesscomfortable, please let me know whether this now works for you (i.e. the custom tools/fastai-nbstripout).

@jeremy, could you please rerun the .ipynb files through the latest incarnation of tools/fastai-nbstripout, as it’ll now strip other metadata that nbstripout didn’t handle before. I’m asking you so as not to step on anybody’s toes, since you know which files are “safe” to commit. Thank you.

Also if you have good tips/processes relevant for fastai_v1 dev that could go into docs/dev.md please share (but please start another thread, so that we could focus on clean commit/merge process here. Thank you!)

Hi Stas, tried it and it works (OUT.ipynb is 1053 lines long and 002_images.ipynb is 1191 lines long).

Do you want me to check if some specific metadata is correctly filtered out?

Awesome!

At the moment there should be no cell-level metadata remaining at all. There are a few remaining entries in the nb-level metadata (very end of the notebook). I believe those will be identical for all users.

If you find anything else that’s transient in nature and doesn’t need to be under git, please let me know.
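If anyone wants to check a stripped notebook for leftovers, something like this rough sketch should do. The function name metadata_report is made up for the example, and it only assumes the standard nbformat 4 JSON layout with a notebook-level "metadata" dict and one "metadata" dict per cell.

```python
import json


def metadata_report(nb):
    """Collect the metadata keys still present in a notebook dict."""
    cell_keys = set()
    for cell in nb.get("cells", []):
        cell_keys.update(cell.get("metadata", {}).keys())
    return {
        "notebook": sorted(nb.get("metadata", {}).keys()),
        "cells": sorted(cell_keys),
    }


def report_file(path):
    """Load an .ipynb file and return its metadata report."""
    with open(path, encoding="utf-8") as f:
        return metadata_report(json.load(f))
```

For a fully stripped code notebook, report_file("OUT.ipynb")["cells"] should come back as an empty list, and the "notebook" entry lists whatever nb-level keys are left.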

For the docs notebooks, it’s not only the auto-generated cells that we’ll need to keep the output of, but every cell added by the person writing the doc: the idea is to include pictures, code examples, more markdown, links to videos, etc. And we’ll need all the outputs since that’s what gets converted into html (most of the time, the input will be hidden).
The tools to auto-generate the documentation will also auto-execute the notebooks: we use a function to get the doc strings of the classes/functions because it might change (or the number of arguments can change) since the last time the user wrote the corresponding notebook.

So I’m not sure how the strip-out can work around this.

Thank you for providing more input, @sgugger

Well, then as I suggested above, use a different .gitconfig in each sub-dir and add a flag to tools/fastai-nbstripout to handle #2:

1. strip out all but source cells for code notebooks
2. keep source and output cells for docs notebooks

It’ll still be much better for group collaboration to strip out everything else, so the docs notebooks, in addition to source cells, will have outputs in git from the get-go.

There is a bit of metadata too that we’ll need to keep for the doc notebooks since the extension to hide input cells works by adding a flag hide_input:true in the metadata of the cell. I think that’s all, but we’ll know for sure when I’ve finished developing the whole thing (hopefully today).

When it’s ready please send me (1) a docs notebook that I can test with and (2) example(s) of a stripped out cell that has all the parts that need to be kept, and I will work on adjusting tools/fastai-nbstripout to support the special needs of docs notebooks. That is, if we agree on having those in a separate folder. Thanks.

The three notebooks in docs named fastai_v1… are samples of what the doc notebooks would look like. It’s best to see them with the nbextension hidden cells activated.

In each cell the attributes ‘source’ and ‘outputs’ should be left untouched, and in metadata, the attributes ‘hide_input’ and sometimes ‘trusted’ should be left too (this one will change for security reasons, but I’d like to keep it for now, as I’m trying to figure out if there is a way to have a fastai_v1 signature that would, once properly set up, automatically trust those notebooks). The rest (which is only ‘execution_count’ or other fields in the metadata) can be removed.
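Spelled out as code, the rules above would be something like the sketch below. This is just an illustration of the rules, not the code in tools/fastai-nbstripout. The names KEEP_CELL_METADATA and clean_docs_cell are hypothetical, and it assumes the standard nbformat 4 cell layout.

```python
KEEP_CELL_METADATA = ("hide_input", "trusted")  # flags the docs notebooks rely on


def clean_docs_cell(cell):
    """Strip a docs-notebook cell down to what should live under git.

    'source' and 'outputs' are left untouched; only the whitelisted metadata
    keys survive, and the cell's execution counter is reset.
    """
    meta = cell.get("metadata", {})
    cell["metadata"] = {k: meta[k] for k in KEEP_CELL_METADATA if k in meta}
    if cell.get("cell_type") == "code":
        cell["execution_count"] = None
    return cell
```

Any counters stored inside the individual outputs are deliberately not touched here, since the outputs are meant to stay exactly as they are.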

Hope that’s clear enough.
The process to strip the notebooks automatically is really smooth and I’ll probably ask for your help once the doc scripts are all finished to do something similar with them (to automatically execute the notebook and convert it to html).

@stas many thanks. I’ve stripped all the notebooks in dev_nb. I’ve also moved the info from docs/dev.md to CONTRIBUTING.md, since I think that’s the standard github path and is shown on their UI: https://help.github.com/articles/setting-guidelines-for-repository-contributors/

Thank you, Jeremy, for running this and relocating the instructions into a better location!

Going to work on docs notebooks stripping now, thanks to @sgugger’s input.

I made the changes to fastai-nbstripout according to your instructions and added the git instrumentation to activate the docs mode.


I think you will need to force a run on the previously committed files for the changes to apply.

FYI after this commit docs/.gitattributes overrides the repo-global .gitattributes for *.ipynb files found under docs/.

Currently all of the output cells’ metadata is preserved. If you see anything that can be dropped, or needs to be kept, please let me know.

Thank you for confirming that it works well, @sgugger. Yes, if you need anything give me a shout.

Just stripped the docs nbs.
