Fastai-nbstripout: stripping notebook outputs and metadata for git storage

Unless I’m missing something obvious, once a notebook is committed it currently becomes very complicated to commit changes to it (both for PRs and for those with commit access). This is because we have all cell data committed to the repository, not just the source cells.

I understand the reason for having output cells committed - it’s a convenience while teaching, or even while reading, if you don’t necessarily want to run the notebook to see what it does. No objections here.

But there must be a better way for us to submit changes to the aforementioned notebooks.

Currently my process is:

  1. make changes in the notebook to my satisfaction and see it working - save the .ipynb from the jupyter interface
  2. open an unmodified copy of the notebook in an editor - find the changes and manually copy them into that notebook (a big pain in the ass, since you have to tweak the json by hand - OK when you just change a few words, a pain if you add/remove lines or cells) - save.
  3. load the modified notebook and test that it works.
  4. since jupyter will have overwritten the notebook in the meantime, force a re-save of the version open in the editor
  5. shut down the kernel so that it won’t overwrite the notebook
  6. git diff (to validate again)
  7. git commit

Granted I could turn auto-saving off, but it doesn’t make things much easier.

Here are some things I have been considering to solve this problem.

1. nbdime

nbdime is a fantastic tool for working with notebooks. Currently I have it configured to ignore everything but the source cells, so when I do nbdiff (or configure git diff to invoke nbdiff), I get a cool diff of just the source cells, ignoring all the metadata, outputs and execution counts.
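
For anyone wanting to reproduce this, here is roughly the kind of setup I mean (just a sketch - the notebook names are placeholders, and you should check nbdiff --help on your version in case the flags differ):

# show only the source-cell differences between two notebooks
nbdiff -s old_version.ipynb new_version.ipynb

# register nbdime as the git diff driver for *.ipynb (optional)
nbdime config-git --enable --global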

Except this doesn’t help with commits. I can see the diff of my source-cell modifications, but when I commit, all the other cells that got modified (and jupyter pretty much changes all of them every time you run the notebook) get committed too. Not good!

I suppose I could use nbmerge:

Merge two Jupyter notebooks “local” and “remote” with a common ancestor “base”.

but then I still can’t re-test the merged notebook, unless I make sure that jupyter doesn’t overwrite it while I’m testing it.

I like the autosave function, and it’s useful all the time - except while testing a change that’s about to be committed.

At the very least it’d be good to have a toggle button to turn Autosave On/Off on demand.

2. Adding cell-level granularity to jupyter notebook’s save function

Then I was thinking that perhaps jupyter’s save/autosave functionality could be modified to let the user configure what cells get saved.

So, for example, we could then have just the source and output cells under git, with the rest ignored. This would already simplify things a bit.

But if there were a way for me to tell jupyter notebook to overwrite only the source cells, it’d make the PR/commit process so much simpler. You’d just modify the notebook in jupyter and commit it as is, since only source cells would ever change (or get added/deleted).

While I think this would be relatively easy to implement for changes within source cells, I have a feeling that it’d be very difficult to implement for adding/removing/moving source cells, as that would require some complicated tracking of the original and the final result, and it’d be a complex merge. So I’m not quite sure this can be automated.

3. Keeping just the source cells in git

And of course all these problems would go away if we were to only ever keep the source cells under git (nbstripout on commit, or something similar). Then one could just edit the notebook in jupyter and commit right away (with the same nbstripout setup).

As I mentioned at the beginning of this post, I understand why the outputs are under git, but it wastes a significant amount of development time and is error-prone. Perhaps we can think of a different solution for keeping the outputs while still keeping the source cells separate?

Thoughts?

And of course, if you have an easier process for committing notebook changes (it must include a pre-commit testing run), please kindly share.

Thank you.

p.s. these links look relevant to the subject matter:

and here is a fresh example of day-old commits which change all the cells while only improving comments:


Not reviewer-friendly at all; it’s a big cause of conflicts and leads to a much more complicated merge process :frowning:
(and nothing personal against @lesscomfortable, this was just an example to illustrate the point).

9 Likes

It sounds like it’s going to get even more complicated if there are multiple outstanding PRs, which will result in merge conflicts. Probably a crazy idea, but is there any possibility of only saving the source cells and using a build process to create the output cells? I know Jeremy has said during the course videos and elsewhere that he hasn’t necessarily run the cells top to bottom, so if that’s still going to be the case going forward, then this wouldn’t work.

1 Like

I’m hoping this problem will be much reduced as we move to having more things happening in parallel, with much of that effort on the final python modules, rather than the intermediate nbs. Nonetheless, it is a problem worth trying to solve!

That’s the only real solution I can see. It can’t be a server-side hook, since Github doesn’t have our data folders or a GPU to run on. So it would probably be something we just run manually or semi-automatically on our own machines to ensure that the “build artifact” version of the notebooks (in some different folder) is updated.

That’s for something different - these ones I run top to bottom.

1 Like

FYI this step is easier if you just copy the cells directly between notebooks in the jupyter web interface (you can ctrl-c ctrl-v cells between notebooks). No need to edit the json directly, unless I’m misunderstanding something.

Also, I generally copy the notebook I’m working on to tmp_{name}.ipynb and work on that (and I put tmp* in .gitignore). Then I can copy over the changed bits when I’m done. (I haven’t been doing this for fastai_v1 so far, but often do it for the course notebooks).

3 Likes

That part is not a problem. The problem is all the non-source attributes and outputs that end up different when you copy the whole cell (via jupyter or by hand). So if you commit that, you end up pulling those in as well, and not just the code difference.

So far I’ve been manually plucking the modified source code and replacing the corresponding cells in a 'git pull’ed master copy. And when I added a cell, I’d add the complete cell structure to the notebook that is going to be committed. This takes an immense amount of work and concentration. So if someone looks at my PRs, they are code-only (e.g. here is a diff example).

I’m sure the reviewer has a much easier time reviewing and approving such changes, and they create minimal potential for conflict if someone works on the same notebook in parallel. Compare this to the example I showed earlier.

1 Like

Perhaps you’d consider, at the very least, switching to the source-cells-only-in-git approach for now, while everything is changing, and worry about generating teachable notebooks later (or even fall back to having the outputs in, but later)?

And if you agree to that, we need to provide an easy-to-copy git hook setup to strip all but the source cells, which anybody who would like to contribute can set up on their machine. Or perhaps such instrumentation can be built into the fastai_v1 repository (I don’t know whether this can be done), so it will just happen as anybody commits (given they have nbstripout or similar installed; otherwise it’d fail, telling them to install it).

AFAIK, git won’t commit a file that you haven’t explicitly added, so you can use any name for it and not worry about it. Though your convention is a good one and helps with git status indicating any real files that may need to be committed.

That makes sense.

Such a tool could also set ‘collapsed’ to True, and remove the execution order numbers. That would remove nearly all the sources of conflict, right?

Would we use this hook?: https://github.com/git/git/blob/master/Documentation/githooks.txt#L192

(BTW, this process will also be important for doc authors, since we’re going to try to use notebooks for doc authoring too.)

That’s correct, based on my reading on the subject. I will now try to implement it.

Which of the two collapsed settings are you talking about?

This is output collapsed (e.g. output image hidden):

   "metadata": {
    "collapsed": true
   },

It’d become irrelevant since we won’t have outputs under git.

And this is a section (heading) collapsed:

   "metadata": {
    "heading_collapsed": true
   },

We will be stripping out all the metadata; otherwise there will still be conflicts - who can possibly remember which sections were collapsed and which weren’t when they started working on the notebook? Unless, of course, we re-add those with a fixed state, "heading_collapsed": true, all the time. Let me experiment with that.

Also personally I find that it’s much easier to work with the whole notebook unrolled, since you can’t Ctrl-F search inside collapsed sections.

I use the ‘Collapsible Headings’ extension, which can collapse/uncollapse the whole notebook (or individual sub-sections) in one click (Ctrl-Shift-Left / Ctrl-Shift-Right).

And another thing to think about is (not) saving the state of locally enabled extensions. e.g. I use Table of Contents when working with large notebooks, but this metadata probably shouldn’t go under git since it’s a personal choice.

I will now work on setting up the git hook and post back once I have a working setup.

edit: see next post for a one-liner install

So do this:

# install nbstripout
pip install nbstripout

# check it's in the path:
which nbstripout

# switch to the repository you want to work in
cd fastai_v1/

# add to .gitattributes or .git/info/attributes:
*.ipynb filter=nbstripout

# these will modify .git/config
git config filter.nbstripout.clean "$(which nbstripout)"
git config filter.nbstripout.smudge cat
git config filter.nbstripout.required true

You’re all set for this repository. Now your commits will be run through nbstripout, while your local notebooks are left unmodified.
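
To sanity-check that the filter is actually wired up, you can do something like this (the notebook name is just an example):

# confirm that git applies the nbstripout filter to notebooks
git check-attr filter -- 002_images.ipynb
# expected: 002_images.ipynb: filter: nbstripout

# confirm the clean command that was registered
git config --get filter.nbstripout.clean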

Let’s look at 2 commits using this setup:

  1. the first one stripped most of the non-source data, so it’s big:
  2. the second commit, into the now-stripped notebook, is clean and simple:

For “cell_type”: “code” cells it strips everything but “source” (the other fields remain, but are emptied):

  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = y"
   ]
  },

For “cell_type”: “markdown” cells it doesn’t strip the collapsed-headings info (so if the committer changed it from its previous setting, that will show up as a change in git).

  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "## Combine"
   ]
  },

but perhaps we should make this consistent - all collapsed or all non-collapsed.

In addition to the commit filter, you probably want to set up git diff to go through the same stripping, so that you see what will actually be committed. To instrument that, do:

# add to .gitattributes or .git/info/attributes:
*.ipynb diff=ipynb

# this will modify .git/config
git config diff.ipynb.textconv "$(which nbstripout) -t"
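
With that in place, a plain git diff on a modified notebook should show only the source changes. For example (the notebook name is just an example):

# diff a locally modified notebook through the nbstripout textconv filter
git diff -- 002_images.ipynb

# verify the textconv command that was registered
git config --get diff.ipynb.textconv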

The only thing I’m not sure about yet is how we can make this setup part of the repository so that everybody uses it. I’d imagine it shouldn’t be too difficult to have nbstripout installed at the same location - or perhaps it can be made part of the repository (especially if we decide to do some tweaks to it).

4 Likes

Somehow I missed the automatic config; all you need to do to configure both the diff and the commit filters is:

cd fastai_v1/
nbstripout --install

And if down the road we want to re-add outputs, we just pass --keep-output to nbstripout. I think it’d be an excellent choice not to keep execution_count and most of the other metadata under git.
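
For reference, here is a sketch of the variations I have in mind (the flag names are from my reading of the nbstripout docs, so double-check them on your installed version):

cd fastai_v1/

# write the filter attribute to a tracked .gitattributes instead of .git/info/attributes
nbstripout --install --attributes .gitattributes

# if we later decide to keep outputs, point the clean filter at nbstripout --keep-output
git config filter.nbstripout.clean "$(which nbstripout) --keep-output"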

This would matter less if I could get the ctrl-shift-left/right things to work. I’ve tried setting them to different keys from the configurator and it still doesn’t work. I’ve updated the extension to the latest version and restarted jupyter. Did you do anything clever to make them work? I’ve checked the js console (nothing) and tried both chrome and firefox.

Make sure you’re in Command mode (so that if you’re focused on any cell, its frame is blue and not green; if it is green, hit Esc).

and check that your extension is pushing those shortcuts in:

Did it help?

p.s. jupyter doesn’t let extensions or custom user hooks use shortcuts in Edit mode (mostly, I think).

We will need extra stripping instrumentation; currently it doesn’t strip extension configuration. E.g. if I use ToC, I end up with:

-   "toc_position": {},
+   "toc_position": {
+    "height": "calc(100% - 180px)",
+    "left": "10px",
+    "top": "150px",
+    "width": "290.391px"
+   },
    "toc_section_display": "block",
-   "toc_window_display": false
+   "toc_window_display": true

This is not good. And the same goes for the Collapsible Headings extension - it adds a bunch of noise:

   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "heading_collapsed": true
+   },
    "source": [
     "# Fin"
    ]
@@ -1154,7 +1310,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "hidden": true
+   },
    "outputs": [],
    "source": []

I’m looking at how to tell nbstripout to kill it all… I filed a request here: https://github.com/kynan/nbstripout/issues/85

Alternatively we could fork nbstripout, modify it to our liking, commit to the fastai_v1 repository and use that. Probably renaming it first to avoid confusion with the normal nbstripout.

2 Likes

@stas I installed that now, and pushed some notebooks in stripped form. @313v @lesscomfortable @sgugger you folks should probably install it too.

2 Likes

This won’t work with the documentation notebooks though, because we want the outputs of those for the html conversion.

1 Like

Yeah that’s true - although maybe for the initial dev it’s OK.

First, I think we should be able to customize nbstripout to have complete control over what goes in. We will just need to modify it to suit our needs.

Second, there is no problem with outputs if they are deterministic and don’t change when the notebook is run multiple times. Can you share some of the docs ideas you’re thinking about - perhaps a small notebook we can play with? I’m thinking that if there is a specific pattern to those docs, then a modified nbstripout could leave the docs outputs in while dumping any other outputs. But I would love to see a docs notebook to be able to think about the feasibility of this path.

Thank you.

1 Like

If, inside the checked-out repo, you move .git/info/attributes to .gitattributes and commit that file, it will now require all committers to have this configured. But let’s wait while we sort out the details, and probably place a custom nbstripout into the repository. Should we have tools/, utils/ or build/ under fastai_v1/ for the helper utils?
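
i.e. something along these lines (a sketch; it assumes the filter attribute currently lives in .git/info/attributes, as in the setup above):

cd fastai_v1/
# promote the per-clone attributes file to a tracked one
mv .git/info/attributes .gitattributes
git add .gitattributes
git commit -m "require the nbstripout filter for *.ipynb"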

Well, it proved trivial to modify nbstripout to strip the other metadata inserted by extensions (at least for the ones I use; I guess we can add more if we find other people using other extensions):

diff --git a/nbstripout.py b/nbstripout.py
index ef16ff3..c67c080 100755
--- a/nbstripout.py
+++ b/nbstripout.py
@@ -152,6 +152,8 @@ def strip_output(nb, keep_output, keep_count):

     nb.metadata.pop('signature', None)
     nb.metadata.pop('widgets', None)
+    nb.metadata.pop('toc', None)
+    nb.metadata.pop('varInspector', None)

     for cell in _cells(nb):

@@ -189,7 +191,7 @@ def strip_output(nb, keep_output, keep_count):
             if output_style in cell.metadata:
                 cell.metadata[output_style] = False
         if 'metadata' in cell:
-            for field in ['collapsed', 'scrolled', 'ExecuteTime']:
+            for field in ['collapsed', 'scrolled', 'ExecuteTime', 'heading_collapsed', 'hidden']:
                 cell.metadata.pop(field, None)
     return nb

So, once you get a chance to share what you plan for the docs, I’d be happy to try to keep the doc-specific outputs from getting stripped (but we will need some kind of pattern to match to make an exception).

I think this should be stripped out as well, since this metadata could vary from person to person (3.6.5 vs. 3.6.6, etc.):

  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }

Do you see anything else in your stripped out version that we need to strip out?

To check, run something like:

cat 002_images.ipynb | /tmp/nbstripout/nbstripout.py  > OUT.ipynb

then look mainly in the very last section of OUT.ipynb - the global metadata entries.
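
If you have jq installed, a quick way to inspect what survived the stripping (just a convenience, not required):

# list the top-level notebook metadata keys left after stripping
jq '.metadata | keys' OUT.ipynb

# or dump the whole global metadata block
jq '.metadata' OUT.ipynb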

1 Like