Fastai-nbstripout: stripping notebook outputs and metadata for git storage

stas · August 9, 2018, 9:43pm

I’ve noticed that this strip out filter is causing merging/stashing problems:

git stash
Saved working directory and index state WIP on master: c945e18 - cleanup duplicated imports - use imported shortcut PIL.Image -> Image

git pull origin master
From github.com:fastai/fastai_v1
 * branch            master     -> FETCH_HEAD
error: Your local changes to the following files would be overwritten by merge:
        docs/fastai_v1.nb_001b.ipynb
        docs/fastai_v1.nb_002.ipynb
        docs/fastai_v1.nb_002b.ipynb
Please commit your changes or stash them before you merge.
Aborting

Note I haven’t touched those files under docs/.

I think git stash relies on git diff to see what changes need to be stashed, and since git diff w/ the filter returns nothing, nothing gets stashed. And then git pull sees a conflict and aborts. Any idea how to resolve that?

I ended up overwriting docs with git checkout docs, but perhaps there is a better way.

stas · August 9, 2018, 9:50pm

Also you may have noticed that nbstripout is slow. This article suggests using jq, a lightweight and flexible command-line JSON processor, which is supposed to be much faster.

The author gives an example of the setup he uses right now to accomplish something similar to nbstripout:

jq --indent 1 \
    '
    (.cells[] | select(has("outputs")) | .outputs) = []
    | (.cells[] | select(has("execution_count")) | .execution_count) = null
    | .metadata = {"language_info": {"name":"python", "pygments_lexer": "ipython3"}}
    | .cells[].metadata = {}
    ' 01-parsing.ipynb

sgugger · August 9, 2018, 9:53pm

Had the same problem, and basically removed everything before re-cloning.
Is it a one-time-thing because they were stripped recently?

jeremy · August 9, 2018, 9:58pm

The speed issue seems to be because nbformat/reader.py imports json. If you instead import ujson as json it’ll go much faster.

A simple python file using dict comprehensions and ujson may get nearly all the way there.

stas · August 9, 2018, 10:50pm

I understand that you’re suggesting to rewrite nbstripout from scratch. I have a feeling the tricky part would be to keep the output format identical to nbformat.

For the heck of it I tried to hack ujson in place of json (since I can’t change nbformat/reader.py). But it didn’t work - as ujson and json are different. I added in fastai-nbstripout:

import ujson
sys.modules['json'] = ujson

Traceback (most recent call last):
  File "tools/fastai-nbstripout", line 120, in <module>
    from nbformat import read, write, NO_CONVERT
  File "/home/stas/anaconda3/envs/pytorch-dev/lib/python3.6/site-packages/nbformat/__init__.py", line 15, in <module>
    from . import v2
  File "/home/stas/anaconda3/envs/pytorch-dev/lib/python3.6/site-packages/nbformat/v2/__init__.py", line 27, in <module>
    from .nbjson import reads as reads_json, writes as writes_json
  File "/home/stas/anaconda3/envs/pytorch-dev/lib/python3.6/site-packages/nbformat/v2/nbjson.py", line 31, in <module>
    class BytesEncoder(json.JSONEncoder):
AttributeError: module 'ujson' has no attribute 'JSONEncoder'

Reading some reports, it’s suggested that ujson is quite buggy. Instead it’s been suggested to use rapidjson. Here are some recent benchmarks.

jeremy · August 9, 2018, 10:58pm

I’ve spent much less time thinking about this than you, so I’m happy to take your word for it. I can’t see that jq would make this any easier…

I’m always a little out of date…

stas · August 9, 2018, 11:02pm

I’m not quite sure yet. This is a totally new setup for me and I have been trying to wrap my head around it. I think we are bound to hit ‘no free lunch’ here - since there will be always a merge if we commit only parts of the file and it has changed since it was checked out, so we can’t avoid a merge.

I think in this situation we need to disable .gitattributes filters before doing git stash, so that git stash will be able to make a clean diff without any filters interfering. In my very brief experience with git, it appears that git pull must succeed after git stash and this is not the case right now.

So one easy way would be to run:

git config --local --unset include.path

which would turn off the stripping out setup. Then do the update/merge and then re-enable the stripping out again:

git config --local include.path '../.gitconfig'

If you find an easy solution please post. Meanwhile I’m experimenting with various ideas.

Thanks.

jeremy · August 10, 2018, 12:27am

I tend to just do a local commit, instead of stashing. You can always revert it, or choose different merge strategies as required after pulling. Is that an option for you? (Obviously it would be better if stash worked - just looking for compromises, if they’re needed.)

stas · August 10, 2018, 12:55am

Isn’t the default commit a local commit?

git commit

that’s the thing - git commit followed by git push aborts because HEAD has changed and trying to rebase the local repo via git pull aborts because it can’t merge.

Perhaps you mean something different by a local commit.

jeremy · August 10, 2018, 4:15am

I meant git commit, not followed by push. So if you git commit, then git pull, it works fine.

stas · August 10, 2018, 4:40am

I haven’t tried that. I will give it a try next time I encounter the same. Thank you for the suggestion, Jeremy.

stas · August 11, 2018, 6:17pm

At this moment the quoted approach seems to be a working workaround, Jeremy’s suggestion didn’t work. Observe:

$ git checkout my_branch
error: Your local changes to the following files would be overwritten by checkout:
        dev_nb/experiments/caltech101.ipynb
        dev_nb/experiments/comp_gs_numpy_torch.ipynb
        dev_nb/experiments/pipelines_speed_comp.ipynb
        dev_nb/experiments/random_resize_crop.ipynb
Please commit your changes or stash them before you switch branches.
Aborting

(note that I haven’t touched dev_nb/experiments/* - it thinks there are local changes since the version under git isn’t matching what it would have been after strip out)

$ git diff
$ git pull
Already up to date.
$ git config --local --unset include.path
$ git checkout my_branch
Switched to branch 'my_branch'
$ git config --local include.path '../.gitconfig'
$ git checkout my_branch
Already on 'my_branch'

and it works with stashing too in this situation (once the filter is disabled).

and now let’s fix the notebooks under dev_nb/experiments:

# force manual strip out
tools/fastai-nbstripout dev_nb/experiments/*ipynb
# disable the strip out filter to get git to see changes
git config --local --unset include.path
git commit -a
git push
# restore the filter
git config --local include.path '../.gitconfig'

Voila, this shouldn’t be a problem anymore.

There should be no need to disable the filter ever again (unless we change fastai-nbstripout and then we will need to rerun all notebooks through it manually (see code above) so that git version matches the local stripped out version).

stas · August 11, 2018, 9:45pm

Just to sum things up - I’m very happy that we facilitated the strip out process - it’s so much easier now to commit/PR changes to the notebooks! Thank you for supporting this change.

stas · August 14, 2018, 4:17am

I wrote a new stripout script tools/fastai-nbstripout-jq that uses jq. It works about 10-20 times faster than nbstripout.

@jeremy, can you please have a look and to tell me whether you’re happy with it, and whether we want to keep both versions or kill the python version or just no need to decide right now?

To switch to the new version:

apt install jq
cd fastai_v1
git config --local --unset include.path
git config --local include.path '../.gitconfig-jq'

alternatively, this is a universal version which will choose the right config depending on whether jq is installed or not, so it’s suitable for scripting:

git config --local include.path '../.gitconfig'$(if command -v jq >/dev/null 2>&1; then echo -n -jq; fi);

Here is a quick benchmark:

Running on 3 docs notebooks (on purpose in non-docs mode so that the filter will have to do some work of stripping outputs and some metadata):

cd docs
mkdir test1
mkdir test2
cp fa*.ipynb test1
cp fa*.ipynb test2

# identical inputs
diff -ru test1 test2

time ../tools/fastai-nbstripout test1/*

real    0m0.223s
user    0m0.192s
sys     0m0.032s

time ../tools/fastai-nbstripout-jq test2/*

real    0m0.016s
user    0m0.013s
sys     0m0.003s

# identical outputs
diff -ru test1 test2

This is about 10-20 times faster, and now I can’t really see any delays whatsoever when using git.

For posterity (and those who search for similar solutions) here are the 2 filters - one took a bit of a long trial and error to figure out:

### filter for doc nbs ###

# 1. reset execution_count
# 2. keep only certain cell metadata fields
# 3. keep only certain nb metadata fields

# to add more metadata entries to keep do:
# if (.key == "key1" or .key == "key2")

filter_docs='
   (.cells[] | select(has("execution_count")) | .execution_count) = null
  | .cells[].metadata |= with_entries(
      if (.key == "hide_input")
      then .     # keep the entry
      else empty # delete the entry
      end)
  | .metadata = {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}}
'

### filter for code nbs ###

# 1. reset execution_count
# 2. delete cell's outputs
# 3. delete cell's metadata
# 4. keep only certain nb metadata fields

filter_code='
    (.cells[] | select(has("execution_count")) | .execution_count) = null
  | (.cells[] | select(has("outputs")) | .outputs) = []
  | .cells[].metadata = {}
  | .metadata = {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}}
'

and to run:

jq --indent 1 "$filter" "$file"

jeremy · August 14, 2018, 2:05pm

Nice one! I definitely have been wanting git to be faster in fastai_v1, so this is valuable. But we also need to ensure that the ux for new devs is smooth - if they don’t have jq, we don’t want odd errors to appear in their console.

I know I’ve asked this before, and apologies if I didn’t understand your answer properly, but your jq solution makes me wonder again whether that same filter couldn’t be achieved nearly as easily with python and rapidjson (and for those without rapidjson installed, we can easily have it fallback to json with try/catch)? Do you think jq is significantly faster than rapidjson (if it is, then my suggestion doesn’t really help)?

stas · August 14, 2018, 4:33pm

Well, they will still need to install external libraries, so pip --install whateverjson, or apt install jq isn’t very different.

But I hear you. I have a few other things to deal with first and then I will give it a try.

I think also that current nbstripout doesn’t identify a critical path of doing its real work, so there are all kinds of things happening in it, that aren’t needed during the stripout. All that install/uninstall craft should really be isolated from its real application. all those small overheads add up over many notebooks. Ideally, github would be running such filter as a daemon, so everything is loaded once and there is no loading overhead for each notebook.

Thank you for checking it out, Jeremy.

jeremy · August 14, 2018, 5:41pm

I don’t think that’s quite true - json is stdlib, and we can have transparent fallback to it.

stas · August 14, 2018, 6:05pm

OK, so we weren’t on the same page. I thought you were thinking about a new python implementation using ujson/rapidjson.

Now you have clarified that you were thinking about a transparent fallback. As I said earlier it will take some workarounds since json and ujson for example don’t have fully identical API.

And as I wrote earlier, this is transparent fallback as well (albeit during config):

git config --local include.path '../.gitconfig'$(if command -v jq >/dev/null 2>&1; then echo -n -jq; fi);

They have to run this command line anyway on check out. So if they have jq they will get the faster version, otherwise it’ll be the slow one.

And now that you clarified what you had in mind I can look at implementing it.

Thank you.

jeremy · August 14, 2018, 6:16pm

You can get a dict from a json file using a consistent api between rapidjson and json afaict. The “incompatibility” section of their readme only mentions stuff that doesn’t seem to be needed for our purpose. So I’m thinking we just have something like:

try:
    import rapidjson as json
except:
    import json

d = json.load('...')
d = [munge(o) for o in d if do_keep(o)]

…and then munge does whatever is needed to change a cell, and do_keep returns True for those that shouldn’t be removed. Or something like that. I’d assume these would look quite similar to the jq filter you have above. Does that make sense?

BTW one issue to be aware of: the shell 1-liner isn’t going to work on Windows (other than WSL, which isn’t usable for CUDA), and @sgugger and I are both developing on that platform - and we’re trying to ensure we provide good Windows support since it’s widely used in some places still.

stas · August 14, 2018, 10:15pm

Yes, of course, but then we are talking about a complete rewrite, not adjusting nbconvert to support a faster json library (which is not possible w/o rewriting nbconvert which nbconvert depends on and heavily uses it). I will give it a try.

It’s not a problem, the bash one-liner is just for a convenience.

What shell do you use on windows?

Is jq a problem under windows?