jq works fine on Windows, thanks for checking. The shells available on Windows are PowerShell and cmd - most people use both. (There’s also bash, but it isn’t fully integrated with standard Windows tooling.)
@stas: I’ve not played with jq filters before, so I had a quick look to see how they compare to comprehensions. It was fun! I had a quick try at replicating your jq filter and I think this does the same thing (not well tested):
try: import rapidjson as json
except ImportError: import json

def clean_cell(o, code=True):
    # drop volatile fields; keep only the 'hide_input' metadata key
    o.pop('execution_count', None)
    o['metadata'] = {k:v for k,v in o['metadata'].items() if k=='hide_input'}
    if code: o.pop('outputs', None)
    return o

def clean_nb(s, code=True):
    s['cells'] = [clean_cell(o, code) for o in s['cells']]
    s['metadata'] = {k:v for k,v in s['metadata'].items() if k=='kernelspec'}

s = json.load(open('jsontest.ipynb'))
clean_nb(s, True)
(One minor difference - I figured it’s fine to leave ‘hide_input’ in code notebooks too.)
What do you think of this approach overall? I haven’t checked to see whether it’s actually performant, although I’m guessing it should be OK since the main time is presumably spent on json parsing, which rapidjson seems pretty good at.
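If anyone wants to sanity-check that assumption, a rough timing harness along these lines should do - just a sketch, with jsontest.ipynb standing in for whatever notebook you have handy:

import timeit, json, rapidjson

# read the raw notebook text once so only the parsing is timed
raw = open('jsontest.ipynb').read()

print('json:     ', timeit.timeit(lambda: json.loads(raw), number=100))
print('rapidjson:', timeit.timeit(lambda: rapidjson.loads(raw), number=100))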
Thank you for the code, Jeremy. That was useful. I’m on it.
First, ujson is out of the game since it doesn’t support the separators argument in dumps(), and its output format is different from what jupyter or json generates. rapidjson doesn’t support that argument either, but it generates json's default format, which matches what we need. Observe:
import ujson, rapidjson, json
i = '{ "a": [ {"c":1, "d": [] } ], "b": {} }'
print("\njson:", json.dumps(json.loads(i), separators=(',', ': '), sort_keys=True, indent=1))
print("\nrapidjson:", rapidjson.dumps(ujson.loads(i), sort_keys=True, indent=1))
print("\nujson:", ujson.dumps(ujson.loads(i), sort_keys=True, indent=1))
which gives:
json: {
 "a": [
  {
   "c": 1,
   "d": []
  }
 ],
 "b": {}
}

rapidjson: {
 "a": [
  {
   "c": 1,
   "d": []
  }
 ],
 "b": {}
}

ujson: {
 "a":[
  {
   "c":1,
   "d":[
   ]
  }
 ],
 "b":{
 }
}
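Side note, as far as I know the stdlib: since Python 3.4, json.dumps() already defaults to separators=(',', ': ') whenever indent is not None, so the explicit separators argument above should be redundant:

import json

i = '{ "a": [ {"c":1, "d": [] } ], "b": {} }'
# with indent set, spelling out the separators changes nothing
assert (json.dumps(json.loads(i), indent=1)
        == json.dumps(json.loads(i), separators=(',', ': '), indent=1))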
The new python version is here. @jeremy, @sgugger please kindly check that it works for you.
I assume you already have done:
git config --local include.path '../.gitconfig'
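(For context, that include wires up git clean filters. The general shape is something like this - the filter name and path here are illustrative, not necessarily our actual .gitconfig:)

[filter "fastai-nbstripout"]
        clean = tools/fastai-nbstripout
        required = true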
I will write more details in the next comment
tl;dr:
So, first, a surprise: this new implementation is actually some 25-30% faster with json than with rapidjson - I have no idea why, unless I made some mistake in the code.
It also beats jq by 40% or so. Whoa!
Here are the full benchmarks:
Benchmark the new nbstripout script once with json, then with rapidjson
rm -r json rapid
mkdir json rapid
cp dev_nb/*ipynb json
cp dev_nb/*ipynb rapid
# insert fake input that we will then strip - to actually test performance
perl -pi -e 's|"execution_count": null,|"execution_count": 99,|' json/* rapid/* jq/*
perl -pi -e 's|"outputs": \[\],|"outputs": [ { "data": { "text/html": [ "<br/>" ], "text/plain": [ "x" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "blah()" ],|' json/* rapid/* jq/*
perl -pi -e 's|"metadata": \{\},|"metadata": {"collapsed": true},|' json/* rapid/* jq/*
echo "new json"; time tools/fastai-nbstripout-fast json/*
echo "new rapidjson"; time tools/fastai-nbstripout-fast -r rapid/*
diff -ru json rapid
new json:
real 0m0.042s
user 0m0.038s
sys 0m0.004s
new rapidjson:
real 0m0.068s
user 0m0.054s
sys 0m0.014s
I will remove the -r option (which loads rapidjson instead of json) once others confirm as well that json beats rapidjson in this setup.
Benchmark the current nbstripout vs new python version vs jq script
rm -r json rapid jq
mkdir json rapid jq
cp dev_nb/*ipynb json
cp dev_nb/*ipynb rapid
cp dev_nb/*ipynb jq
# insert fake input that we will then strip - to actually test performance
perl -pi -e 's|"execution_count": null,|"execution_count": 99,|' json/* rapid/* jq/*
perl -pi -e 's|"outputs": \[\],|"outputs": [ { "data": { "text/html": [ "<br/>" ], "text/plain": [ "x" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "blah()" ],|' json/* rapid/* jq/*
perl -pi -e 's|"metadata": \{\},|"metadata": {"collapsed": true},|' json/* rapid/* jq/*
echo "current"; time tools/fastai-nbstripout json/*
echo "new"; time tools/fastai-nbstripout-fast rapid/*
echo "jq"; time tools/fastai-nbstripout-jq jq/*
diff -ru json rapid
diff -ru json jq
current:
real 0m0.613s
user 0m0.585s
sys 0m0.028s
new:
real 0m0.046s
user 0m0.037s
sys 0m0.008s
jq:
real 0m0.065s
user 0m0.055s
sys 0m0.010s
So this new script is now about 13 times faster than nbconvert and 1.4 times faster than jq (on the dev_nb/*ipynb notebooks).
I’m not surprised about nbconvert’s slowness - I looked through its code and saw that it loads a motherlode of things to support a gazillion different variations of various nb formats.
And of course the new script doesn’t do any of the install/uninstall handling - we already have it preconfigured.
Further, Jeremy suggested dropping argparse - but I will leave it as is for now. Unless someone beats me to it, I don’t see a reason to change that until we have so many notebooks that using git gets slow again.
If we really need very high performance, it’d probably be much faster to run nbstripout as a daemon and not re-start it for each file as git does now.
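To sketch that daemon idea - very rough, and strip_notebook() is a hypothetical stand-in for whatever the script does per file - start python once and feed it paths over a pipe:

import sys, json

def strip_notebook(path):
    # hypothetical: apply the same cleanup the script does, in place
    with open(path) as f: s = json.load(f)
    # ... strip outputs/execution counts/metadata here ...
    with open(path, 'w') as f:
        f.write(json.dumps(s, sort_keys=True, indent=1, ensure_ascii=False))

# one path per line until EOF - interpreter startup is paid only once
for line in sys.stdin:
    strip_notebook(line.rstrip('\n'))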
The only difference I noticed so far with the new implementation is that utf8 characters got encoded:
- fill:str='█'):\n",
+ fill:str='\u2588'):\n",
Original:
cat dev_nb/004b_alternative_progress_bar.ipynb | grep print_progress
"def print_progress(iteration:int, total:int, prefix:str='', suffix:str='', decimals:int=1, length:int=50, fill:str='█'):\n",
With old script:
tools/fastai-nbstripout -t dev_nb/004b_alternative_progress_bar.ipynb | grep print_progress
"def print_progress(iteration:int, total:int, prefix:str='', suffix:str='', decimals:int=1, length:int=50, fill:str='█'):\n",
With new script:
tools/fastai-nbstripout-fast -t dev_nb/004b_alternative_progress_bar.ipynb | grep print_progress
"def print_progress(iteration:int, total:int, prefix:str='', suffix:str='', decimals:int=1, length:int=50, fill:str='\u2588'):\n",
Not sure why: for some reason my output is ASCII and not UTF-8, even though I configured both stdout and the file write as utf-8 - odd.
Fixed: must pass ensure_ascii=False to json.dumps() to avoid UTF-8 characters being escaped to ASCII:
x = json.dumps(s, sort_keys=True, indent=1, ensure_ascii=False)
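A minimal demonstration of the difference:

import json
print(json.dumps('█'))                      # "\u2588" - escaped to ASCII
print(json.dumps('█', ensure_ascii=False))  # "█" - kept as-is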
Before I found this solution, this workaround did the trick (posted here for posterity):
x = json.dumps(s, sort_keys=True, indent=1)
x = x.encode('ascii', 'backslashreplace')
x = x.decode('unicode_escape')
Works great for me! I’ve replaced the old one and removed the jq one now.
Hey @stas,
I am trying the new script and am getting
Traceback (most recent call last):
File "tools/fastai-nbstripout", line 77, in <module>
s = json.load(input_stream)
File "/home/user/anaconda3/envs/fastai/lib/python3.6/json/__init__.py", line 299, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/home/user/anaconda3/envs/fastai/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/home/user/anaconda3/envs/fastai/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/user/anaconda3/envs/fastai/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 371 column 5 (char 11954)
Do you have any clue how to fix it? I would think it’s a bug but since you and Jeremy made it work fine I guess it might be another (local) problem.
Thanks!
@lesscomfortable try running it manually on a notebook. Does this happen on all notebooks? If not, which does it happen with?
What Jeremy said. Also, the trace shows that you’re sending JSON to the program via STDIN - what are you sending, and is it valid JSON? In order to help you, we need to know what data you’re having difficulty with. Specifically, test:
cat somenb.ipynb | tools/fastai-nbstripout
which is the mode in which it’s failing for you. But do test:
tools/fastai-nbstripout somenb.ipynb
as well.
The new script could probably use a more user-friendly assert - so having your problematic input would be a good way to improve it. Perhaps you had a failed merge, which could break the nb.
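Something along these lines would make the failure friendlier - just a sketch; input_stream is from the traceback above, and filename is a hypothetical variable holding the current file’s name:

import sys, json

try:
    s = json.load(input_stream)
except json.JSONDecodeError as e:
    # name the offending file and where its JSON breaks, then bail cleanly
    sys.exit(f"{filename}: not valid notebook JSON: {e}")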
Yes, the problematic input is 001a_nn_basics.ipynb. I tried
cat somenb.ipynb | tools/fastai-nbstripout
on four other notebooks and it works correctly. I have not edited 001a_nn_basics.ipynb locally, so I guess one of the last PRs created a JSON format problem in the file, but I can’t find the bug (the last commit to that file was 2 days ago).
Regarding
tools/fastai-nbstripout somenb.ipynb
I am getting
Traceback (most recent call last):
File "…/tools/fastai-nbstripout", line 73, in <module>
f.write(x+f.write("\n"))
TypeError: must be str, not int
And it deletes everything I had in the input file.
Do you figure these two problems are related?
Someone committed a broken 001a_nn_basics.ipynb - it is now fixed, please git pull. Thank you for catching that, @lesscomfortable.
That someone changed the notebook in a text editor and committed it without validating that it’s valid JSON (via loading it in jupyter notebook), and clearly without using the nbstripout filter, or it’d have croaked. Please let’s make sure that all committers use the filter.
BTW, instead of loading it in the jupyter notebook, there are all kinds of CLI tools to validate JSON, e.g. https://jsonlint.com/:
jsonlint-php nb.ipynb
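Python’s stdlib can validate too, with nothing extra to install:
python -m json.tool nb.ipynb > /dev/null && echo valid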
Do you figure these two problems are related?
Absolutely, it’s exactly the same problem - one reads from STDIN, the other from a file. So the other way should work now too.
The fact that nbstripout runs automatically when committing makes sure that we will find bugs of this type quickly.
I am still getting the same problem when running
tools/fastai-nbstripout somenb.ipynb
I get:
Traceback (most recent call last):
File "…/tools/fastai-nbstripout", line 73, in <module>
f.write(x+f.write("\n"))
TypeError: must be str, not int
And my file is empty after running it. Do you have any hypothesis on what might be causing this to happen?
I am still getting the same problem when running
tools/fastai-nbstripout somenb.ipynb
Yup, @jeremy broke it. It’s fixed now (git pull).
Jeremy, wrt your attempted change: I already considered that. write(x+"\n") would be very inefficient, since it needs to re-copy a huge chunk of data just to add "\n" to it - hence the two write calls, which I measured to be negligible compared to a single write call. Unless you know that str1+str2 is not a hard copy in the latest python. Speed is very critical in this script.
I originally used write("".join([x, "\n"])), but it looked ugly; join is supposed to not copy.
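For the curious, the three variants are easy to compare with a rough micro-benchmark (just an illustration, not what the script ships):

import io, timeit

x = 'x' * 10_000_000  # stand-in for a big dumped notebook

def two_writes(): f = io.StringIO(); f.write(x); f.write('\n')
def concat():     f = io.StringIO(); f.write(x + '\n')
def joined():     f = io.StringIO(); f.write(''.join([x, '\n']))

for fn in (two_writes, concat, joined):
    print(fn.__name__, timeit.timeit(fn, number=20))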
thank you, @lesscomfortable
Works, thank you Stas!
That unfortunately is not possible out of the box due to git’s security - so each developer needs to make sure this is indeed the case every time they make a new git clone. The instructions are here.
Many apologies @stas!
Not at all, @jeremy. All is good.
Can you please confirm that my understanding is correct - I’m new to python, so I just want to make sure we get the best speed there.
I understand; what I meant is that while we have people using nbstripout committing every day, we can catch these bugs quickly before they spread too much. Having said this, it is ideal (and expected) that every contributor uses nbstripout, so we don’t need to trace back.
Sorry, I broke it - I will use nbstripout in the future.
Best regards
Michael