jq works fine on Windows, thanks for checking. The shells available on Windows are PowerShell and cmd - most people use both. (There’s also bash, but it isn’t fully integrated with standard Windows tooling.)
@stas: I’ve not played with jq filters before, so I had a quick look to see how they compare to comprehensions. It was fun! I had a quick try at replicating your jq filter and I think this does the same thing (not well tested):
try: import rapidjson as json
except ImportError: import json

def clean_cell(o, code=True):
    # drop volatile fields; keep only the 'hide_input' metadata key
    o.pop('execution_count', None)
    o['metadata'] = {k:v for k,v in o['metadata'].items() if k=='hide_input'}
    if code: o.pop('outputs', None)
    return o

def clean_nb(s, code=True):
    s['cells'] = [clean_cell(o, code) for o in s['cells']]
    s['metadata'] = {k:v for k,v in s['metadata'].items() if k=='kernelspec'}

s = json.load(open('jsontest.ipynb'))
clean_nb(s, True)
(One minor difference - I figured it’s fine to leave ‘hide_input’ in code notebooks too.)
What do you think of this approach overall? I haven’t checked to see whether it’s actually performant, although I’m guessing it should be OK since the main time is presumably spent on json parsing, which rapidjson seems pretty good at.
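If anyone wants to sanity-check that assumption, a rough timing harness along these lines should do - just a sketch, with jsontest.ipynb standing in for whatever notebook you have handy:

import timeit, json, rapidjson

# read the raw notebook text once so only the parsing is timed
raw = open('jsontest.ipynb').read()

print('json:     ', timeit.timeit(lambda: json.loads(raw), number=100))
print('rapidjson:', timeit.timeit(lambda: rapidjson.loads(raw), number=100))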
Thank you for the code, Jeremy. That was useful. I’m on it.
First, ujson is out of the game since it doesn’t support the separators argument in dumps(), and its output format is different from what jupyter or json generates. rapidjson doesn’t support that argument either, but it generates json's default format, which matches what we need. Observe:
import ujson, rapidjson, json
i = '{ "a": [ {"c":1, "d": [] } ], "b": {} }'
print("\njson:", json.dumps(json.loads(i), separators=(',', ': '), sort_keys=True, indent=1))
print("\nrapidjson:", rapidjson.dumps(ujson.loads(i), sort_keys=True, indent=1))
print("\nujson:", ujson.dumps(ujson.loads(i), sort_keys=True, indent=1))
which gives:
json: {
 "a": [
  {
   "c": 1,
   "d": []
  }
 ],
 "b": {}
}

rapidjson: {
 "a": [
  {
   "c": 1,
   "d": []
  }
 ],
 "b": {}
}

ujson: {
 "a":[
  {
   "c":1,
   "d":[
   ]
  }
 ],
 "b":{
 }
}
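Side note, as far as I know the stdlib: since Python 3.4, json.dumps() already defaults to separators=(',', ': ') whenever indent is not None, so the explicit separators argument above should be redundant:

import json

i = '{ "a": [ {"c":1, "d": [] } ], "b": {} }'
# with indent set, spelling out the separators changes nothing
assert (json.dumps(json.loads(i), indent=1)
        == json.dumps(json.loads(i), separators=(',', ': '), indent=1))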
The new python version is here. @jeremy, @sgugger please kindly check that it works for you.
I assume you already have done:
git config --local include.path '../.gitconfig'
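(For context, that include wires up git clean filters. The general shape is something like this - the filter name and path here are illustrative, not necessarily our actual .gitconfig:)

[filter "fastai-nbstripout"]
        clean = tools/fastai-nbstripout
        required = true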
I will write more details in the next comment
tl;dr:
So, first, a surprise: this new implementation is actually some 25-30% faster with json than with rapidjson - I have no idea why, unless I made some mistake in the code.
It also beats jq by 40% or so. Whoa!
Here are the full benchmarks:
Benchmark the new nbstripout script once with json, then with rapidjson
rm -r json rapid
mkdir json rapid
cp dev_nb/*ipynb json
cp dev_nb/*ipynb rapid
# insert fake input that we will then strip - to actually test performance
perl -pi -e 's|"execution_count": null,|"execution_count": 99,|' json/* rapid/* jq/*
perl -pi -e 's|"outputs": \[\],|"outputs": [ { "data": { "text/html": [ "<br/>" ], "text/plain": [ "x" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "blah()" ],|' json/* rapid/* jq/*
perl -pi -e 's|"metadata": \{\},|"metadata": {"collapsed": true},|' json/* rapid/* jq/*
echo "new json"; time tools/fastai-nbstripout-fast json/*
echo "new rapidjson"; time tools/fastai-nbstripout-fast -r rapid/*
diff -ru json rapid
new json:
real 0m0.042s
user 0m0.038s
sys 0m0.004s
new rapidjson:
real 0m0.068s
user 0m0.054s
sys 0m0.014s
I will remove the -r option (which loads rapidjson instead of json) once others confirm as well that json beats rapidjson in this setup.
Benchmark the current nbstripout vs new python version vs jq script
rm -r json rapid jq
mkdir json rapid jq
cp dev_nb/*ipynb json
cp dev_nb/*ipynb rapid
cp dev_nb/*ipynb jq
# insert fake input that we will then strip - to actually test performance
perl -pi -e 's|"execution_count": null,|"execution_count": 99,|' json/* rapid/* jq/*
perl -pi -e 's|"outputs": \[\],|"outputs": [ { "data": { "text/html": [ "<br/>" ], "text/plain": [ "x" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "blah()" ],|' json/* rapid/* jq/*
perl -pi -e 's|"metadata": \{\},|"metadata": {"collapsed": true},|' json/* rapid/* jq/*
echo "current"; time tools/fastai-nbstripout json/*
echo "new"; time tools/fastai-nbstripout-fast rapid/*
echo "jq"; time tools/fastai-nbstripout-jq jq/*
diff -ru json rapid
diff -ru json jq
current:
real 0m0.613s
user 0m0.585s
sys 0m0.028s
new:
real 0m0.046s
user 0m0.037s
sys 0m0.008s
jq:
real 0m0.065s
user 0m0.055s
sys 0m0.010s
So this new script is now about 13 times faster than nbconvert and 1.4 times faster than jq (on the dev_nb/*ipynb notebooks).
I’m not surprised about nbconvert’s slowness - I looked through its code and saw that it loads a motherlode of things to support a gazillion different variations of various nb formats.
And of course the new script doesn’t do any of the install/uninstall handling - we already have it preconfigured.
Further, Jeremy suggested dropping argparse - but I will leave it as is for now. Unless someone beats me to it, I don’t see a reason to change that until we have so many notebooks that using git gets slow again.
If we really need very high performance, it’d probably be much faster to run nbstripout as a daemon and not re-start it for each file as git does now.
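To sketch that daemon idea - very rough, and strip_notebook() is a hypothetical stand-in for whatever the script does per file - start python once and feed it paths over a pipe:

import sys, json

def strip_notebook(path):
    # hypothetical: apply the same cleanup the script does, in place
    with open(path) as f: s = json.load(f)
    # ... strip outputs/execution counts/metadata here ...
    with open(path, 'w') as f:
        f.write(json.dumps(s, sort_keys=True, indent=1, ensure_ascii=False))

# one path per line until EOF - interpreter startup is paid only once
for line in sys.stdin:
    strip_notebook(line.rstrip('\n'))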
The only difference I noticed so far with the new implementation is that utf8 characters got encoded:
- fill:str='█'):\n",
+ fill:str='\u2588'):\n",
Original:
cat dev_nb/004b_alternative_progress_bar.ipynb | grep print_progress
"def print_progress(iteration:int, total:int, prefix:str='', suffix:str='', decimals:int=1, length:int=50, fill:str='█'):\n",
With old script:
tools/fastai-nbstripout -t dev_nb/004b_alternative_progress_bar.ipynb | grep print_progress
"def print_progress(iteration:int, total:int, prefix:str='', suffix:str='', decimals:int=1, length:int=50, fill:str='█'):\n",
With new script:
tools/fastai-nbstripout-fast -t dev_nb/004b_alternative_progress_bar.ipynb | grep print_progress
"def print_progress(iteration:int, total:int, prefix:str='', suffix:str='', decimals:int=1, length:int=50, fill:str='\u2588'):\n",
Not sure why: for some reason my output is ASCII and not UTF-8, even though I configured both stdout and the file write as utf-8 - odd.
Fixed: must pass ensure_ascii=False to json.dumps() to avoid UTF-8 characters being escaped to ASCII:
x = json.dumps(s, sort_keys=True, indent=1, ensure_ascii=False)
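A minimal demonstration of the difference:

import json
print(json.dumps('█'))                      # "\u2588" - escaped to ASCII
print(json.dumps('█', ensure_ascii=False))  # "█" - kept as-is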
Before I found this solution, this workaround did the trick (posted here for posterity):
x = json.dumps(s, sort_keys=True, indent=1)
x = x.encode('ascii', 'backslashreplace')
x = x.decode('unicode_escape')
Works great for me! I’ve replaced the old one and removed the jq one now.
Hey @stas,
I am trying the new script and am getting
Traceback (most recent call last):
File "tools/fastai-nbstripout", line 77, in <module>
s = json.load(input_stream)
File "/home/user/anaconda3/envs/fastai/lib/python3.6/json/__init__.py", line 299, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/home/user/anaconda3/envs/fastai/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/home/user/anaconda3/envs/fastai/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/user/anaconda3/envs/fastai/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 371 column 5 (char 11954)
Do you have any clue how to fix it? I would think it’s a bug but since you and Jeremy made it work fine I guess it might be another (local) problem.
Thanks!
@lesscomfortable try running it manually on a notebook. Does this happen on all notebooks? If not, which does it happen with?
What Jeremy said. Also, the trace shows that you’re sending JSON to the program via STDIN - what are you sending, and is it valid JSON? In order to help you, we need to know what data you’re having difficulty with. Specifically, test:
cat somenb.ipynb | tools/fastai-nbstripout
which is the mode in which it’s failing for you. But do test:
tools/fastai-nbstripout somenb.ipynb
as well.
The new script could probably use a more user-friendly assert - so having your problematic input would be a good way to improve it. Perhaps you had a failed merge, which could break the nb.
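Something along these lines would make the failure friendlier - just a sketch; input_stream is from the traceback above, and filename is a hypothetical variable holding the current file’s name:

import sys, json

try:
    s = json.load(input_stream)
except json.JSONDecodeError as e:
    # name the offending file and where its JSON breaks, then bail cleanly
    sys.exit(f"{filename}: not valid notebook JSON: {e}")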
Yes, the problematic input is 001a_nn_basics.ipynb. I tried
cat somenb.ipynb | tools/fastai-nbstripout
on four other notebooks and it works correctly. I have not edited 001a_nn_basics.ipynb locally, so I guess one of the last PRs created a JSON format problem in the file, but I can’t find the bug (the last commit to that file was 2 days ago).
Regarding
tools/fastai-nbstripout somenb.ipynb
I am getting
Traceback (most recent call last):
File "…/tools/fastai-nbstripout", line 73, in <module>
f.write(x+f.write("\n"))
TypeError: must be str, not int
And it deletes everything I had in the input file.
Do you figure these two problems are related?
Someone committed a broken 001a_nn_basics.ipynb - it is now fixed, please git pull. Thank you for catching that, @lesscomfortable.
That someone changed the notebook in a text editor and committed it without validating that it’s valid JSON (via loading it in jupyter notebook), and clearly without using the nbstripout filter, or it’d have croaked. Please let’s make sure that all committers use the filter.
BTW, instead of loading it in the jupyter notebook, there are all kinds of CLI tools to validate JSON, e.g. https://jsonlint.com/:
jsonlint-php nb.ipynb
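Python’s stdlib can validate too, with nothing extra to install:
python -m json.tool nb.ipynb > /dev/null && echo valid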
Do you figure these two problems are related?
Absolutely, it’s exactly the same problem - one reads from STDIN, the other from a file. So the other way should work now too.
The fact that nbstripout runs automatically when committing makes sure that we will find bugs of this type quickly.
I am still getting the same problem when running
tools/fastai-nbstripout somenb.ipynb
I get:
Traceback (most recent call last):
File "…/tools/fastai-nbstripout", line 73, in <module>
f.write(x+f.write("\n"))
TypeError: must be str, not int
And my file is empty after running it. Do you have any hypothesis on what might be causing this to happen?
I am still getting the same problem when running
tools/fastai-nbstripout somenb.ipynb
Yup, @jeremy broke it. It’s fixed now (git pull).
Jeremy, wrt your attempted change: I already considered that. write(x+"\n") would be very inefficient, since it needs to re-copy a huge chunk of data just to add "\n" to it - hence the two write calls, which I measured to be negligible compared to a single write call. Unless you know that str1+str2 is not a hard copy in the latest python. Speed is very critical in this script.
I originally used write("".join([x, "\n"])), but it looked ugly; join is supposed to not copy.
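For the curious, the three variants are easy to compare with a rough micro-benchmark (just an illustration, not what the script ships):

import io, timeit

x = 'x' * 10_000_000  # stand-in for a big dumped notebook

def two_writes(): f = io.StringIO(); f.write(x); f.write('\n')
def concat():     f = io.StringIO(); f.write(x + '\n')
def joined():     f = io.StringIO(); f.write(''.join([x, '\n']))

for fn in (two_writes, concat, joined):
    print(fn.__name__, timeit.timeit(fn, number=20))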
thank you, @lesscomfortable
Works, thank you Stas!
That unfortunately is not possible out of the box due to git’s security - so each developer needs to make sure this is indeed the case every time they make a new git clone. The instructions are here.
Many apologies @stas!
Not at all, @jeremy. All is good.
Can you please confirm that my understanding is correct - I’m new to python, so I just want to make sure we get the best speed there.
I understand; what I meant is that while we have people using nbstripout committing every day, we can catch these bugs quickly before they spread too much. Having said this, it is ideal (and expected) that every contributor uses nbstripout, so we don’t need to trace back.
Sorry, I broke it - I will use nbstripout in the future.
Best regards
Michael