I wrote a new stripout script, tools/fastai-nbstripout-jq, that uses jq. It runs about 10-20 times faster than the Python tools/fastai-nbstripout.
@jeremy, can you please have a look and tell me whether you’re happy with it, and whether we want to keep both versions, drop the Python version, or leave the decision for later?
To switch to the new version:
apt install jq
cd fastai_v1
git config --local --unset include.path
git config --local include.path '../.gitconfig-jq'
Alternatively, here is a universal version that picks the right config depending on whether jq is installed, which makes it suitable for scripting:
git config --local include.path '../.gitconfig'$(if command -v jq >/dev/null 2>&1; then echo -n -jq; fi);
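The one-liner above is compact but a little dense. The same choice can be spelled out as a small function. This is only a sketch: the function name pick_nbstripout_config is made up for illustration and is not part of the repo.

```shell
# Hypothetical helper (not in the repo): print the git config file to include,
# depending on whether jq is available on PATH.
pick_nbstripout_config() {
    if command -v jq >/dev/null 2>&1; then
        printf '%s\n' '../.gitconfig-jq'   # jq found: use the jq-based filter
    else
        printf '%s\n' '../.gitconfig'      # no jq: fall back to the python filter
    fi
}

# Usage, from inside the fastai_v1 checkout:
#   git config --local include.path "$(pick_nbstripout_config)"
```

The path is relative to the .git directory, same as in the commands above.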
Here is a quick benchmark:
Run on 3 docs notebooks, deliberately in non-docs mode so that the filter has to do real work stripping outputs and some metadata:
cd docs
mkdir test1
mkdir test2
cp fa*.ipynb test1
cp fa*.ipynb test2
# identical inputs
diff -ru test1 test2
time ../tools/fastai-nbstripout test1/*
real 0m0.223s
user 0m0.192s
sys 0m0.032s
time ../tools/fastai-nbstripout-jq test2/*
real 0m0.016s
user 0m0.013s
sys 0m0.003s
# identical outputs
diff -ru test1 test2
In this run that is roughly 14 times faster in real time (0.223s vs 0.016s), in line with the 10-20x range, and I no longer notice any delay when using git.
For posterity (and for anyone searching for a similar solution), here are the 2 filters. One of them took a fair amount of trial and error to figure out:
### filter for doc nbs ###
# 1. reset execution_count
# 2. keep only certain cell metadata fields
# 3. keep only certain nb metadata fields
# to add more metadata entries to keep do:
# if (.key == "key1" or .key == "key2")
filter_docs='
(.cells[] | select(has("execution_count")) | .execution_count) = null
| .cells[].metadata |= with_entries(
    if (.key == "hide_input")
    then .      # keep the entry
    else empty  # delete the entry
    end)
| .metadata = {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}}
'
### filter for code nbs ###
# 1. reset execution_count
# 2. delete cell's outputs
# 3. delete cell's metadata
# 4. keep only certain nb metadata fields
filter_code='
(.cells[] | select(has("execution_count")) | .execution_count) = null
| (.cells[] | select(has("outputs")) | .outputs) = []
| .cells[].metadata = {}
| .metadata = {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}}
'
To run, with $filter holding one of the two filters above and $file the notebook path:
jq --indent 1 "$filter" "$file"
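To see what the code filter actually does, here is a small self-contained sketch that feeds it a tiny notebook. The notebook JSON is made up for the example (it is not one of the fastai notebooks), and the filter is copied from filter_code above. It assumes jq is installed.

```shell
# A made-up one-cell notebook: it has an execution count, cell metadata,
# an output, and extra notebook-level metadata (language_info).
nb='{"cells":[{"cell_type":"code","execution_count":3,"metadata":{"scrolled":true},"outputs":[{"output_type":"stream","name":"stdout","text":["hi\n"]}],"source":["print(\"hi\")"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":2}'

# Same filter as filter_code above.
filter_code='
  (.cells[] | select(has("execution_count")) | .execution_count) = null
| (.cells[] | select(has("outputs")) | .outputs) = []
| .cells[].metadata = {}
| .metadata = {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}}
'

# Read the notebook from stdin and print the stripped version.
stripped="$(printf '%s' "$nb" | jq --indent 1 "$filter_code")"
printf '%s\n' "$stripped"
```

In the result the cell keeps its source, while execution_count becomes null, outputs and cell metadata are emptied, and only kernelspec is left in the notebook metadata.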