Just stripped the docs nbs.
I’ve noticed that this strip out filter is causing merging/stashing problems:
git stash
Saved working directory and index state WIP on master: c945e18 - cleanup duplicated imports - use imported shortcut PIL.Image -> Image
git pull origin master
From github.com:fastai/fastai_v1
* branch master -> FETCH_HEAD
error: Your local changes to the following files would be overwritten by merge:
docs/fastai_v1.nb_001b.ipynb
docs/fastai_v1.nb_002.ipynb
docs/fastai_v1.nb_002b.ipynb
Please commit your changes or stash them before you merge.
Aborting
Note I haven’t touched those files under docs/.
I think git stash relies on git diff to see which changes need to be stashed, and since git diff with the filter returns nothing, nothing gets stashed. Then git pull sees a conflict and aborts. Any idea how to resolve that?
I ended up overwriting docs with git checkout docs, but perhaps there is a better way.
Also you may have noticed that nbstripout is slow. This article suggests using jq, a lightweight and flexible command-line JSON processor, which is supposed to be much faster.
The author gives an example of the setup he uses right now to accomplish something similar to nbstripout:
jq --indent 1 \
'
(.cells[] | select(has("outputs")) | .outputs) = []
| (.cells[] | select(has("execution_count")) | .execution_count) = null
| .metadata = {"language_info": {"name":"python", "pygments_lexer": "ipython3"}}
| .cells[].metadata = {}
' 01-parsing.ipynb
Had the same problem, and basically removed everything before re-cloning.
Is it a one-time thing because they were stripped recently?
The speed issue seems to be because nbformat/reader.py imports json. If you instead import ujson as json, it’ll go much faster.
A simple python file using dict comprehensions and ujson may get nearly all the way there.
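A minimal sketch of that idea (hypothetical, not any actual fastai script): blank out the fields git shouldn’t track using a dict comprehension, preferring ujson when it’s installed and falling back to stdlib json otherwise:

```python
# Hypothetical sketch: strip outputs, execution counts, and cell metadata.
try:
    import ujson as json  # faster, if installed
except ImportError:
    import json  # transparent fallback to stdlib

def strip_cell(cell):
    blank = {"outputs": [], "execution_count": None, "metadata": {}}
    # dict comprehension: keep every key, blanking the ones git shouldn't track
    return {k: blank[k] if k in blank else v for k, v in cell.items()}

nb = json.loads('{"cells": [{"source": "1+1", "outputs": ["2"], '
                '"execution_count": 1, "metadata": {"collapsed": true}}]}')
nb["cells"] = [strip_cell(c) for c in nb["cells"]]
print(json.dumps(nb))
```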
I understand that you’re suggesting a rewrite of nbstripout from scratch. I have a feeling the tricky part would be keeping the output format identical to nbformat’s.
For the heck of it I tried to hack ujson in place of json (since I can’t change nbformat/reader.py). But it didn’t work, since ujson’s API isn’t identical to json’s. I added in fastai-nbstripout:
import ujson
sys.modules['json'] = ujson
Traceback (most recent call last):
File "tools/fastai-nbstripout", line 120, in <module>
from nbformat import read, write, NO_CONVERT
File "/home/stas/anaconda3/envs/pytorch-dev/lib/python3.6/site-packages/nbformat/__init__.py", line 15, in <module>
from . import v2
File "/home/stas/anaconda3/envs/pytorch-dev/lib/python3.6/site-packages/nbformat/v2/__init__.py", line 27, in <module>
from .nbjson import reads as reads_json, writes as writes_json
File "/home/stas/anaconda3/envs/pytorch-dev/lib/python3.6/site-packages/nbformat/v2/nbjson.py", line 31, in <module>
class BytesEncoder(json.JSONEncoder):
AttributeError: module 'ujson' has no attribute 'JSONEncoder'
Some reports suggest that ujson is quite buggy; rapidjson has been suggested instead. Here are some recent benchmarks.
I’ve spent much less time thinking about this than you, so I’m happy to take your word for it. I can’t see that jq would make this any easier…
I’m always a little out of date…
I’m not quite sure yet. This is a totally new setup for me and I’ve been trying to wrap my head around it. I think we’re bound to hit ‘no free lunch’ here: if we commit only part of a file and it has changed upstream since it was checked out, a merge is unavoidable.
I think in this situation we need to disable the .gitattributes filters before doing git stash, so that git stash can make a clean diff without any filters interfering. In my very brief experience with git, it appears that git pull must succeed after git stash, and this is not the case right now.
So one easy way would be to run:
git config --local --unset include.path
which would turn off the stripping out setup. Then do the update/merge and then re-enable the stripping out again:
git config --local include.path '../.gitconfig'
If you find an easy solution please post. Meanwhile I’m experimenting with various ideas.
Thanks.
I tend to just do a local commit, instead of stashing. You can always revert it, or choose different merge strategies as required after pulling. Is that an option for you? (Obviously it would be better if stash worked - just looking for compromises, if they’re needed.)
Isn’t the default commit a local commit, i.e. git commit?
That’s the thing - git commit followed by git push aborts because HEAD has changed, and trying to rebase the local repo via git pull aborts because it can’t merge. Perhaps you mean something different by a local commit.
I meant git commit, not followed by push. So if you git commit, then git pull, it works fine.
I haven’t tried that. I will give it a try next time I encounter the same. Thank you for the suggestion, Jeremy.
At this moment the quoted approach seems to be a working workaround; Jeremy’s suggestion didn’t work. Observe:
$ git checkout my_branch
error: Your local changes to the following files would be overwritten by checkout:
dev_nb/experiments/caltech101.ipynb
dev_nb/experiments/comp_gs_numpy_torch.ipynb
dev_nb/experiments/pipelines_speed_comp.ipynb
dev_nb/experiments/random_resize_crop.ipynb
Please commit your changes or stash them before you switch branches.
Aborting
(note that I haven’t touched dev_nb/experiments/* - git thinks there are local changes because the version stored in git doesn’t match what the files look like after the strip-out filter runs)
$ git diff
$ git pull
Already up to date.
$ git config --local --unset include.path
$ git checkout my_branch
Switched to branch 'my_branch'
$ git config --local include.path '../.gitconfig'
$ git checkout my_branch
Already on 'my_branch'
And it works with stashing too in this situation (once the filter is disabled).
And now let’s fix the notebooks under dev_nb/experiments:
# force manual strip out
tools/fastai-nbstripout dev_nb/experiments/*ipynb
# disable the strip out filter to get git to see changes
git config --local --unset include.path
git commit -a
git push
# restore the filter
git config --local include.path '../.gitconfig'
Voila, this shouldn’t be a problem anymore.
There should be no need to disable the filter ever again (unless we change fastai-nbstripout, in which case we’ll need to rerun all notebooks through it manually, as in the code above, so that the version in git matches the locally stripped-out version).
Just to sum things up - I’m very happy that we facilitated the strip out process - it’s so much easier now to commit/PR changes to the notebooks! Thank you for supporting this change.
I wrote a new strip-out script, tools/fastai-nbstripout-jq, that uses jq. It runs about 10-20 times faster than nbstripout.
@jeremy, can you please have a look and tell me whether you’re happy with it, and whether we want to keep both versions, kill the python version, or there’s just no need to decide right now?
To switch to the new version:
apt install jq
cd fastai_v1
git config --local --unset include.path
git config --local include.path '../.gitconfig-jq'
Alternatively, here is a universal version that picks the right config depending on whether jq is installed, so it’s suitable for scripting:
git config --local include.path '../.gitconfig'$(if command -v jq >/dev/null 2>&1; then echo -n -jq; fi);
Here is a quick benchmark:
Running on 3 docs notebooks (deliberately in non-docs mode, so that the filter has to do some real work stripping outputs and some metadata):
cd docs
mkdir test1
mkdir test2
cp fa*.ipynb test1
cp fa*.ipynb test2
# identical inputs
diff -ru test1 test2
time ../tools/fastai-nbstripout test1/*
real 0m0.223s
user 0m0.192s
sys 0m0.032s
time ../tools/fastai-nbstripout-jq test2/*
real 0m0.016s
user 0m0.013s
sys 0m0.003s
# identical outputs
diff -ru test1 test2
This is about 10-20 times faster, and now I can’t really see any delays whatsoever when using git.
For posterity (and those searching for similar solutions) here are the 2 filters - one of which took a fair bit of trial and error to figure out:
### filter for doc nbs ###
# 1. reset execution_count
# 2. keep only certain cell metadata fields
# 3. keep only certain nb metadata fields
# to add more metadata entries to keep do:
# if (.key == "key1" or .key == "key2")
filter_docs='
(.cells[] | select(has("execution_count")) | .execution_count) = null
| .cells[].metadata |= with_entries(
if (.key == "hide_input")
then . # keep the entry
else empty # delete the entry
end)
| .metadata = {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}}
'
### filter for code nbs ###
# 1. reset execution_count
# 2. delete cell's outputs
# 3. delete cell's metadata
# 4. keep only certain nb metadata fields
filter_code='
(.cells[] | select(has("execution_count")) | .execution_count) = null
| (.cells[] | select(has("outputs")) | .outputs) = []
| .cells[].metadata = {}
| .metadata = {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}}
'
and to run:
jq --indent 1 "$filter" "$file"
Nice one! I definitely have been wanting git to be faster in fastai_v1, so this is valuable. But we also need to ensure that the UX for new devs is smooth - if they don’t have jq, we don’t want odd errors appearing in their console.
I know I’ve asked this before, and apologies if I didn’t understand your answer properly, but your jq solution makes me wonder again whether that same filter couldn’t be achieved nearly as easily with python and rapidjson (and for those without rapidjson installed, we can easily have it fallback to json with try/catch)? Do you think jq is significantly faster than rapidjson (if it is, then my suggestion doesn’t really help)?
Well, they will still need to install external libraries either way, so pip install whateverjson vs. apt install jq isn’t very different.
But I hear you. I have a few other things to deal with first and then I will give it a try.
I also think the current nbstripout doesn’t isolate the critical path of its real work, so all kinds of things happen in it that aren’t needed during the strip-out. All that install/uninstall cruft should really be separated from its actual application; those small overheads add up over many notebooks. Ideally, git would run such a filter as a daemon, so everything is loaded once and there is no loading overhead per notebook.
Thank you for checking it out, Jeremy.
I don’t think that’s quite true - json is stdlib, and we can have a transparent fallback to it.
OK, so we weren’t on the same page. I thought you were thinking about a new python implementation using ujson/rapidjson.
Now you have clarified that you were thinking about a transparent fallback. As I said earlier, it will take some workarounds, since json and ujson, for example, don’t have fully identical APIs.
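The traceback earlier in the thread shows the core of the mismatch: nbformat subclasses json.JSONEncoder, which ujson simply doesn’t provide. A quick check (the ujson branch is hedged, since it may not be installed):

```python
import json

# stdlib json exposes the JSONEncoder class that nbformat subclasses...
print(hasattr(json, "JSONEncoder"))  # True

try:
    import ujson
    # ...while ujson does not - the exact AttributeError from the traceback
    print(hasattr(ujson, "JSONEncoder"))
except ImportError:
    print("ujson not installed")
```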
And as I wrote earlier, this is transparent fallback as well (albeit during config):
git config --local include.path '../.gitconfig'$(if command -v jq >/dev/null 2>&1; then echo -n -jq; fi);
They have to run this command anyway on checkout. So if they have jq they’ll get the faster version; otherwise it’ll be the slow one.
And now that you clarified what you had in mind I can look at implementing it.
Thank you.
You can get a dict from a json file using a consistent API between rapidjson and json, AFAICT. The “incompatibility” section of their readme only mentions stuff that doesn’t seem to be needed for our purpose. So I’m thinking we just have something like:
try:
    import rapidjson as json
except ImportError:
    import json

with open('...') as f:
    d = json.load(f)
d['cells'] = [munge(o) for o in d['cells'] if do_keep(o)]
…and then munge does whatever is needed to change a cell, and do_keep returns True for those that shouldn’t be removed. Or something like that. I’d assume these would look quite similar to the jq filter you have above. Does that make sense?
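For illustration, here is one hypothetical fleshing-out of that sketch, mirroring the filter_code jq rules quoted above (reset execution_count, clear outputs, clear cell metadata, pin the kernelspec). The names munge, do_keep, and strip_nb are illustrative only, not the actual fastai script:

```python
# Hypothetical Python equivalent of the filter_code jq filter.
try:
    import rapidjson as json  # optional speedup
except ImportError:
    import json  # stdlib fallback

def do_keep(cell):
    # filter_code keeps every cell; this is just the hook from the sketch
    return True

def munge(cell):
    # blank the fields git shouldn't track, leave everything else intact
    cell = dict(cell)
    if "execution_count" in cell:
        cell["execution_count"] = None
    if "outputs" in cell:
        cell["outputs"] = []
    cell["metadata"] = {}
    return cell

def strip_nb(d):
    d = dict(d)
    d["cells"] = [munge(c) for c in d["cells"] if do_keep(c)]
    d["metadata"] = {"kernelspec": {"display_name": "Python 3",
                                    "language": "python", "name": "python3"}}
    return d

example = {"cells": [{"execution_count": 5, "outputs": ["x"],
                      "metadata": {"a": 1}}],
           "metadata": {"language_info": {"name": "python"}}}
print(json.dumps(strip_nb(example), indent=1))
```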
BTW, one issue to be aware of: the shell one-liner isn’t going to work on Windows (other than WSL, which isn’t usable for CUDA), and @sgugger and I are both developing on that platform - and we’re trying to ensure we provide good Windows support, since it’s still widely used in some places.