Edit
I no longer clean my notebooks for git. I want the outputs saved so that people can view them on GitHub when I upload the notebooks. Instead, I now save my work in .py files and only update the notebooks when I’m ready to show people. With this workflow, no one needs to review notebook diffs, so I can keep the outputs rather than keep the diffs clean.
Old post
Goal
To automatically clear outputs and metadata from Jupyter notebooks when adding them to a git repo.
Why
To make git commits and git diffs cleaner.
A solution
In a terminal, enter the following to install jq:
sudo apt-get install jq
“jq is a lightweight and flexible command-line JSON processor (‘sed for JSON data’).”
In your ~/.gitconfig file, add:
# Point git at a global attributes file so the filter applies in every repo
[core]
    attributesfile = ~/.gitattributes_global
# "clean" runs when a file is staged (git add); "smudge" runs on checkout
[filter "nbstrip_full"]
    clean = "jq --indent 1 \
        '(.cells[] | select(has(\"outputs\")) | .outputs) = [] \
        | (.cells[] | select(has(\"execution_count\")) | .execution_count) = null \
        | .metadata = {\"language_info\": {\"name\": \"python\", \"pygments_lexer\": \"ipython3\"}} \
        | .cells[].metadata = {} \
        '"
    smudge = cat
    required = true
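This defines a git filter named “nbstrip_full”. Its clean step pipes each notebook’s JSON through jq, which empties the cell outputs, resets the execution counts, and clears the metadata (keeping only minimal language info). Its smudge step is just cat, so files are left unchanged on checkout.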
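To make the jq expression a bit more concrete, here is a rough Python sketch of the same transformation. It is only an illustration: the names strip_notebook and KEPT_METADATA are mine, git itself only ever runs the jq command, and I have not checked that the output matches jq byte for byte.

```python
import copy
import json

# Notebook-level metadata that the jq filter keeps (copied from the filter above).
KEPT_METADATA = {"language_info": {"name": "python", "pygments_lexer": "ipython3"}}


def strip_notebook(nb: dict) -> dict:
    """Return a copy of a parsed .ipynb dict with outputs and metadata cleared.

    Hypothetical helper that mirrors the four steps of the jq filter.
    """
    cleaned = copy.deepcopy(nb)  # leave the caller's notebook untouched
    for cell in cleaned["cells"]:
        if "outputs" in cell:            # only code cells have outputs
            cell["outputs"] = []
        if "execution_count" in cell:    # only code cells have execution counts
            cell["execution_count"] = None
        cell["metadata"] = {}            # every cell gets empty metadata
    cleaned["metadata"] = copy.deepcopy(KEPT_METADATA)
    return cleaned


# A tiny made-up notebook, just to show the shape of the data.
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {"scrolled": True},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
            "source": ["print('hi')"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
    ],
    "metadata": {"kernelspec": {"name": "python3"}},
    "nbformat": 4,
    "nbformat_minor": 2,
}

# indent=1 mimics jq's --indent 1.
print(json.dumps(strip_notebook(example), indent=1))
```

Everything outside the outputs, execution counts, and metadata (the cell sources, nbformat, and so on) passes through unchanged, which is why the diffs end up showing only real edits.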
In your ~/.gitattributes_global file, add:
*.ipynb filter=nbstrip_full
This makes git add apply the nbstrip_full filter to every notebook you stage.
Gotchas with this solution
- For pre-existing notebooks, consider doing a do-nothing commit to apply the filter
- This filter setup makes doing a rebase more difficult (see link below for details)
- This filter setup is global. Unset the filter for a specific repo by adding
  *.ipynb -filter
  to the local .gitattributes file.
Credit
Tim Staley’s blog post: Making Git and Jupyter Notebooks play nice (Feb 2017)
This solution and the gotchas came from his post. See his post if you want to know why he uses this solution and how it works in more detail.
If anyone knows of a cleaner solution, please let me know.