Fastai-nbstripout: stripping notebook outputs and metadata for git storage

@lesscomfortable try running it manually on a notebook. Does this happen on all notebooks? If not, which does it happen with?

1 Like

What Jeremy said, and also the trace shows that you’re sending JSON to the program via STDIN - what are you sending - is it a valid JSON? In order to help you we need to know what data you’re having a difficulty with. Specifically test:

cat somenb.ipynb | tools/fastai-nbstrip

which is the mode in which it’s failing for you. But do test:

tools/fastai-nbstrip somenb.ipynb

as well.

The new script could probably use a more user-friendly assert - so having your problematic input would be a good way to improve it. Perhaps you had a failed merge, which could break the nb.

Yes, the problematic input is 001a_nn_basics.ipynb. I tried
cat somenb.ipynb | tools/fastai-nbstrip
on four other notebooks and it works correctly. I have not edited 001a_nn_basics.ipynb locally so I guess one of the last PR’s has created a JSON format problem in the file but I can’t find the bug (last commit was 2 days ago for that file)

Regarding

tools/fastai-nbstrip somenb.ipynb

I am getting

Traceback (most recent call last):
File “…/tools/fastai-nbstripout”, line 73, in
f.write(x+f.write("\n"))
TypeError: must be str, not int

And it deletes everything I had in the input file.

Do you figure these two problems are related?

Someone committed a broken 001a_nn_basics.ipynb - it is now fixed - please git pull - thank you for catching that, @lesscomfortable.

That someone changed the notebook in the text editor and committed it without validating that it’s valid JSON (via loading in jupyter notebook) and clearly definitely not using the nbstripout filter, or it’d have croaked. Please let’s make sure that all committers use the filter.

btw, instead of loading in the jupyter notebook, there are all kinds of CLI tools to validate JSON, e.g.: https://jsonlint.com/

 jsonlint-php nb.ipynb

Do you figure these two problems are related?

Absolutely, it’s exactly the same, just one reads from STDIN another from file. So the other way should work now too.

1 Like

The fact that nbstripout runs automatically when committing makes sure that we will find bugs of this type quickly.

I am still getting the same problem when running

tools/fastai-nbstrip somenb.ipynb

I get:

Traceback (most recent call last):
File “…/tools/fastai-nbstripout”, line 73, in
f.write(x+f.write("\n"))
TypeError: must be str, not int

And my file is empty after running it. Do you have any hypothesis on what might be causing this to happen?

I am still getting the same problem when running
tools/fastai-nbstrip somenb.ipynb

Yup, @jeremy broke it :wink: It’s fixed now (git pull)

Jeremy, wrt your attempted change: I already considered that: write(x+"\n") would very inefficient since it’ll need to re-copy a huge chunk of data to just add “\n” to it - hence the two write calls which I measured to be negligible as compared to a single write call. Unless you know that str1+str2 is not a hard copy in the latest python. Speed is very critical in this script.

I originally used write("".join([x, "\n"])) but it looked ugly. join is supposed to not copy.

thank you, @lesscomfortable

1 Like

Works, thank you Stas!

1 Like

That unfortunately is not possible out of the box due to git’s security - so each developer needs to make sure this is indeed the case every time they make a new git clone. The instructions are here.

Many apologies @stas!

Not at all, @jeremy. All is good.

Can you please confirm that my understanding is correct - I’m new to python, so I just want to make sure we get the best speed there.

I understand, what I meant is that while we have people using nbstripout committing every day we can catch these bugs quickly before they spread too much. Having said this, it is ideal (and expected) that every contributor uses nbstripout and we don’t need to trace back.

1 Like

Sorry, I broke it. - I will use nbstripout in the future.

Best regards
Michael

All is good :slight_smile:
It will happen automatically once you do this, Michael.

1 Like

This is indeed an excellent side effect of us using nbstripout, Francisco.

2 Likes

I think you’re right. Python strings are copy-on-write.

1 Like

Thank you all for your work on this, especially @stas.

I have wanted to solve exactly this problem for a repository of notebooks for quite a while and haven’t found anything that worked well for us. I saw Jeremy tweet about this a little while ago and finally had time to try it out today.

This was easy to understand, easy to setup in our repo, and worked perfectly out of the box.

The only thing I could possibly ask for at this point is figuring out a way to avoid each user having to run git config --local include.path '../.gitconfig' on checkout, but it sounds like that’s a security brick wall, which I get.

Thanks again!

1 Like

Since we can’t avoid this step due to git’s security, I guess the only thing we can do to make it simpler is to make it into a script, like tools/trust-origin-git-config:

#/bin/sh
# XXX: first check we are inside a repo, then enable
git config --local include.path '../.gitconfig'

So the sequence would be:

git clone git@github.com:fastai/fastai_v1.git 
cd fastai_v1
tools/trust-origin-git-config

or perhaps it can be called 'tools/run-upon-clone`, but it’s less explicit about its function. Other ideas are welcome.

p.s. @jeremy, what do you think? should we add this? I suppose implementing it in python would be better so that it works on windows.

That makes sense, it’s quite similar to what I am thinking about doing for my project (a company-internal collection of exploratory notebooks and tools).

In our case, we have a couple of similar setup steps that users likely want to run upon clone. My current plan is to bundle them all up into a single tools/bootstrap script that they need to run once, after which they are all set for both this and our other config details.

1 Like

Has there been some changes in the git? I noticed the last notebook I pushed wasn’t stripped.

No change. Perhaps you made a new clone and forgot to tell git to trust .gitconfig? check .git/config and see whether you have this entry:

[include]
        path = ../.gitconfig

While we can’t do the trusting step on behalf of the user, perhaps we can use some git hook to validate that it’s configured and prevent git push without it?