Developer chat

stas · November 26, 2018, 11:20pm

Excellent. So most likely it’s not a difference in physical drives then. Perhaps conda build ends up fetching a different mix of dependent packages than pypi. I will compare that next.

313V · November 27, 2018, 12:08am

unfortunately i’m out of time right now - but what i have figured out is that the problem arises while it is iterating over the tfms (around line 110 in fastai/vision/image.py); it segfaults sometime during that loop

maybe one of the tfms isn’t threadsafe

stas · November 27, 2018, 12:48am

Possible, but first, I now have a diff of package versions between two environments. It’d be good to compare apples to apples. So here are the packages conda vs pip that have different versions upon install:

-matplotlib==3.0.1
+matplotlib==3.0.2
-regex==2018.08.29
+regex==2018.11.22
-typing==3.6.4
+typing==3.6.6
-urllib3==1.23
+urllib3==1.24.1
-wheel==0.32.3
+wheel==0.31.1

plus there is a list of packages that only conda has, but I don’t think these are relevant.
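For anyone who wants to repeat this comparison, here is a minimal sketch of how such a diff could be produced from two `pip freeze` outputs. The function names and the package `example-only-pkg` are made up for illustration; only the matplotlib and wheel versions come from the list above.

```python
def parse_freeze(text):
    """Parse `pip freeze`-style lines ("name==version") into a dict."""
    pkgs = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without a pinned version.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pkgs[name.lower()] = version
    return pkgs


def diff_envs(env_a_text, env_b_text):
    """Return (changed, only_in_a, only_in_b) for two freeze listings."""
    a, b = parse_freeze(env_a_text), parse_freeze(env_b_text)
    changed = {n: (a[n], b[n]) for n in sorted(a.keys() & b.keys()) if a[n] != b[n]}
    only_in_a = sorted(a.keys() - b.keys())
    only_in_b = sorted(b.keys() - a.keys())
    return changed, only_in_a, only_in_b


if __name__ == "__main__":
    # "example-only-pkg" is a hypothetical package present in one env only.
    env_a = "matplotlib==3.0.1\nwheel==0.32.3\nexample-only-pkg==1.0\n"
    env_b = "matplotlib==3.0.2\nwheel==0.31.1\n"
    changed, only_a, only_b = diff_envs(env_a, env_b)
    for name, (va, vb) in changed.items():
        print(f"-{name}=={va}\n+{name}=={vb}")
    print("only in first env:", only_a)
```

In practice the two input strings would come from running `pip freeze` inside each environment and saving the output to a file.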

313V · November 27, 2018, 1:32am

ok i don’t totally understand why yet but the transform that is causing a seg fault is symmetric_warp

changing line 86 of test_vision_data_block.py to:

x_tfms = get_transforms(max_warp=0) # turn off symmetric_warp

will prevent the errors

stas · November 27, 2018, 5:13am

Thank you, Fred, for finding how to work around it. This is good, but it will just mask the problem.

And after many experiments on CI, I finally tracked it down.

It’s the pypi build for torch/torchvision that causes the segfault. conda build of the same works just fine. Seems to happen only on osx.

So rather than trying to fix the fastai code - we need to come up with a reproducible test case that we can give to pytorch developers so that they can fix it.

@313V, do you think you could do it? That is, take that test as it is while it segfaults and replace it with pure pytorch code that still segfaults?

I also asked here, if perhaps that segfault/trace was sufficient for them to know where the problem is in pytorch.

sgugger · November 27, 2018, 3:44pm

Breaking change: if you have already read the tutorial on creating a custom dataset, note that the methods show_xys and show_xyzs have moved to the ItemList level. The tutorial has been updated accordingly.

stas · November 27, 2018, 5:13pm

@313V, can you get a stack trace for the segfault? Apparently pypi and conda pytorch packages are linked against a different BLAS implementation.

There are a few resources that explain how to do it: 1, 2, 3.

First, I’d recommend getting the core dump and seeing whether it has any useful trace information. If not, it might require a build with extra gcc flags. But let’s hope we won’t need to go there.

If you do get the stack trace you can post it directly here https://github.com/pytorch/pytorch/issues/14359

Thank you for your help.
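As a lighter first step before digging into a core dump, Python's standard `faulthandler` module can dump the Python-level traceback when the process receives a fatal signal such as SIGSEGV. This does not replace a native backtrace; it only shows which Python line each thread was running at the time of the crash. A minimal sketch (the helper name and log path are made up for illustration):

```python
import faulthandler


def enable_crash_traces(log_path):
    """Write Python tracebacks of all threads to log_path on a fatal signal.

    The returned file object must stay open for as long as the handler
    is enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Hypothetical usage: call this once at the top of the failing test or script.
    handle = enable_crash_traces("segfault_trace.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
```

The same handler can be switched on without code changes by running `python -X faulthandler ...` or by setting the `PYTHONFAULTHANDLER` environment variable, in which case the traceback goes to stderr.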

313V · November 27, 2018, 6:53pm

posting there now

sgugger · November 27, 2018, 11:32pm

fastai v1.0.29 was just released with a lot of new stuff. See the changelog for all the details.

In parallel, fastprogress v0.1.16 was also released, for cosmetic purposes: since the progress bar isn’t a widget anymore, we can leave the HTML state in the output. @stas I believe this also solves your clear_output problem.

stas · November 27, 2018, 11:37pm

That’s wonderful. Thank you, @sgugger!

Rvbens · November 29, 2018, 12:50am

I found a small issue. On tabular data, if a column has no NaN in the training set but the same column has a missing value in the test set, the NaN in the test set is left unchanged, and at test time the model outputs NaN.

A little example: https://colab.research.google.com/drive/1js7ufUeNTMA2HmZ471zjbPvd4g5oSuAn

sgugger · November 29, 2018, 4:00pm

New functionality: the optimizer is no longer reset at each new call to fit (resetting it discards its state), and learn.save and learn.load now save and load the optimizer state as well as the model.
learn.load is backward compatible and will work on old saved models (without loading any optimizer state, obviously).
If someone really wants to reset the optimizer, they should call learn.opt.clear().
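A rough sketch of how the two workflows might look in user code. It only uses the calls named in this post (`fit`, `save`, `opt.clear()`); the helper functions themselves and the checkpoint name are made up for illustration, and `learn` is assumed to be an already constructed Learner whose optimizer exists.

```python
def fit_keeping_state(learn, epochs, checkpoint):
    """Train further; the optimizer state carries over from earlier fits."""
    learn.fit(epochs)
    # Per the announcement, this stores the optimizer state with the model.
    learn.save(checkpoint)


def fit_with_fresh_optimizer(learn, epochs):
    """Explicitly discard the optimizer state before training again."""
    learn.opt.clear()
    learn.fit(epochs)
```

For example, `fit_keeping_state(learn, 1, "stage-1")` followed later by `learn.load("stage-1")` should resume with the saved optimizer state, while `fit_with_fresh_optimizer(learn, 1)` starts the optimizer from scratch.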

sgugger · November 29, 2018, 4:16pm

Thanks for flagging. We discussed this with Jeremy, but since we can’t treat those NaNs like the others (there won’t be a nan column, for instance), we’ll throw an exception for now and it will be up to the user to fix it.
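To illustrate the kind of check being described, here is a minimal pure-Python sketch that finds columns with missing values in the test set but none in the training set, and raises if there are any. This is not the fastai implementation; the function names and the dict-of-lists data layout are assumptions made for the example.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def columns_missing_only_in_test(train, test):
    """Return columns with missing values in `test` but none in `train`.

    Both arguments map column name -> list of values.
    """
    bad = []
    for col, test_vals in test.items():
        train_vals = train.get(col, [])
        test_has_nan = any(is_missing(v) for v in test_vals)
        train_has_nan = any(is_missing(v) for v in train_vals)
        if test_has_nan and not train_has_nan:
            bad.append(col)
    return bad


def check_test_set(train, test):
    """Raise if the test set has NaNs that the training set never showed."""
    bad = columns_missing_only_in_test(train, test)
    if bad:
        raise ValueError(
            f"Columns {bad} have missing values in the test set but not in "
            "the training set; please fill them before predicting."
        )
```

With such a check in place, the user would fill the offending columns (for instance with a value computed from the training set) before predicting.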

313V · December 1, 2018, 4:05pm

nice work keeping optimizer state, that’s a nice improvement for when iteratively fitting in a notebook

MicPie · December 5, 2018, 7:47pm

After my last git pull I get the following error when I start training:

~/anaconda3/lib/python3.6/site-packages/fastprogress/fastprogress.py in write(self, line, table)
    207         if not table: self.text += line + "<p>"
    208         else:
--> 209             self.raw_text += line + "\n"
    210             self.text = text2html_table(self.raw_text)
    211 

TypeError: can only concatenate list (not "str") to list

Is this error reproducible (or is it just my machine)?

ABertl · December 5, 2018, 8:55pm

I’m getting the same thing.

ABertl · December 5, 2018, 9:10pm

Ok, found the problem. I needed to update fastprogress, as the way the fast.ai code interacts with it has changed. @MicPie
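For reference, the error message is what Python produces when a list is added to a string, which would fit the older fastprogress `write` receiving a list where it expected a str (an inference from the traceback, not something confirmed in this thread). The sketch below reproduces the failing expression and shows one way to check which fastprogress version is installed. Both helper names are made up, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def old_style_write(raw_text, line):
    """Mimic the failing statement from the traceback above."""
    raw_text += line + "\n"  # fails with TypeError if `line` is a list
    return raw_text


if __name__ == "__main__":
    print("fastprogress:", installed_version("fastprogress"))
    print(repr(old_style_write("", "epoch  train_loss")))
    try:
        old_style_write("", ["epoch", "train_loss"])
    except TypeError as err:
        print("TypeError:", err)
```

If the reported version is older than what the installed fastai expects, upgrading fastprogress is the fix described in this thread.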

sgugger · December 5, 2018, 9:43pm

Weird, how did you update your library? Normally conda would have updated fastprogress for you.

ABertl · December 5, 2018, 10:05pm

I used pip to upgrade fastprogress directly.

I’m embarrassed to say, I don’t know how to resolve an impasse I currently have between jupyter_contrib_nbextensions and pytorch. Until I can get that sorted out, conda env update isn’t working.

MicPie · December 6, 2018, 8:40pm

I am using the developer install and I could fix it with conda install -c fastai fastprogress to get the newest fastprogress version. Thanks for the fast help.