Developer chat


(Fred Monroe) #510

i recreated the bug, will investigate

big clue imho:
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.


(Stas Bekman) #511

Brilliant, Fred, looking forward to hearing about the details when you have time to share more.


(Fred Monroe) #512

setting num_workers = 0 makes the problem disappear, going to try to catch and print exceptions in worker thread
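A minimal sketch of that workaround in plain PyTorch, not the fastai test itself: with num_workers=0 the DataLoader loads every batch in the main process, so no worker subprocesses are started and errors surface directly in the main process. The toy dataset and batch size are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset (hypothetical): 10 samples with one feature each.
xs = torch.arange(10, dtype=torch.float32).unsqueeze(1)
ys = torch.arange(10)
dataset = TensorDataset(xs, ys)

# num_workers=0 means batches are loaded in the main process
# (no worker subprocesses are spawned).
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

# Iterate once and record how many samples each batch holds.
batch_sizes = [xb.shape[0] for xb, yb in loader]
print(batch_sizes)
```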


(Stas Bekman) #513

But before you find a fix, how did you reproduce it?


(Fred Monroe) #514

i removed anaconda python from my path (.bash_profile)
installed python 3.6 from python website
followed the install script replacing python with python3 and pip with pip3 (bc on mac python 2.7 is builtin)
ran pip3 install -e .[dev] and pytest from a freshly git-cloned repo

previously i was on anaconda, didn’t realize that would mask problem


(Stas Bekman) #515

Excellent. So most likely it’s not a difference in physical drives then. Perhaps conda build ends up fetching a different mix of dependent packages than pypi. I will compare that next.


(Fred Monroe) #516

unfortunately i’m out of time right now - but what i have figured out is that the problem arises while it is iterating across the tfms (around line 110 in fastai/vision/image.py), it segfaults sometime during that loop

maybe one of the tfms isn’t threadsafe


(Stas Bekman) #517

Possible, but first, I now have a diff of package versions between two environments. It’d be good to compare apples to apples. So here are the packages conda vs pip that have different versions upon install:

-matplotlib==3.0.1
+matplotlib==3.0.2
-regex==2018.08.29
+regex==2018.11.22
-typing==3.6.4
+typing==3.6.6
-urllib3==1.23
+urllib3==1.24.1
-wheel==0.32.3
+wheel==0.31.1

plus there is a list of packages that only conda has, but I don’t think these are relevant.
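A comparison like this can be scripted. Below is a minimal sketch, assuming each environment's package list was saved in `pip freeze` format (name==version). The helper names parse_freeze and diff_versions are hypothetical, env_a holds the "-" lines and env_b the "+" lines from the diff above, and examplepkg is a made-up entry with identical versions, included only to show that unchanged packages are filtered out.

```python
def parse_freeze(text):
    """Parse `pip freeze`-style lines ("name==version") into a dict."""
    pkgs = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries without a pinned version.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pkgs[name.lower()] = version
    return pkgs


def diff_versions(a, b):
    """Return {name: (version_a, version_b)} for packages in both envs whose versions differ."""
    return {n: (a[n], b[n]) for n in sorted(a.keys() & b.keys()) if a[n] != b[n]}


env_a = parse_freeze("""
matplotlib==3.0.1
regex==2018.08.29
typing==3.6.4
urllib3==1.23
wheel==0.32.3
examplepkg==1.0
""")

env_b = parse_freeze("""
matplotlib==3.0.2
regex==2018.11.22
typing==3.6.6
urllib3==1.24.1
wheel==0.31.1
examplepkg==1.0
""")

changed = diff_versions(env_a, env_b)
for name, (old, new) in changed.items():
    print(f"{name}: {old} -> {new}")
```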


(Fred Monroe) #518

ok i don’t totally understand why yet but the transform that is causing a seg fault is symmetric_warp

changing line 86 of test_vision_data_block.py to:

x_tfms = get_transforms(max_warp=0) # turn off symmetric_warp

will prevent the errors


(Stas Bekman) #519

Thank you, Fred, for finding how to work around it. This is good, but it will just mask the problem.

And after many experiments on CI, I finally tracked it down.

It’s the pypi build for torch/torchvision that causes the segfault. conda build of the same works just fine. Seems to happen only on osx.

So rather than trying to fix the fastai code, we need to come up with a reproducible test case that we can give to the pytorch developers so that they can fix it.

@313V, do you think you could do it? That is, take that test as it is while it segfaults and replace it with pure pytorch code that still segfaults?

I also asked here, if perhaps that segfault/trace was sufficient for them to know where the problem is in pytorch.


#520

Breaking change: if you have already read the tutorial on creating a custom dataset, note that the methods show_xys and show_xyzs have moved to the ItemList level. The tutorial has been updated accordingly.


(Stas Bekman) #521

@313V, can you get a stack trace for the segfault? Apparently pypi and conda pytorch packages are linked against a different BLAS implementation.

There are a few resources that explain how to do it: 1, 2, 3.

First, I’d recommend getting the core dump and seeing whether it has any useful trace information. If not, it might require a build with extra gcc flags. But let’s hope we won’t need to go there.

If you do get the stack trace you can post it directly here https://github.com/pytorch/pytorch/issues/14359

Thank you for your help.
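As a lighter first step before a core dump, Python's standard faulthandler module can print the Python-level traceback when the interpreter receives SIGSEGV. This is only a sketch under assumptions: it is POSIX-only, the child program here crashes itself on purpose instead of running the real failing test, and the output is a Python traceback, not the native C stack that a debugger would give.

```python
import signal
import subprocess
import sys

# Stand-in for the failing test: a child program that deliberately
# sends SIGSEGV to itself (POSIX only).
CRASHING_CODE = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def run_with_faulthandler(code):
    """Run `code` in a child interpreter with faulthandler enabled and capture stderr."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


result = run_with_faulthandler(CRASHING_CODE)

# A negative return code means the child was killed by that signal number.
crashed = result.returncode == -signal.SIGSEGV
print("killed by SIGSEGV:", crashed)
print(result.stderr)
```

The same effect should be available for a real test run by setting the PYTHONFAULTHANDLER=1 environment variable or calling faulthandler.enable() early in the program.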


(Fred Monroe) #522

posting there now


#523

fastai v1.0.29 was just released with a lot of new stuff. See the changelog for all the details.

In parallel, fastprogress v0.1.16 was also released, for cosmetic purposes: since the progress bar isn’t a widget anymore, we can leave the HTML state in the output. @stas I believe this also solves your clear_output problem.


(Stas Bekman) #524

That’s wonderful. Thank you, @sgugger!


(Rubén Chaves) #525

I found a small issue. On tabular data, if a column has no NaN in the training set but the same column has a missing value in the test set, the NaN in the test set is left unchanged, and at test time the model outputs NaN.

A little example: https://colab.research.google.com/drive/1js7ufUeNTMA2HmZ471zjbPvd4g5oSuAn
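The behavior can be illustrated without fastai. The sketch below is not the library's actual code: it assumes a fill step that learns a median only for columns that contain NaN in the training rows, and the helper names fit_fill_values and apply_fill and the toy rows are hypothetical.

```python
import math
from statistics import median


def fit_fill_values(train_rows, columns):
    """Learn a fill value (median) only for columns that contain NaN in the training rows."""
    fill = {}
    for col in columns:
        values = [row[col] for row in train_rows]
        if any(math.isnan(v) for v in values):
            fill[col] = median(v for v in values if not math.isnan(v))
    return fill


def apply_fill(rows, fill):
    """Replace NaN with the learned fill value; columns without one are left untouched."""
    return [
        {col: (fill[col] if col in fill and math.isnan(v) else v) for col, v in row.items()}
        for row in rows
    ]


# "age" has no NaN in training, "income" does.
train = [
    {"age": 30.0, "income": math.nan},
    {"age": 40.0, "income": 50.0},
    {"age": 50.0, "income": 70.0},
]
# At test time both columns are missing.
test = [{"age": math.nan, "income": math.nan}]

fill = fit_fill_values(train, ["age", "income"])
filled_test = apply_fill(test, fill)

print(fill)
print(filled_test)
```

Here "income" gets filled because a fill value was learned for it, while the NaN in "age" passes through to the model unchanged.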


#526

New functionality: the optimizer is no longer reset at each new call to fit (resetting it discarded its state), and learn.save and learn.load now save and load the optimizer state as well as the model.
learn.load is backward compatible, and will work on old saved models (without loading any optimizer state obviously).
If someone really wants to reset the optimizer, they should call learn.opt.clear().
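For readers curious what saving optimizer state alongside the model looks like in plain PyTorch, here is a minimal sketch. It is not fastai's implementation: save_checkpoint and load_checkpoint are hypothetical helpers, and the "model"/"opt" dictionary layout is an assumption chosen for illustration. The fallback branch mimics the backward-compatible case of an old file holding only model weights.

```python
import io

import torch
from torch import nn


def save_checkpoint(model, opt, f):
    """Save model weights and optimizer state together (assumed layout)."""
    torch.save({"model": model.state_dict(), "opt": opt.state_dict()}, f)


def load_checkpoint(model, opt, f):
    """Load a checkpoint; return True if optimizer state was restored as well."""
    state = torch.load(f)
    if "model" in state and "opt" in state:
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        return True
    # Old-style file: only model weights, so the optimizer is left as it is.
    model.load_state_dict(state)
    return False


torch.manual_seed(0)
model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One training step so the optimizer has momentum buffers to save.
model(torch.randn(8, 4)).sum().backward()
opt.step()

buf = io.BytesIO()
save_checkpoint(model, opt, buf)
buf.seek(0)

new_model = nn.Linear(4, 2)
new_opt = torch.optim.SGD(new_model.parameters(), lr=0.1, momentum=0.9)
restored = load_checkpoint(new_model, new_opt, buf)
print("optimizer state restored:", restored)
```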


#527

Thanks for flagging. We discussed it with Jeremy, but since we can’t treat those NaNs like the others (there won’t be a nan column, for instance), we’ll throw an exception for now and it will be up to the user to fix it.


(Fred Monroe) #528

nice work keeping optimizer state, that’s a nice improvement for when iteratively fitting in a notebook


(Michael) #529

After my last git pull I get the following error when I start training:

~/anaconda3/lib/python3.6/site-packages/fastprogress/fastprogress.py in write(self, line, table)
    207         if not table: self.text += line + "<p>"
    208         else:
--> 209             self.raw_text += line + "\n"
    210             self.text = text2html_table(self.raw_text)
    211 

TypeError: can only concatenate list (not "str") to list

Is this error reproducible (or is it just my machine)?