[Adv] Significant changes to fastai just pushed

dillon · November 13, 2017, 6:24pm

@jeremy it looks like a few people have seen similar issues and it boils down to import order. Still not 100% sure why that is.

e.g. https://github.com/facebookresearch/ParlAI/issues/331 and https://github.com/pytorch/pytorch/issues/2083

In any case, I resolved it by importing torch before anything else, which seems to do the trick.
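
If you want to check a script or an exported notebook for this, here is a minimal sketch using only the standard library. The helper name `torch_imported_first` is made up for illustration and is not part of fastai or PyTorch; it only looks at top-level imports in the source text and does not import anything itself.

```python
import ast


def torch_imported_first(source: str) -> bool:
    """Return True if the first top-level import in `source` is torch."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            # `import torch` or `import torch.nn as nn`
            return node.names[0].name.split(".")[0] == "torch"
        if isinstance(node, ast.ImportFrom):
            # `from torch import nn`; module is None for `from . import x`
            return (node.module or "").split(".")[0] == "torch"
    return False  # the source has no top-level imports at all


good = "import torch\nfrom fastai.imports import *\n"
bad = "from fastai.imports import *\nimport torch\n"
print(torch_imported_first(good), torch_imported_first(bad))
```

Running it on the two sample strings shows which one follows the workaround.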

jeremy · November 13, 2017, 9:09pm

Thanks @dillon - confirmed that works for me. Would be nice to find a better fix eventually!

dillon · November 13, 2017, 9:29pm

definitely! I’m still digging in. It seems to be a very odd bug. Will post back to this thread with any findings. Thanks for the help

jeremy · November 13, 2017, 9:35pm

@dillon when I run the first learn.fit line in lesson1 on Paperspace it totally freezes for me. Does it work for you?

miguel_perez · November 13, 2017, 9:44pm

Hi, same problems here, (I just logged in and shift + entered to see if it worked).

Imports don’t give errors with the fix, but the first chunk with learn.fit gives this error that wasn’t there before:

jeremy · November 13, 2017, 10:53pm

@miguel_perez try deleting your tmp folder, git pull, and conda env update.

miguel_perez · November 14, 2017, 8:32am

@jeremy did that now, + added source activate fastai to .bashrc (In AWS it was there already but not in my Paperspace).

result: error is gone, but learn.fit freezes forever, just as you described

jeremy · November 14, 2017, 2:03pm

@dillon just a note to see the post above from @miguel_perez - we’re both finding the freezing issue on Paperspace.

dillon · November 14, 2017, 6:50pm

@jeremy @miguel_perez For some reason I cannot repro on a new fast.ai instance. What I did:

git pull
conda env update
source activate fastai
add import torch above all other lines

Manually created a /cache/tmp directory and changed

os.makedirs('/cache/tmp', exist_ok=True)
!ln -fs /cache/tmp {PATH}

to

os.makedirs('./cache/tmp', exist_ok=True)
!ln -fs /cache/tmp {PATH}

(missing period on my system but maybe not relevant)

run through cells.
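
For comparison, the directory and symlink step can be sketched in plain Python so that the directory that gets created and the link target cannot drift apart. This is only an illustration under assumptions: `link_tmp` is a hypothetical helper, not something in the notebook, and the paths used below are throwaway temp directories rather than the real /cache/tmp and PATH.

```python
import os
import tempfile


def link_tmp(cache_tmp, data_path):
    """Create `cache_tmp` and point `data_path`/tmp at it, roughly like `ln -fs`."""
    os.makedirs(cache_tmp, exist_ok=True)
    os.makedirs(data_path, exist_ok=True)
    link = os.path.join(data_path, "tmp")
    if os.path.islink(link):
        os.remove(link)  # replace an existing link, similar to the -f flag
    # Use the absolute path so './cache/tmp' and '/cache/tmp' style mixups
    # cannot produce a dangling link. Raises if `link` is a real directory.
    os.symlink(os.path.abspath(cache_tmp), link)
    return link


base = tempfile.mkdtemp()
cache_dir = os.path.join(base, "cache", "tmp")
link = link_tmp(cache_dir, os.path.join(base, "data"))
print(link, "->", os.path.realpath(link))
```

The same function takes both paths, so the created directory is always the one being linked.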

miguel_perez · November 14, 2017, 8:09pm

@dillon, reconnected and tried again, even added the dot …

but no luck, same as before, kernel freezes there, forever.

jeremy · November 15, 2017, 2:09am

@dillon it seems it wasn’t really frozen forever - it was just taking a really long time to create the pytorch model for some reason. I’ve seen other mentions of that in this forum relating to paperspace. Eventually it finishes, and works fine. But restarting the notebook causes the extreme delay on model creation again.

It should take ~5 secs to create the model.
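
One way to tell a slow step from a real hang is to time it. A minimal sketch with the standard library only: `timed` is a hypothetical helper, and the `time.sleep` call is just a stand-in for whichever cell is slow (for example, the line that creates the model).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print, and optionally record, how long the wrapped block takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f}s")


timings = {}
with timed("model creation", timings):
    time.sleep(0.1)  # stand-in for the slow step
```

Comparing the recorded time right after a kernel restart with a second run in the same session would show whether the delay only hits the first model creation.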

miguel_perez · November 17, 2017, 9:30am

@dillon, any news about this problem? I would like to get used to Paperspace during the course, really like the environment… it’s a pity that course weeks are passing without using it.

miguel_perez · November 17, 2017, 10:09am

Hi @A_TF57, I read your post about trying to do conda env update on Crestle after git pull. But weirdly, Anaconda doesn’t seem to be installed. Did you install it yourself?

And more generally, have you been successfully using Crestle for lesson notebooks until now?

A_TF57 · November 17, 2017, 2:42pm

I did, but it seems I didn’t have to. If you read the FAQs, you can install extra packages using pip3 if you want to. Everything else is already set up for you to use.

Yes! I have been using it quite regularly.

miguel_perez · November 17, 2017, 3:23pm

Understood. I was wanting to do “conda env update” but I guess Crestle manages the environment by itself…

Your answer saved me from switching to the “install everything” mode. I will try pip3 and see if I can also make everything work, thanks a lot @A_TF57

dillon · November 17, 2017, 10:50pm

Hi guys, I’m not able to reproduce the issue 100% of the time but it does seem to be some weird interaction with import order. I have also not been able to reliably repro the issue of the freezing model creation.

I’m having the team here build out a new template to see if we can get a fresh base to work from.

Thanks for being so patient!

jeremy · November 18, 2017, 12:30am

Many thanks for the update @dillon . It’s great to know that you’re working on it.

BTW pytorch 0.3 is about to come out. It has a massive improvement in group convolution speed, so you’ll definitely want to include it when you can!

dillon · November 20, 2017, 9:56pm

OK, so I spent the last few days building out different versions of the template with different drivers/ base OS/etc.

It seems that the “dlopen: cannot load any more object with static TLS” error only occurs on Ubuntu 14.04, and when I switched to an Ubuntu 16.04 template that issue went away entirely.
That said, the new 16.04 template is now freezing on the learn.fit() consistently just as you mentioned. What I notice is that the CPU is getting hit really hard so I’m going to have to dig in to what this method is doing. From the outside it almost looks to me like there is a memory leak in the method. Virtual memory use explodes as CPU starts to really heat up.

@jeremy just as you mentioned it eventually works but something doesn’t look right and it can take 5 minutes to get going. This is on a brand new Ubuntu 16.04 template where I installed new drivers, fresh Conda install, etc.

If anyone is interested in helping debug I am happy to share my new instance

ramesh · November 20, 2017, 10:08pm

Sure. I can give it a try. Most likely later in the evening today. I am not using Paperspace at the moment, but used it in the past and will most likely return to it after a few months. So ping me if I can help.

jeremy · November 20, 2017, 10:28pm

@dillon I was seeing freezes on constructing learn, not on fitting. It might be worth doing a git pull and trying again, because I switched from ProcessPool to ThreadPool last night so if it was a multiprocessing issue this might well fix it.
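
To illustrate the kind of switch described here (this is not the actual fastai code, just a sketch with the standard library's concurrent.futures; `load_item` and `load_all` are made-up names):

```python
from concurrent.futures import ThreadPoolExecutor


def load_item(i):
    # Stand-in for per-item work such as reading and resizing one image.
    return i * i


def load_all(n, workers=4):
    # Threads run inside the current process, so no child processes are
    # started. Swapping in ProcessPoolExecutor here would start separate
    # worker processes instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_item, range(n)))


print(load_all(5))  # [0, 1, 4, 9, 16]
```

Both executors share the same interface, so trying one versus the other while debugging is a one-line change.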

We’re not seeing the problem on any other platform BTW, so it’s something specific to your config. Are you using docker? There are various threads around about docker and pytorch IIRC…