[Adv] Significant changes to fastai just pushed

@miguel_perez try deleting your tmp folder, git pull, and conda env update.

@jeremy did that now, and also added source activate fastai to my .bashrc (on AWS it was already there, but not on my Paperspace machine).
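
For reference, one way to do that from a terminal (a sketch - adjust the env name if yours differs):

echo 'source activate fastai' >> ~/.bashrc   # activates the env in every new shell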

Result: the error is gone, but learn.fit freezes forever, just as you described :frowning_face:

@dillon just a note to see @miguel_perez’s post above - we’re both hitting the freezing issue on Paperspace.

@jeremy @miguel_perez For some reason I cannot repro on a new fast.ai instance. What I did:

  1. git pull
  2. conda env update
  3. source activate fastai
  4. add import torch above all other lines
  5. Manually created a /cache/tmp directory and changed
os.makedirs('/cache/tmp', exist_ok=True)
!ln -fs /cache/tmp {PATH}

to

os.makedirs('./cache/tmp', exist_ok=True)
!ln -fs /cache/tmp {PATH}

(the leading period was missing on my system, but maybe that’s not relevant)

  6. Run through the cells.

@dillon, reconnected and tried again, even added the dot :grinning:

but no luck; same as before, the kernel freezes there, forever.

@dillon it seems it wasn’t really frozen forever - it was just taking a really long time to create the pytorch model for some reason. I’ve seen other mentions of that in this forum relating to paperspace. Eventually it finishes, and works fine. But restarting the notebook causes the extreme delay on model creation again.

It should take ~5 secs to create the model.
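
To put a number on it, here’s a minimal timing sketch (assuming the lesson-1 objects - resnet34 and the data object - are already defined as in the notebook):

import time
from fastai.conv_learner import *   # lesson-style import

start = time.time()
learn = ConvLearner.pretrained(resnet34, data, precompute=True)   # data from the lesson notebook
print('model creation took {:.1f}s'.format(time.time() - start))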

@dillon, any news about this problem? I would like to get used to Paperspace during the course - I really like the environment… it’s a pity that course weeks are passing without using it. :worried:

Hi @A_TF57, I read your post as you trying to do conda env update on Crestle after git pull. But weirdly, Anaconda doesn’t seem to be installed. Did you install it yourself?

And more generally, have you been successfully using Crestle for the lesson notebooks until now?

I did, but it seems I didn’t have to. If you read the FAQs, you can install extra packages using pip3 if you want to. Everything else is already set up for you.
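
For example, installing an extra package on Crestle looks like this (seaborn is just an arbitrary example):

!pip3 install seaborn   # from a notebook cell; drop the ! to run it in a terminal instead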

Yes! I have been using it quite regularly. :slight_smile:

Understood. I wanted to do “conda env update”, but I guess Crestle manages the environment by itself…

Your answer saved me from switching to the “install everything” mode. I will try pip3 and see if I can also make everything work. Thanks a lot @A_TF57

Hi guys, I’m not able to reproduce the issue 100% of the time, but it does seem to be some weird interaction with import order. I have also not been able to reliably repro the freezing on model creation.

I’m having the team here build out a new template to see if we can get a fresh base to work from.

Thanks for being so patient!

Many thanks for the update, @dillon. It’s great to know that you’re working on it.

BTW pytorch 0.3 is about to come out. It has a massive improvement in group convolution speed, so you’ll definitely want to include it when you can!
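
For context, a grouped convolution in pytorch is just nn.Conv2d with groups > 1 - a minimal sketch (the sizes are arbitrary):

import torch
import torch.nn as nn
from torch.autograd import Variable   # needed on the 0.2/0.3-era API

# 64 input channels split into 8 groups; each group is convolved independently,
# which is the case pytorch 0.3 speeds up
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=8)
x = Variable(torch.randn(1, 64, 56, 56))
print(conv(x).size())   # (1, 128, 56, 56)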

OK, so I spent the last few days building out different versions of the template with different drivers/base OS/etc.

  1. It seems that the dlopen: cannot load any more object with static TLS error only occurs on Ubuntu 14.04; when I switched to an Ubuntu 16.04 template, that issue went away entirely.
  2. That said, the new 16.04 template now freezes on learn.fit() consistently, just as you mentioned. What I notice is that the CPU is getting hit really hard, so I’m going to have to dig into what this method is doing. From the outside it almost looks like there is a memory leak in the method: virtual memory use explodes as the CPU starts to really heat up (see the monitoring sketch below).
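
If anyone wants to watch that happening, here’s a rough monitoring sketch (assumes psutil is installed; KERNEL_PID is a hypothetical placeholder for the notebook kernel’s pid, e.g. found via ps aux | grep ipykernel):

import psutil

proc = psutil.Process(KERNEL_PID)           # KERNEL_PID: placeholder - set it to the kernel's pid
for _ in range(10):
    vms = proc.memory_info().vms / 1e9      # the kernel's virtual memory, in GB
    cpu = proc.cpu_percent(interval=1.0)    # the kernel's CPU usage over a 1s sample
    print('vms={:.1f}GB cpu={:.0f}%'.format(vms, cpu))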

@jeremy just as you mentioned, it eventually works, but something doesn’t look right and it can take 5 minutes to get going. This is on a brand new Ubuntu 16.04 template where I installed new drivers, a fresh Conda install, etc.

If anyone is interested in helping debug I am happy to share my new instance :slight_smile:

Sure, I can give it a try - most likely later this evening. I am not using Paperspace at the moment, but I used it in the past and will probably return to it after a few months. So ping me if I can help.

@dillon I was seeing freezes on constructing the learner, not on fitting. It might be worth doing a git pull and trying again, because I switched from ProcessPool to ThreadPool last night, so if it was a multiprocessing issue this might well fix it.
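
For context, that change is roughly a swap between Python’s two executor types - an illustrative sketch, not the actual fastai code:

from concurrent.futures import ThreadPoolExecutor

def load_item(i):
    return i * 2   # stand-in for real data-loading work

# A ProcessPoolExecutor forks worker processes, which can deadlock with some CUDA/docker setups;
# a ThreadPoolExecutor keeps workers in the parent process, sidestepping fork-related hangs.
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(load_item, range(8)))
print(results)   # [0, 2, 4, 6, 8, 10, 12, 14]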

We’re not seeing the problem on any other platform BTW, so it’s something specific to your config. Are you using docker? There are various threads around about docker and pytorch IIRC…
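
If it helps, a quick heuristic for checking whether an instance is running inside a docker container (rough, not definitive):

import os
# most docker containers ship a /.dockerenv marker file; its absence isn't conclusive
print(os.path.exists('/.dockerenv'))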

Is this problem specific to paperspace V100 GPUs?

I have been able to successfully run my planet classification code on a p2.xlarge instance. However, when I run it on a p3.xlarge instance - Volta V100 GPU - I get this problem. The kernel just freezes. The GPU memory is allocated, but the code doesn’t get executed.


Upon further investigation, it freezes at the construction of the learner object.

The same code works smoothly on a p2.xlarge instance though. Any thoughts on this?
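
One quick sanity check when comparing the two instance types is to print what pytorch actually sees (a minimal sketch):

import torch

print(torch.__version__)                # e.g. 0.2.0
print(torch.cuda.is_available())        # should be True on both p2 and p3
print(torch.cuda.get_device_name(0))    # 'Tesla K80' on a p2 vs 'Tesla V100-SXM2-16GB' on a p3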

Damn! It works now on V100 GPU.

Some points to note:

  1. Learner object creation takes a very long time on a V100 (p3) - in fact, more than 8-10 minutes. (On a p2 instance, it doesn’t take more than 5 seconds.)
  2. The fit function also takes time the first time it’s run. After that, it’s faaaaasttt!

When I set precompute=True, the speed improvement for 1 epoch on V100 is around 13%.
When I set precompute=False, the speed improvement for 1 epoch on V100 is around 63%.

Attached screenshots for reference.
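
For anyone reproducing the precompute comparison, toggling it in fastai looks roughly like this (a sketch assuming the lesson-style data object and architecture):

from fastai.conv_learner import *   # lesson-style import

learn = ConvLearner.pretrained(resnet34, data, precompute=True)   # data as in the lesson notebooks
learn.fit(0.01, 1)          # epoch using precomputed activations
learn.precompute = False
learn.fit(0.01, 1)          # epoch computing activations on the fly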

Give this a try? - http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/optimize_gpu.html

I have not tried them myself, so take an image of your machine before changing the settings.
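
From memory, the settings that page walks through boil down to a few nvidia-smi commands (the clock values shown are the K80/p2 ones - verify against the doc before running):

!sudo nvidia-smi -pm 1                     # enable persistence mode
!sudo nvidia-smi --auto-boost-default=0    # disable autoboost
!sudo nvidia-smi -ac 2505,875              # max application clocks (memory,graphics) for K80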

Here’s my very unscientific explanation for this - it looks like this may be an issue with PyTorch 0.2 under CUDA 9, and V100 machines take advantage of (i.e. require) CUDA 9.

The PyTorch 0.3 release checklist has CUDA 9 and CUDA 8 support.

Hopefully it will be available in the next few weeks - maybe wait until then to use V100 (P3) instances, and even Paperspace?

Alternatively, in this doc from NVIDIA, they reference a couple of pull requests to make PyTorch compatible with CUDA 9.

OK, I tried doing this on both the p2 and p3 instances. Unfortunately, I did not observe any significant change in runtime per epoch. The difference was less than 5s/it. :frowning: