Very slow loading of the Convnet pretrained model on lesson 1

We are not alone, I guess.

4 Likes

Any fix for this? I am having the same issue!

Try this (details in reply #20)

In your notebook, you can set num_workers to 0 to avoid spawning subprocesses, which sidesteps the need for locking.

Do this by passing num_workers=0 to the ImageClassifierData.from_paths() function:

ImageClassifierData.from_paths(PATH, bs=2, tfms=tfms, num_workers=0)

(at 3 to 4 places in the notebook)
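
For reference, here is roughly how one of those cells might look with the change in place (a sketch only, assuming the lesson-1 names PATH, sz, resnet34, tfms_from_model and ConvLearner.pretrained are already defined or imported as in the notebook):

from fastai.conv_learner import *  # lesson-1 style import

# With num_workers=0 no worker processes are spawned for data loading,
# so there is nothing left to deadlock on (at the cost of speed).
tfms = tfms_from_model(resnet34, sz)
data = ImageClassifierData.from_paths(PATH, bs=2, tfms=tfms, num_workers=0)
learn = ConvLearner.pretrained(resnet34, data, precompute=True)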

Let me know how it goes.

Yeah, I will try this.

Setting num_workers to 0 will indeed fix the problem, although you’ll find that it takes much longer to train.

I’d like to help fix this, but it’s a little hard since I can’t replicate it myself. For those having the problem, please try, one at a time to see what works, the ideas suggested in https://github.com/pytorch/pytorch/issues/1355:

  • sudo mount -o remount,size=8G /dev/shm (to increase shared memory; a quick way to verify the new size is sketched after this list)
  • Set num_workers=1 in ImageClassifierData.from_paths (this will be slower than setting it to 4, but hopefully still a lot better than 0)
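
To confirm the remount actually took effect, here is a quick check you can run from Python (a small sketch; assumes a Linux box with /dev/shm mounted):

import os

# Report the total size of the shared-memory mount; it should show
# roughly 8 GB after the remount command above.
st = os.statvfs('/dev/shm')
print('/dev/shm size: %.1f GB' % (st.f_frsize * st.f_blocks / 2**30))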

@anandsaha I added a lock a few days ago in dataset.py line 101. That adds a lock around any transforms that are applied with opencv - I can’t think of any other places we’d need a lock. But to be sure, you could add a lock around the whole of BaseDataset.__getitem__(), and see if that helps.
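
For anyone who wants to try that experiment, here is a rough illustration of where such a lock could go (a hypothetical sketch only, not the actual fastai dataset.py; the real class and its transform pipeline are more involved):

import threading

_item_lock = threading.Lock()  # one module-level lock shared by all dataset instances

class LockedDataset:
    # Toy stand-in for BaseDataset, just to show where the lock would sit.
    def __init__(self, items, transform=None):
        self.items = items
        self.transform = transform

    def __getitem__(self, i):
        # Serialize the whole fetch-and-transform step so the OpenCV
        # transforms never run concurrently across workers.
        with _item_lock:
            x = self.items[i]
            return self.transform(x) if self.transform else x

    def __len__(self):
        return len(self.items)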

Unfortunately increasing shared memory does not work for me. I am on my own box, using Ubuntu 16.04 with a GTX 1060 6GB (no docker).

What does get around the issue, however, is following the steps outlined by @metachi. No idea why though.

2 Likes

Yeah, yesterday it worked for me if I skipped that line from execution. Today I will try to narrow it down. My rig is a GTX 1070 8GB (no docker), Ubuntu 16.04.

1 Like

@apaszke and the PyTorch team have kindly offered to help debug. To do so, they need access to a machine that has the problem. Specifically, we’re looking here only at the problem where it’s getting a deadlock in multiprocessing/connection.py (which you can see in the stack trace if you interrupt the process when it freezes).
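
As an aside, one low-effort way to grab such a stack trace without killing the run is Python’s faulthandler module; a minimal sketch (Linux only), registered near the top of the notebook:

import faulthandler, signal

# When things appear frozen, run `kill -USR1 <kernel pid>` from a shell;
# the current stack trace of every thread in that process is then written to stderr.
faulthandler.register(signal.SIGUSR1)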

So if anyone can allow us to log in to your box that has this problem, could you please PM @apaszke and me and tell us:

  • Your public key
  • The username and IP to login to your box
  • The steps to replicate the problem

Before you do, please ensure you’ve set your shared memory to at least 8 GB as I posted above, restarted Jupyter, and confirmed you still have the problem.

@jeremy

I have a box that’s live and running at the moment. Feel free to shoot me an email and I can provide my credentials… my email is apil.tamang@gmail.com.

They should be able to SSH into it and do just about anything. Guess I can trust them with it…

@apil.tamang from your description it sounds like you have a different problem. We’re looking for examples that have the specific stack trace I mentioned, and where the problem is related to num_workers=4. I think that doesn’t match your situation, does it? If it does, could you post an image or link to a gist showing the stack trace and preceding code when you hit this issue?

@jeremy
Sure… I know I encounter the same symptom at two different locations: both during loading and during fitting. But let me try to show you that. Give me around 15 minutes though.

Here’s a screenshot of it failing. The stack trace is pretty deep, so I have it attached in a separate text file:

1. Screenshot of the IPython notebook

2. Text dump of the entire stack trace:
Process Process-10:
Process Process-9:
Traceback (most recent call last):
Traceback (most recent call last):
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 342, in get
res = self._reader.recv_bytes()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
Process Process-12:
Process Process-13:
Process Process-11:
Traceback (most recent call last):
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
Process Process-15:
Process Process-14:
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
Traceback (most recent call last):
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
Process Process-16:
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
Traceback (most recent call last):
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
KeyboardInterrupt
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
KeyboardInterrupt
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt

Finally, here are my device’s shared memory numbers. These are the system defaults.

@jeremy @apaszke
Feel free to reach out to me at "apil.tamang@gmail.com" and I can give you instructions on how to reach my box, including instructions to recreate the problem. You should be able to SSH into it and recreate/analyze it.

Thanks @apil.tamang. Does it happen with num_workers=0? If so, can you post the same screenshot and stack trace details for that case?

Will do… I’ll even restart my machine for good measure…

I have not tried with zero workers yet. If you need the machine after 6 PM MST, I will be available. Let me know if you need it.

Yep I can confirm the same thing for me. Following the directions from @metachi solves the issue for me as well.

Just to reiterate here: if I run only the first 5 cell blocks and then skip to the augmentation section, all of the code seems to run, including with num_workers=4. It only locks up for me when I first run the “quick start” code blocks and then proceed to run the augmentation code.

I also tried running the whole notebook all the way through with num_workers=0 and it works in that case, as per the suggestions from @anandsaha. But to get it working with num_workers=4, it is necessary to skip to augmentation directly after the first 5 code blocks. Can anyone else try this to verify?

2 Likes

@jeremy @apaszke

So, here’s the screenshot and the dump of the stack trace using zero workers…

Dump:
Process Process-16:
Process Process-9:
Process Process-14:
Process Process-15:
Process Process-12:
Process Process-10:
Process Process-13:
Process Process-11:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
r = index_queue.get()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
KeyboardInterrupt
KeyboardInterrupt
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 342, in get
res = self._reader.recv_bytes()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
with self._rlock:
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
KeyboardInterrupt
File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt

@anandsaha @jamesrequa @metachi do you have an instance we could log into to test this? Since @apil.tamang is getting a problem with num_workers=0 too, I don’t think he’s having the same problem, so we’re really looking for somewhere we can debug this directly.

1 Like

Same applies to me. I just did the same as James mentioned.

1 Like

I wonder if this is a CUDA version mismatch though. I know I’ve downloaded pretrained ResNet and VGG-16 models from PyTorch on my rig and used them to train on some independent datasets. Neither num_workers nor the initialization was a problem there.

Upon doing "conda list", I saw the following:
pytorch 0.2.0 py36hf0d2509_4cu75

which means this build was for CUDA 7.5. This is the default install using the provided environment.yml file. I think most of us are using CUDA 8.0 with cuDNN 6.x (at least I am). My other PyTorch conda environment is based on the CUDA 8.0 install, and serves as a good working reference.
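
If it helps anyone compare, here is a quick way to probe the installed build from Python (a small sketch; attribute availability can vary a little across these old releases):

import torch

print(torch.__version__)               # e.g. 0.2.0
print(torch.cuda.is_available())       # True if this build can actually see the GPU
print(torch.backends.cudnn.version())  # cuDNN version the build was linked against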

So I’m updating it at the moment… and will let you know what happens.

Lo and behold… I don’t see that problem any more!

UPDATE:

  • After updating to PyTorch for Python 3.6 and CUDA 8.0, the problems with slow convnet loading and with training using multiple workers are resolved. I’m training with 8 workers now!
  • There is still an issue with learn.fit(…) getting stuck at 0% after plotting all those images, so just keep a note of that.