Very slow loading of the Convnet pretrained model on lesson 1

@mindtrinket ditto my situation, although I will note that there does seem to be some GPU utilization: GPU memory usage stands at ~1 GB out of 11 GB.

Also, fellow fastai-ers, does the progress-bar (while calling learn.fit(…)) work for you on your personal rigs? It would sometimes work, and other times I’d get errors such as

“Failed to display Jupyter Widget of type HBox”, or
“Error rendering Jupyter widget. Widget not found: {"model_id": "99baac9d3685403f829927503a05699e" …}”, etc.

FYI, I did try to enable the widgetsnbextension, and played around with different versions of ipywidgets, tqdm etc.

Regarding the “precompute” part of a fit, I experienced exactly the same situation as @mindtrinket, both using Crestle and two local rigs (one with a GTX 1070 and another with a GTX Titan X). According to nvidia-smi, no GPU activity is displayed during the first 7-8 minutes; then the progress bar starts and progress is about 7 seconds per epoch.

@apil.tamang In my case the progress bar worked just fine.

I did some preliminary analysis, and this is what I have so far.

The random stalling is happening at:

fastai/model.py line 85:

84        t = tqdm(iter(data.trn_dl), leave=False)
85        for (*x,y) in t:
86            batch_num += 1

That’s when we ask for the next batch of data.

Now, data.trn_dl, which is supposed to get us the data, is an instance of ImageClassifierData(), which inherits from ModelData() and uses ModelDataLoader() to fetch the next batch of data. It uses multiple worker threads to do so (the default is 4).

There might be a lock missing in ModelDataLoader()'s __next__() to synchronise the workers.

I am putting in a lock to check.
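
To be concrete, this is roughly the change I want to test. The class below is a simplified stand-in, not the real fastai ModelDataLoader; the only point is the lock around __next__:

import threading

class ModelDataLoader:
    # Simplified stand-in for fastai's ModelDataLoader, just to show where a lock would go.
    def __init__(self, dl):
        self.dl = dl
        self.lock = threading.Lock()  # shared by whatever threads iterate this loader

    def __iter__(self):
        self.i, self.it = 0, iter(self.dl)
        return self

    def __next__(self):
        # Serialise access to the underlying iterator so two threads can't call
        # next(self.it) at the same time; this is the synchronisation I suspect is missing.
        with self.lock:
            if self.i >= len(self.dl): raise StopIteration
            self.i += 1
            return next(self.it)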

@Robi
I’ve experienced that too… total silence for the first 5-7 minutes, then my learn.fit(…) finishes in a microsecond, lol!!

As a matter of fact, it seems like I’m running into just about every possible glitch everyone has had with the notebook (plus a few :slight_smile: ).

Ok, so we are using torch’s torch.utils.data.DataLoader, which takes a parameter called num_workers to fetch data with multiple subprocesses.

num_workers (int, optional) – how many subprocesses to use for data loading.
 0 means that the data will be loaded in the main process (default: 0)

(the default set in the fastai lib is 4)
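
Just to make the num_workers behaviour concrete, here is a minimal plain-PyTorch example (the tensors are dummy data, nothing to do with the lesson notebook):

import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(64, 3), torch.zeros(64).long())  # dummy features and labels

# num_workers=0: batches are loaded in the main process, no subprocesses involved
dl_main = DataLoader(ds, batch_size=8, num_workers=0)

# num_workers=4: four worker subprocesses load batches in parallel,
# which is the code path where the stall is showing up
dl_workers = DataLoader(ds, batch_size=8, num_workers=4)

for x, y in dl_main:
    pass  # iterate normally; swap in dl_workers to exercise the multiprocess path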

The stalling is happening when we try to get the next batch from the DataLoader’s iterator:

fastai/dataset.py line 218


215    def __next__(self):
216        if self.i>=len(self.dl): raise StopIteration
217        self.i+=1
218        return next(self.it)

It just might be a bug in torch.utils.data.DataLoader.
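
If anyone wants to check whether the hang reproduces outside fastai, a bare-bones test along these lines might help (DummyDataset is made up purely for illustration). If this loop also stalls with num_workers=4, the bug is likely in DataLoader itself rather than in our wrappers:

import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    # Tiny fake dataset so the test doesn't depend on the dogs/cats images.
    def __len__(self):
        return 1000
    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 2

if __name__ == '__main__':
    dl = DataLoader(DummyDataset(), batch_size=64, num_workers=4)
    # If DataLoader is at fault, this loop should stall the same way the notebook does.
    for i, (x, y) in enumerate(dl):
        print(i, x.shape)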

Temp Solution (I think this should work for now)

In your notebook, you can set num_workers to 0 to avoid spawning subprocesses, which circumvents the need for a lock.

Do this by passing num_workers=0 to ImageClassifierData.from_paths() function:

ImageClassifierData.from_paths(PATH, bs=2, tfms=tfms, num_workers=0)

(at 3 to 4 places in the notebook)

It should not lock now (I tried twice after this change and it did not stall/lock). Let me know if this solves it temporarily.

@jeremy this should be okay for the time being, right? Or is setting num_workers to 4 important?

In the meantime, we will need to see what’s happening with data fetching.

We are not alone I guess.

Any fix for this? I am having the same issue!

Try this (details in reply #20)

In your notebook, you can set num_workers to 0 to avoid spawning subprocesses, which circumvents the need for a lock.

Do this by passing num_workers=0 to ImageClassifierData.from_paths() function:

ImageClassifierData.from_paths(PATH, bs=2, tfms=tfms, num_workers=0)

(at 3 to 4 places in the notebook)

Let me know how it goes.

Yeah, I will try this.

Setting num_workers to 0 will indeed fix the problem, although you’ll find that it takes much longer to train.

I’d like to help fix this, but it’s a little hard since I can’t replicate it myself. For those having the problem, please try the ideas suggested in https://github.com/pytorch/pytorch/issues/1355, one at a time so we can see what works:

  • sudo mount -o remount,size=8G /dev/shm (to increase shared memory; a quick check of the new size is sketched right after this list)
  • Set num_workers=1 in ImageClassifierData.from_paths (this will be slower than setting it to 4, but hopefully still a lot better than 0)
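
If it helps, here is a quick way to confirm the remount actually took effect (shutil.disk_usage is standard library; the 8 GB threshold is just the size from the mount command above):

import shutil

# Query the tmpfs that PyTorch's worker processes use for shared memory.
total, used, free = shutil.disk_usage('/dev/shm')
print('/dev/shm total: %.1f GiB, free: %.1f GiB' % (total / 2**30, free / 2**30))

# After the remount you would expect total to be about 8 GiB rather than the
# default (usually half of physical RAM).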

@anandsaha I added a lock a few days ago in dataset.py line 101. That adds a lock around any transforms that are applied with opencv - I can’t think of any other places we’d need a lock. But to be sure, you could add a lock around the whole of BaseDataset.__getitem__(), and see if that helps.
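
To make that concrete, something like the sketch below is what I mean. It is a cut-down stand-in, not the actual BaseDataset; the only point is serialising the whole of __getitem__:

import threading

_getitem_lock = threading.Lock()  # one process-wide lock, for this experiment only

class BaseDataset:
    # Cut-down stand-in for fastai's BaseDataset, just to show where the lock goes.
    def __init__(self, transform=None):
        self.transform = transform

    def get_x(self, i): raise NotImplementedError
    def get_y(self, i): raise NotImplementedError

    def __getitem__(self, i):
        # Serialise the entire item fetch and transform, not just the opencv call,
        # to rule out any remaining thread-safety issue in this path.
        with _getitem_lock:
            x, y = self.get_x(i), self.get_y(i)
            return (self.transform(x) if self.transform else x), y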

Unfortunately increasing shared memory does not work for me. I am on my own box, using Ubuntu 16.04 with a GTX 1060 6GB (no docker).

What does get around the issue, however, is following the steps outlined by @metachi. No idea why though.

Yeah, yesterday it worked for me if I skipped that line from execution. Today I will try to narrow it down. My rig is a GTX 1070 8GB (no docker), Ubuntu 16.04.

@apaszke and the PyTorch team have kindly offered to help debug. To do so, they need access to a machine that has the problem. Specifically, we’re looking here only at the problem where it’s getting a deadlock in multiprocessing/connection.py (which you can see in the stack trace if you interrupt the process when it freezes).

So if anyone can allow us to log in to your box that has this problem, could you please PM @apaszke and me and tell us:

  • Your public key
  • The username and IP to login to your box
  • The steps to replicate the problem

Before you do, please ensure you’ve set your shared memory to at least 8 GB as I posted above, restarted Jupyter, and confirmed you still have the problem.

@jeremy

I have a box that’s live and running at the moment. Feel free to shoot me an email, and I can provide my credentials… my email is apil.tamang@gmail.com

They should be able to ssh into it and do just about anything. I guess I can trust them with it…

@apil.tamang from your description it sounds like you have a different problem. We’re looking for examples that have the specific stack trace I mentioned, and where the problem is related to num_workers=4. I think that doesn’t match your situation, does it? If it does, could you post an image or link to a gist showing the stack trace and preceding code when you hit this issue?

@jeremy
Sure… I know I encounter the same symptom at two different locations: both during loading and during fitting. But let me try to show you that. Give me around 15 minutes though.

Here’s a screenshot of it failing. The stack trace is pretty deep… so I have it attached in a separate text file:

1. Screenshot of the IPython notebook

2. Text dump of the entire stack trace:
Process Process-10:
Process Process-9:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
    r = index_queue.get()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 342, in get
    res = self._reader.recv_bytes()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
    r = index_queue.get()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
    with self._rlock:
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
Process Process-12:
Process Process-13:
Process Process-11:
Traceback (most recent call last):
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
Process Process-15:
Process Process-14:
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
    r = index_queue.get()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
    with self._rlock:
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
    r = index_queue.get()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
    r = index_queue.get()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
    r = index_queue.get()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
    with self._rlock:
Process Process-16:
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
    with self._rlock:
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
    r = index_queue.get()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
    with self._rlock:
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
Traceback (most recent call last):
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
    with self._rlock:
KeyboardInterrupt
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 34, in _worker_loop
    r = index_queue.get()
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in get
    with self._rlock:
  File "/home/apil/anaconda3/envs/fastai/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt

**Finally, here are my device shared memory numbers.** These are the system defaults.

@jeremy @apaszke
Feel free to reach out to me at apil.tamang@gmail.com and I can give you instructions on how to reach my box, including steps to recreate the problem. You should be able to ssh into it and recreate/analyze it.

Thanks @apil.tamang. Does it happen with num_workers=0? If so, can you post the same screenshot and stack trace details for that case?

Will do… I’ll even restart my machine for good measure…

I have not tried with zero workers yet. I will be available after 6 PM MST if you need the machine; let me know.