Very slow loading of the Convnet pretrained model on lesson 1

Hi Jeremy,

I am having a similar locking issue using the most recent version of the repo with the lesson 1 notebook. I am using my own DL box with a GTX 1070 running Ubuntu 16.04.

For me everything in the notebook runs great as long as I have learn.precompute=True, but once I try to run learn.fit(1e-2, 3, cycle_len=1) with learn.precompute=False it locks up. The traceback is below.

Any help you can provide is much appreciated!!

> ---------------------------------------------------------------------------
> KeyboardInterrupt                         Traceback (most recent call last)
> <ipython-input-40-5057ed0f08f5> in <module>()
> ----> 1 learn.fit(1e-2, n_cycle=3, cycle_len=1)
> 
> /home/james/fastai/courses/dl1/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
>  98         self.sched = None
>  99         layer_opt = self.get_layer_opt(lrs, wds)
> --> 100         self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
> 101 
> 102     def lr_find(self, start_lr=1e-5, end_lr=10, wds=None):
> 
> /home/james/fastai/courses/dl1/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, metrics, callbacks, **kwargs)
>  88         n_epoch = sum_geom(cycle_len if cycle_len else 1, cycle_mult, n_cycle)
>  89         fit(model, data, n_epoch, layer_opt.opt, self.crit,
> ---> 90             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, **kwargs)
>  91 
>  92     def get_layer_groups(self): return self.models.get_layer_groups()
> 
> /home/james/fastai/courses/dl1/fastai/model.py in fit(model, data, epochs, opt, crit, metrics, callbacks, **kwargs)
>  83         stepper.reset(True)
>  84         t = tqdm(iter(data.trn_dl), leave=False)
> ---> 85         for (*x,y) in t:
>  86             batch_num += 1
>  87             loss = stepper.step(V(x),V(y))
> 
> /home/james/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tqdm/_tqdm.py in __iter__(self)
> 870 """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
> 871 
> --> 872             for obj in iterable:
> 873                 yield obj
> 874                 # Update and print the progressbar.
> 
> /home/james/fastai/courses/dl1/fastai/dataset.py in __next__(self)
> 219         if self.i>=len(self.dl): raise StopIteration
> 220         self.i+=1
> --> 221         return next(self.it)
> 222 
> 223     @property
> 
> /home/james/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
> 193         while True:
> 194             assert (not self.shutdown and self.batches_outstanding > 0)
> --> 195             idx, batch = self.data_queue.get()
> 196             self.batches_outstanding -= 1
> 197             if idx != self.rcvd_idx:
> 
> /home/james/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/queues.py in get(self)
> 341     def get(self):
> 342         with self._rlock:
> --> 343             res = self._reader.recv_bytes()
> 344         # unserialize the data after having released the lock
> 345         return _ForkingPickler.loads(res)
> 
> /home/james/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/connection.py in recv_bytes(self, maxlength)
> 214         if maxlength is not None and maxlength < 0:
> 215             raise ValueError("negative maxlength")
> --> 216         buf = self._recv_bytes(maxlength)
> 217         if buf is None:
> 218             self._bad_message_length()
> 
> /home/james/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/connection.py in _recv_bytes(self, maxsize)
> 405 
> 406     def _recv_bytes(self, maxsize=None):
> --> 407         buf = self._recv(4)
> 408         size, = struct.unpack("!i", buf.getvalue())
> 409         if maxsize is not None and size > maxsize:
> 
> /home/james/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/connection.py in _recv(self, size, read)
> 377         remaining = size
> 378         while remaining > 0:
> --> 379             chunk = read(handle, remaining)
> 380             n = len(chunk)
> 381             if n == 0:
> 
> KeyboardInterrupt:

Hmmm OK I’m not sure why you’re seeing this… I had seen it a few times but adding a lock inside fastai.dataset seemed to fix it for me.

Searching the Pytorch forums I see that others have seen this problem when using Docker. Are you guys using Docker?

One thing that helped me avoid this on Crestle (which runs on Docker) was to add time.sleep(2) before each method that used the GPU.
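Something roughly like this, for example (the exact cells in your notebook may differ; the point is just the time.sleep(2) before each GPU-using call):

import time

time.sleep(2)  # give things a moment to settle before touching the GPU
learn = ConvLearner.pretrained(resnet34, data, precompute=True)

time.sleep(2)
learn.fit(1e-2, 3)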

Restarting the AWS instance fixes the first run of ConvLearner.pretrained(resnet34, data, precompute=True), but subsequent runs still have issues. Crestle works fine with multiple runs.

What if you add the time.sleep suggestion on AWS? Is this on a p2?

In my case I’m not using Docker. So it doesn’t seem like this issue is only related to Docker users.

I had the same problem on my DL rig.

The first time I ran the code it had to download the resnet34 model to the PyTorch models folder. I thought it was taking too long and killed it before I should have. The failed download then froze the function every time. My fix was to go to the PyTorch models folder (~/.torch/models) and remove resnet34 so it would download again while I got coffee.
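If you'd rather do this from inside the notebook, a quick sketch along these lines should work (the wildcard is there because I don't remember the exact filename of the cached weights):

import glob, os

# delete the (possibly partially downloaded) resnet34 weights so they get downloaded again
for f in glob.glob(os.path.expanduser('~/.torch/models/resnet34*')):
    os.remove(f)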

However, like others, I am still confused about the precompute option. Whether it is True or False, it runs for about 6 minutes. I am surprised there is no GPU utilization (per nvidia-smi) during this time; GPU usage does go up to 90% during the fit.

Then the “20 second” function takes 3.5 minutes.

1 Like

@mindtrinket ditto on my end, although I will tell you that there does seem to be some GPU utilization: GPU memory usage stands at ~1 GB of 11 GB.

Also, fellow fastai-ers, does the progress bar (while calling learn.fit(…)) work for you on your personal rigs? It would sometimes work, and other times I’d get errors such as:

“Failed to display Jupyter Widget of type HBox”, or
“Error rendering Jupyter widget. Widget not found: {"model_id": "99baac9d3685403f829927503a05699e" …}”, etc.

FYI, I did try enabling the widgetsnbextension and playing around with different versions of ipywidgets, tqdm, etc.

Regarding the “precompute” part of a fit, I experienced exactly the same situation as @mindtrinket, both on Crestle and on two local rigs (one with a GTX 1070, the other with a GTX Titan X). According to nvidia-smi there is no GPU activity during the first 7-8 minutes; then the progress bar starts and progress is about 7 seconds per epoch.

@apil.tamang In my case the progress bar worked just fine.

1 Like

Did some preliminary analysis and this is what I got so far.

The random stalling is happening at:

fastai/model.py line 85:

84        t = tqdm(iter(data.trn_dl), leave=False)
85        for (*x,y) in t:
86            batch_num += 1

That’s when we ask for the next batch of data.

Now, data.trn_dl, which is supposed to get us the data, is an instance of ImageClassifierData(), which inherits from ModelData() and uses ModelDataLoader() to fetch the next batch of data. It uses multiple workers to do so (the default is 4).

There might be a lock missing in ModelDataLoader.__next__() to synchronise the workers.

I am putting a lock to check.
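Roughly the change I'm testing (just a sketch: the lock placement mirrors the __next__ shown in the traceback above, and whether a plain threading lock is the right tool here is exactly what I want to find out):

import threading

class ModelDataLoader:
    # ... __init__ / __iter__ etc. unchanged ...
    lock = threading.Lock()  # shared by all instances

    def __next__(self):
        if self.i >= len(self.dl): raise StopIteration
        self.i += 1
        with ModelDataLoader.lock:  # only one caller asks the underlying iterator for a batch at a time
            return next(self.it)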

4 Likes

@Robi
I’ve experienced that too… total silence for the first 5-7 minutes, then my learn.fit(…) finishes in a microsecond, lol!!

As a matter of fact, it seems like I’m running into just about every possible glitch everyone else has had with the notebook (plus a few :slight_smile: ).

1 Like

OK, so we are using torch’s torch.utils.data.DataLoader, which takes a parameter called num_workers to fetch data with multiple subprocesses.

num_workers (int, optional) – how many subprocesses to use for data loading.
 0 means that the data will be loaded in the main process (default: 0)

(the default set in the fastai lib is 4)

The stalling happens when we try to get the next batch from the DataLoader iterator:

fastai/dataset.py line 218


215    def __next__(self):
216        if self.i>=len(self.dl): raise StopIteration
217        self.i+=1
218        return next(self.it)

It just might be a bug in torch.utils.data.DataLoader

Temp Solution (I think this should work for now)

In your notebook, you can set num_workers to 0, which avoids spawning subprocesses and circumvents the need for a lock.

Do this by passing num_workers=0 to the ImageClassifierData.from_paths() function:

ImageClassifierData.from_paths(PATH, bs=2, tfms=tfms, num_workers=0)

(at 3 to 4 places in the notebook)

It should not lock now (I tried twice after this change and it did not stall/lock). Let me know if this works as a temporary fix.

@jeremy this should be okay for the meantime, right? Or is setting num_workers to 4 important?

In the meantime, we will need to see what’s happening with data fetching.

3 Likes

We are not alone I guess.

4 Likes

Any fix for this? I am having the same issue!

Try this (details in reply #20)

In your notebook, you can set num_workers to 0, which avoids spawning subprocesses and circumvents the need for a lock.

Do this by passing num_workers=0 to the ImageClassifierData.from_paths() function:

ImageClassifierData.from_paths(PATH, bs=2, tfms=tfms, num_workers=0)

(at 3 to 4 places in the notebook)

Let me know how it goes.

Yeah, I will try this.

Setting num_workers to 0 will indeed fix the problem, although you’ll find that it takes much longer to train.

I’d like to help fix this, but it’s a little hard since I can’t replicate it myself. For those having the problem, please try (one at a time, to see what works) the ideas suggested in https://github.com/pytorch/pytorch/issues/1355 :

  • sudo mount -o remount,size=8G /dev/shm (to increase shared memory; see the sketch after this list if you want to check the size from the notebook)
  • Set num_workers=1 in ImageClassifierData.from_paths (this will be slower than setting it to 4, but hopefully still a lot better than 0)
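To confirm the shared-memory size from inside the notebook before and after remounting, something like this works (using os.statvfs is just my suggestion, not something from the PyTorch issue):

import os

st = os.statvfs('/dev/shm')
print(f"/dev/shm size: {st.f_frsize * st.f_blocks / 2**30:.1f} GiB")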

@anandsaha I added a lock a few days ago in dataset.py line 101. That adds a lock around any transforms that are applied with opencv - I can’t think of any other places we’d need a lock. But to be sure, you could add a lock around the whole of BaseDataset.__getitem__(), and see if that helps.
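If it helps, here’s a quick way to try that from the notebook without editing the library, as a monkey-patch sketch (the lock and function name here are mine, and whether a threading lock is sufficient in this case is exactly what we’d be testing):

import threading
from fastai.dataset import BaseDataset

_item_lock = threading.Lock()
_orig_getitem = BaseDataset.__getitem__

def _locked_getitem(self, idx):
    # serialise the whole item fetch (file read + opencv transforms) behind one lock
    with _item_lock:
        return _orig_getitem(self, idx)

BaseDataset.__getitem__ = _locked_getitem  # patch just for this experiment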

Unfortunately, increasing shared memory does not work for me. I am on my own box, running Ubuntu 16.04 with a GTX 1060 6GB (no Docker).

What does get around the issue, however, is following the steps outlined by @metachi. No idea why though.

2 Likes

Yeah, yesterday it worked for me if I skipped that line from execution. Today I will try to narrow it down. My rig is a GTX 1070 8GB (no Docker), Ubuntu 16.04.

1 Like

@apaszke and the PyTorch team have kindly offered to help debug. To do so, they need access to a machine that has the problem. Specifically, we’re looking here only at the problem where it gets a deadlock in multiprocessing/connection.py (which you can see in the stack trace if you interrupt the process when it freezes).

So if anyone can allow us to log in to your box that has this problem, could you please PM @apaszke and me and tell us:

  • Your public key
  • The username and IP to login to your box
  • The steps to replicate the problem

Before you do, please ensure you’ve set your shared memory to at least 8GB as I posted above, have restarted Jupyter, and have confirmed you still have the problem.

@jeremy

I have a box that’s live and running at the moment. Feel free to shoot me an email and I can provide my credentials… my email is apil.tamang@gmail.com

They should be able to SSH into it and just about do anything. Guess I can trust them with it…