Very slow loading of the Convnet pretrained model on lesson 1

This is on a personal deep learning box…

For some reason, this line:
learn = ConvLearner.pretrained(resnet34, data, precompute=False)
seems to load very (very) slowly every time I first start the Jupyter notebook. It also doesn’t matter whether I set precompute=True or False. I think it may be trying to download the weights from the internet… but I can’t be sure. Has anyone else faced this issue?

Also, another problem I’m facing is that this line:
learn.fit(1e-2, 3, cycle_len=1)
took forever to finish. It didn’t help that the interactive tqdm widget failed to load for me! I spent a lot of time trying to figure out why that was the case, and I don’t think I’m any closer.

And I’m sorry if some of you see this as a second post. I could’ve sworn I had posted this question, but I don’t see it in the forums at all!!

Just one of those days that nothing (nothing) seems to work!

2 Likes

I think you’ve asked that question here before. :slight_smile:

Try changing PATH to "data/dogscats/sample/" and see if that executes successfully. It should come back quickly (perhaps with a failure), since the sample dir only has a few images.
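
For example, a rough sketch of the cells to re-run (sz and the imports are assumed from earlier in the notebook):

PATH = "data/dogscats/sample/"   # tiny sample set, so it finishes (or fails) quickly
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(resnet34, sz))
learn = ConvLearner.pretrained(resnet34, data, precompute=False)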

Most likely you are running on the CPU rather than the GPU, so with the sample data it should at least fail fast.

Relatedly, I noticed that the fastai library has hardcoded CUDA operations, so it probably only works on a GPU.
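
A quick way to check from plain PyTorch whether a GPU is actually visible (nothing fastai-specific here):

import torch

print(torch.cuda.is_available())          # False means everything will run on the CPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # shows which card PyTorch is using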

Hey Apil, you can always run the command nvidia-smi to get an idea of what’s going on with your GPU.

I think we have similar specs (1080 Ti), and I started hitting similar issues when I ran that particular fit function after running all the previous cells in the notebook. It might be a memory problem. I was able to get it to work by restarting the kernel, running the first 5 cells (all the imports and the global variable initialization), and then skipping down and running these next:
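
Roughly something like the following (the exact cells may differ in your copy of the notebook; arch, PATH and sz are assumed to come from the earlier cells):

# hypothetical reconstruction; the exact cells may differ
arch = resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(1e-2, 3, cycle_len=1)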


2 Likes

Be sure to git pull. An earlier version from a few days ago had a locking bug.

If you still get this problem after pulling from git and restarting your kernel, click Kernel->Interrupt, and you should get a stack trace. Post that here so we can debug.

1 Like

I’m getting the same issue with the latest version of the repo. Here is the stack trace (I added some print statements to do some debugging):

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-70-1bd2303a8a67> in <module>()
----> 1 learn = ConvLearner(data, model, precompute=True)

~/fastai/courses/dl1/fastai/conv_learner.py in __init__(self, data, models, precompute, **kwargs)
     80         elif self.metrics is None:
     81             self.metrics = [accuracy_multi] if self.data.is_multi else [accuracy]
---> 82         if precompute: self.save_fc1()
     83         print("Freezing")
     84         self.freeze()

~/fastai/courses/dl1/fastai/conv_learner.py in save_fc1(self)
    124             m=self.models.top_model
    125             print("predict_to_bcolz(m, self.data.fix_dl, act)")
--> 126             predict_to_bcolz(m, self.data.fix_dl, act)
    127             print("predict_to_bcolz(m, self.data.val_dl, val_act)")
    128             predict_to_bcolz(m, self.data.val_dl, val_act)

~/fastai/courses/dl1/fastai/model.py in predict_to_bcolz(m, gen, arr, workers)
     17     m.eval()
     18     print("finished m.eval")
---> 19     for x,*_ in tqdm(gen):
     20         print("doing stuff in a loop")
     21         y = to_np(m(VV(x)).data)

~/anaconda2/envs/fastai/lib/python3.6/site-packages/tqdm/_tqdm.py in __iter__(self)
    870 """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
    871 
--> 872             for obj in iterable:
    873                 yield obj
    874                 # Update and print the progressbar.

~/fastai/courses/dl1/fastai/dataset.py in __next__(self)
    219         if self.i>=len(self.dl): raise StopIteration
    220         self.i+=1
--> 221         return next(self.it)
    222 
    223     @property

~/anaconda2/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    193         while True:
    194             assert (not self.shutdown and self.batches_outstanding > 0)
--> 195             idx, batch = self.data_queue.get()
    196             self.batches_outstanding -= 1
    197             if idx != self.rcvd_idx:

~/anaconda2/envs/fastai/lib/python3.6/multiprocessing/queues.py in get(self)
    340     def get(self):
    341         with self._rlock:
--> 342             res = self._reader.recv_bytes()
    343         # unserialize the data after having released the lock
    344         return _ForkingPickler.loads(res)

~/anaconda2/envs/fastai/lib/python3.6/multiprocessing/connection.py in recv_bytes(self, maxlength)
    214         if maxlength is not None and maxlength < 0:
    215             raise ValueError("negative maxlength")
--> 216         buf = self._recv_bytes(maxlength)
    217         if buf is None:
    218             self._bad_message_length()

~/anaconda2/envs/fastai/lib/python3.6/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    405 
    406     def _recv_bytes(self, maxsize=None):
--> 407         buf = self._recv(4)
    408         size, = struct.unpack("!i", buf.getvalue())
    409         if maxsize is not None and size > maxsize:

~/anaconda2/envs/fastai/lib/python3.6/multiprocessing/connection.py in _recv(self, size, read)
    377         remaining = size
    378         while remaining > 0:
--> 379             chunk = read(handle, remaining)
    380             n = len(chunk)
    381             if n == 0:

Are you running this on Crestle? If not, can you see if it works OK for you there?

Crestle is giving me a cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:66, which is why I switched to my own AWS box.

Additionally:

data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(resnet34, sz))
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.fit(0.01, 1)

worked for me the first time I ran it, but it is failing on repeat runs.

On Crestle, it sounds like you need to restart your notebook. And try rebooting your AWS instance.

Hi Jeremy,

I am having a similar locking issue with the most recent version of the repo and the lesson 1 notebook. I am using my own DL box with a GTX 1070 running Ubuntu 16.04.

For me, everything in the notebook runs great as long as I have learn.precompute=True, but once I try to run learn.fit(1e-2, 3, cycle_len=1) with learn.precompute=False it locks up. The traceback is below.

Any help you can provide is much appreciated!!

> ---------------------------------------------------------------------------
> KeyboardInterrupt                         Traceback (most recent call last)
> <ipython-input-40-5057ed0f08f5> in <module>()
> ----> 1 learn.fit(1e-2, n_cycle=3, cycle_len=1)
> 
> /home/james/fastai/courses/dl1/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
>  98         self.sched = None
>  99         layer_opt = self.get_layer_opt(lrs, wds)
> --> 100         self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
> 101 
> 102     def lr_find(self, start_lr=1e-5, end_lr=10, wds=None):
> 
> /home/james/fastai/courses/dl1/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, metrics, callbacks, **kwargs)
>  88         n_epoch = sum_geom(cycle_len if cycle_len else 1, cycle_mult, n_cycle)
>  89         fit(model, data, n_epoch, layer_opt.opt, self.crit,
> ---> 90             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, **kwargs)
>  91 
>  92     def get_layer_groups(self): return self.models.get_layer_groups()
> 
> /home/james/fastai/courses/dl1/fastai/model.py in fit(model, data, epochs, opt, crit, metrics, callbacks, **kwargs)
>  83         stepper.reset(True)
>  84         t = tqdm(iter(data.trn_dl), leave=False)
> ---> 85         for (*x,y) in t:
>  86             batch_num += 1
>  87             loss = stepper.step(V(x),V(y))
> 
> /home/james/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tqdm/_tqdm.py in __iter__(self)
> 870 """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
> 871 
> --> 872             for obj in iterable:
> 873                 yield obj
> 874                 # Update and print the progressbar.
> 
> /home/james/fastai/courses/dl1/fastai/dataset.py in __next__(self)
> 219         if self.i>=len(self.dl): raise StopIteration
> 220         self.i+=1
> --> 221         return next(self.it)
> 222 
> 223     @property
> 
> /home/james/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
> 193         while True:
> 194             assert (not self.shutdown and self.batches_outstanding > 0)
> --> 195             idx, batch = self.data_queue.get()
> 196             self.batches_outstanding -= 1
> 197             if idx != self.rcvd_idx:
> 
> /home/james/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/queues.py in get(self)
> 341     def get(self):
> 342         with self._rlock:
> --> 343             res = self._reader.recv_bytes()
> 344         # unserialize the data after having released the lock
> 345         return _ForkingPickler.loads(res)
> 
> /home/james/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/connection.py in recv_bytes(self, maxlength)
> 214         if maxlength is not None and maxlength < 0:
> 215             raise ValueError("negative maxlength")
> --> 216         buf = self._recv_bytes(maxlength)
> 217         if buf is None:
> 218             self._bad_message_length()
> 
> /home/james/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/connection.py in _recv_bytes(self, maxsize)
> 405 
> 406     def _recv_bytes(self, maxsize=None):
> --> 407         buf = self._recv(4)
> 408         size, = struct.unpack("!i", buf.getvalue())
> 409         if maxsize is not None and size > maxsize:
> 
> /home/james/anaconda3/envs/tensorflow/lib/python3.6/multiprocessing/connection.py in _recv(self, size, read)
> 377         remaining = size
> 378         while remaining > 0:
> --> 379             chunk = read(handle, remaining)
> 380             n = len(chunk)
> 381             if n == 0:
> 
> KeyboardInterrupt:

Hmmm OK I’m not sure why you’re seeing this… I had seen it a few times but adding a lock inside fastai.dataset seemed to fix it for me.

Searching the Pytorch forums I see that others have seen this problem when using Docker. Are you guys using Docker?

One thing that helped me avoid this on Crestle (which runs on Docker) was to add time.sleep(2) before each method that uses the GPU.
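
For example, a sketch assuming the usual lesson 1 cells (exact placement depends on which cells hit the GPU):

import time

time.sleep(2)   # give the previous GPU work a moment to settle
learn = ConvLearner.pretrained(resnet34, data, precompute=True)

time.sleep(2)
learn.fit(1e-2, 3, cycle_len=1)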

Restarting the AWS instance fixes the first run of ConvLearner.pretrained(resnet34, data, precompute=True), but subsequent runs still have issues. Crestle works fine with multiple runs.

What if you add the time.sleep suggestion in AWS? Is this on a p2?

In my case I’m not using Docker. So it doesn’t seem like this issue is only related to Docker users.

I had the same problem on my DL rig.

The first time I ran the code it had to download the resnet34 model to the PyTorch models folder. I thought it was taking too long and killed it before I should have, and the failed download then froze the function every time. My fix was going to the PyTorch models folder (~/.torch/models) and removing resnet34 so it would download again while I got coffee.
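
If you want to script that cleanup, here is a minimal sketch assuming the default ~/.torch/models cache location:

from pathlib import Path

# remove any (possibly partial) resnet34 download so torchvision re-fetches it
models_dir = Path.home() / ".torch" / "models"
for f in models_dir.glob("resnet34*"):
    print("removing", f)
    f.unlink()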

However, like others, I am still confused about the precompute argument. Whether it’s True or False, it runs for 6 minutes, and I am surprised there is no GPU utilization (per nvidia-smi) during this time. The GPU does go up to 90% during the fit.

Then the “20 second” function takes 3.5 minutes.

1 Like

@mindtrinket same here, although I will say there does seem to be some GPU utilization: GPU memory sits at ~1 GB out of 11 GB.

Also, fellow fastai-ers, does the progress bar (while calling learn.fit(…)) work for you on your personal rigs? It would sometimes work, and other times I’d get errors such as:

"Failed to display Jupyter Widget of type HBox", or
"Error rendering Jupyter widget. Widget not found: {"model_id": "99baac9d3685403f829927503a05699e"}", etc.

FYI, I did try to enable the widgetsnbextension, and played around with different versions of ipywidgets, tqdm etc.

Regarding the "precompute" part of a fit, I experienced exactly the same situation as @mindtrinket, both on Crestle and on two local rigs (one with a GTX 1070 and another with a GTX Titan X). According to nvidia-smi, no GPU activity is displayed during the first 7-8 minutes; then the progress bar starts and progress is about 7 seconds per epoch.

@apil.tamang In my case the progress bar worked just fine.

1 Like

I did some preliminary analysis, and this is what I have so far.

The random stalling is happening at:

fastai/model.py line 85:

84        t = tqdm(iter(data.trn_dl), leave=False)
85        for (*x,y) in t:
86            batch_num += 1

That’s when we ask for the next batch of data.

Now, data.trn_dl, which is supposed to get us the data, comes from ImageClassifierData() (which inherits from ModelData()) and is a ModelDataLoader() that fetches the next batch of data. It uses multiple workers to do so (the default is 4).

There might be a lock missing in ModelDataLoader() in __next__() to synchronise the workers.

I am adding a lock to check.
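
For reference, a minimal sketch of the kind of change I mean (LockedModelDataLoader is just an illustrative stand-in, not the actual fastai class):

import threading

class LockedModelDataLoader:
    """Illustrative stand-in for fastai's ModelDataLoader, with a lock in __next__."""
    lock = threading.Lock()              # shared lock guarding the underlying iterator

    def __init__(self, dl):
        self.dl = dl                     # the underlying torch DataLoader
        self.it = iter(dl)
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.i >= len(self.dl): raise StopIteration
        self.i += 1
        with self.lock:                  # serialise access to next(self.it)
            return next(self.it)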

4 Likes

@Robi
I’ve experience that too… total silence for the first 5-7 minutes, then my learn.fit(…) finishes in a microsecond, lol !!

As a matter of fact, it seems like I’m running into just about every possible glitches everyone has had with the notebook (plus a few :slight_smile: ).

1 Like

OK, so we are using torch’s torch.utils.data.DataLoader, which takes a parameter called num_workers to fetch data with multiple subprocesses.

num_workers (int, optional) – how many subprocesses to use for data loading.
 0 means that the data will be loaded in the main process (default: 0)

(the default set in the fastai lib is 4)

The stalling happens when we try to get the next batch from the DataLoader iterator.

fastai/dataset.py line 218


215    def __next__(self):
216        if self.i>=len(self.dl): raise StopIteration
217        self.i+=1
218        return next(self.it)

It might just be a bug in torch.utils.data.DataLoader.

Temp Solution (I think this should work for now)

In your notebook, you can set num_workers to 0 to avoid spawning subprocesses, which circumvents the need for a lock.

Do this by passing num_workers=0 to the ImageClassifierData.from_paths() function:

ImageClassifierData.from_paths(PATH, bs=2, tfms=tfms, num_workers=0)

(at 3 to 4 places in the notebook)
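
In context, the relevant cells would look roughly like this (assuming the usual imports and sz):

data = ImageClassifierData.from_paths(
    PATH, tfms=tfms_from_model(resnet34, sz), num_workers=0)   # load in the main process
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.fit(1e-2, 3, cycle_len=1)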

It should not lock for now (I tried twice after this change and it did not stall/lock). Let me know if this works as a temporary fix.

@jeremy this should be okay for the meantime, right? Or is setting num_workers to 4 important?

In the meantime, we will need to see what’s happening with data fetching.

3 Likes