Very slow loading of the Convnet pretrained model on lesson 1

Have you done conda env update? What’s your machine/OS/etc? What do you see if you display learner.activations before fitting?

Also does git status show you have any local changes? If so, try reverting them.

I have a GTX 1070 running on Ubuntu 16.04, CUDA 8.0.
No, I didn’t run conda env update.
I discarded any local changes before doing the git pull.
Below is my output from learner.activations:

[carray((23000, 1024), float32)
   nbytes: 89.84 MB; cbytes: 93.01 MB; ratio: 0.97
   cparams := cparams(clevel=5, shuffle=1, cname='blosclz')
   rootdir := 'data/dogscats/tmp/x_act_resnet34_0_224.bc'
   mode    := 'a'
 [[  2.44967175   2.99363661   4.33913231 ...,   0.13268621   2.84894609
  [  4.21488714   5.84484339  11.52380562 ...,   0.25312638   1.17774951
  [  0.79310578   3.97656846   9.34537125 ...,   0.56532079   1.67562604
  [  2.92073274   2.12234163   5.58415794 ...,   0.93900502   0.07328516
  [  1.85395586  11.13914871   1.54042947 ...,   2.02693963   0.58828402
  [  0.94521332   1.10611272   4.41981649 ...,   0.04808487   1.97100222
     0.54948312]], carray((2000, 1024), float32)
   nbytes: 7.81 MB; cbytes: 8.09 MB; ratio: 0.97
   cparams := cparams(clevel=5, shuffle=1, cname='blosclz')
   rootdir := 'data/dogscats/tmp/x_act_val_resnet34_0_224.bc'
   mode    := 'a'
 [[  1.29020107   2.32586789   6.74245548 ...,   0.70285255   1.60787117
     0.0493584 ]
  [  4.4050312    4.15314722  10.62083817 ...,   0.51982337   1.14450645
  [  2.00788641   1.18459272   5.34251404 ...,   0.11724961   1.58721673
  [  1.81696892   2.15313721   8.69116592 ...,   0.94192082   0.56452471
  [  1.2544719    3.25818467   4.1094799  ...,   0.60879725   0.04835358
  [  5.00699234   2.93570065   4.85688496 ...,   0.19603917   1.10524523
     0.09501588]], carray((0, 1024), float32)
   nbytes: 0; cbytes: 4.00 KB; ratio: 0.00
   cparams := cparams(clevel=5, shuffle=1, cname='blosclz')
   rootdir := 'data/dogscats/tmp/x_act_test_resnet34_0_224.bc'
   mode    := 'a'
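
As a sanity check on the output above, the reported nbytes values can be reproduced from the array shapes alone. This is only a back-of-the-envelope sketch: `array_mib` is a hypothetical helper, and it assumes 4 bytes per float32 value and that the "MB" in the output means 2**20 bytes. Under those assumptions the cached activations are far too small to explain gigabytes of RAM use on their own.

```python
def array_mib(rows, cols, bytes_per_item=4):
    """Uncompressed size of a rows x cols array in MiB (float32 = 4 bytes)."""
    return rows * cols * bytes_per_item / 2**20

# Shapes taken from the learner.activations output above.
train_mib = array_mib(23000, 1024)
val_mib = array_mib(2000, 1024)

print(f"train: {train_mib:.2f} MiB, val: {val_mib:.2f} MiB")
# train: 89.84 MiB, val: 7.81 MiB
```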

I had the same problem with that line, 3, cycle_len=1) running on my home DL rig (Ubuntu + GTX 1080Ti): it would not even start.
It turned out that I was running an older version of the GitHub repo. As Jeremy pointed out in the thread, once I updated to the latest version, it worked fine.

I have a little workaround… As long as I just point to PyTorch’s DataLoader class (instead of the custom one), everything works as expected on my machine, without any memory issues, with the most recent version of the repo. I wouldn’t want to hold anyone up debugging an issue that only appears to be happening on my machine. So all is good on my end! :slight_smile:

Thanks - I suspect someone else will have the same problem, but we’ll see…

You’re not alone :upside_down:

I’ll try and look deeper into it tomorrow, following your tips.

Does decreasing num_workers when creating data also decrease your memory use?

num_workers=0 generates an error, “max_workers must be greater than 0”, in Cell #35:
ims = np.stack([get_augs() for i in range(6)])
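
For context, the wording of that error matches the ValueError that Python’s standard `concurrent.futures` executors raise when given a non-positive `max_workers`. That suggests, though this thread does not confirm it, that the value of num_workers is passed straight through to an executor. The sketch below reproduces the message with the standard library only; `make_executor` and `run_all` are hypothetical helpers showing one way to treat 0 as "run serially", not the library’s actual code.

```python
from concurrent.futures import ThreadPoolExecutor

def make_executor(num_workers):
    """Hypothetical helper: return an executor, or None meaning 'run serially'."""
    if num_workers <= 0:
        return None
    return ThreadPoolExecutor(max_workers=num_workers)

def run_all(func, items, num_workers):
    """Apply func to every item, in a pool if num_workers > 0, else in a loop."""
    executor = make_executor(num_workers)
    if executor is None:
        return [func(x) for x in items]
    with executor:
        return list(executor.map(func, items))

# Reproduce the error message from the post above.
try:
    ThreadPoolExecutor(max_workers=0)
    message = None
except ValueError as err:
    message = str(err)

print(message)
print(run_all(lambda x: x * 2, [1, 2, 3], num_workers=0))
```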

Trying num_workers=1: apart from being a lot slower (my CPU is an i5-4690K with 4 cores, so capable of 4 workers?), it still slowly but surely filled up the 16 GB of RAM before 50% of the first epoch, and had used another 19 GB of swap before finishing, in Cell #41, 3, cycle_len=1)

Once finished, the 16 GB of RAM is still fully used, and 7 GB of swap is in use.
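
For anyone who wants numbers rather than watching a system monitor, here is a minimal sketch of logging peak memory from inside the notebook with the standard `resource` module (Unix only). `peak_rss_mb` is a hypothetical helper, and the bytearray is just a stand-in for a training step. Note that `RUSAGE_SELF` covers only the notebook process, and `RUSAGE_CHILDREN` counts only child processes that have already exited, so memory held by live worker processes will not show up here.

```python
import resource
import sys

def peak_rss_mb(who=resource.RUSAGE_SELF):
    """Peak resident set size in MB for this process (or for finished children)."""
    peak = resource.getrusage(who).ru_maxrss
    # Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor

before = peak_rss_mb()
buffer = bytearray(50 * 1024 * 1024)  # stand-in for one training step
after = peak_rss_mb()

print(f"peak RSS before: {before:.1f} MB, after: {after:.1f} MB")
```

Calling `peak_rss_mb()` before and after each fit cell would show whether the notebook process itself is the one growing.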

Can you try downloading the latest version of anaconda and installing that from scratch, and doing conda env update there, to see if there’s some module issue going on?

Will try.

I’m not alone anymore :slight_smile: If you’re looking for a temporary workaround so you can run the notebooks, the following minor changes worked for me:

  1. Comment this line in
    #from .dataloader import DataLoader

  2. Change this line in
    from import DataLoader, Dataset, TensorDataset (Add DataLoader to the imports)


Thanks @jamesrequa :+1:

Your 2 lines did the trick: both the Lesson 1 and Lesson 2 notebooks run fine without using over 10 GB of RAM, and with no swap.

There’s a single error message, “Widget Javascript not detected. It may not be installed or enabled properly.”, that keeps popping up at every epoch cell, but I can’t see an actual difference in display compared with the saved notebooks pre-mods.

Can relate. I tried reading some of them; they’re full of mysteries and cryptic info.

Facing Memory and BrokenProcessPool errors

Hey Everyone,

Was everybody else able to resolve their issues? I break the Jupyter notebook every time at some step because of a MemoryError or a BrokenProcessPool error, and the only thing left at that point is to start from scratch, train the models again, and repeat. It’s kind of frustrating. I think the main cause of these errors is the limited memory of the g2.2xlarge instance, whose Tesla K520 GRID GPU has 4096 MiB. I can’t figure out the issue, but I hope someone here can help; it gets overloaded very quickly with GPU training.

Right now, it seems g2.2xlarge is one of the instances that Amazon is going to retire very soon, since it is nowhere to be found in their listings for Accelerated Computing; maybe not many people are using it. But for now I want to make it work.

Things I have tried

Changing num_workers from the default of 8 to 4 and 0, which doesn’t work because of an AssertionError, and changing the batch size from 64 to 32 to 2. These sometimes work but make the training extremely slow, which is not what I want right now. Nothing seems to solve the process pool errors that I encounter at some point or other. Another thing that sometimes works is skipping straight to the step I want to perform and leaving out the earlier steps, but since there’s a sequence @jeremy specifies at the end of the notebook, that’s not good enough for a state-of-the-art model.
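
To put a number on the slowdown from smaller batches: the count of iterations per epoch grows as the batch size shrinks. The sketch below assumes the 23,000-image training set shown in the activations output earlier in the thread; `batches_per_epoch` is a hypothetical helper. With a batch size of 64 it gives 360, which matches the 360 steps in the progress bar further down.

```python
import math

def batches_per_epoch(n_items, batch_size):
    """Number of iterations needed to see every item once."""
    return math.ceil(n_items / batch_size)

n_train = 23000  # assumed: training set size from the earlier output
for batch_size in (64, 32, 2):
    print(batch_size, batches_per_epoch(n_train, batch_size))
# 64 360
# 32 719
# 2 11500
```

Per-iteration overhead stays roughly fixed, so going from 64 to 2 means about 32 times as many iterations, each doing very little GPU work.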

Detailed Error Info

I got this while running, 3, cycle_len=1, cycle_mult=2). I’m clueless about this error, but it mentions both a process pool and a queue, so I think it’s happening because memory is full.

Note: I only reached this step because I rebooted my computer, skipped the intermediate steps, and ran just the data augmentation and this part. Batch sizes and other options were normal for ImageClassifierData.

A Jupyter Widget

  9%|▉         | 34/360 [01:05<10:30,  1.93s/it, loss=0.0872]

Process Process-75:
Traceback (most recent call last):
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/", line 258, in _bootstrap
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/concurrent/futures/", line 181, in _process_worker
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/", line 341, in put
    obj = _ForkingPickler.dumps(obj)
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/", line 51, in dumps
    cls(buf, protocol).dump(obj)

 10%|▉         | 35/360 [01:08<10:33,  1.95s/it, loss=0.0862]

Exception in thread Thread-24:
Traceback (most recent call last):
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/", line 916, in _bootstrap_inner
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/concurrent/futures/", line 295, in _queue_management_worker
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/concurrent/futures/", line 253, in shutdown_worker
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/", line 129, in put_nowait
    return self.put(obj, False)
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/", line 83, in put
    raise Full
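
For readers unfamiliar with the last line of that traceback: `Full` is the standard `queue.Full` exception, which a bounded queue raises when a non-blocking put finds no free slot. The sketch below shows only what the exception means, using a plain `queue.Queue` as a stand-in for the multiprocessing queue in the traceback (both raise the same exception class). It does not reproduce the crash or establish its root cause.

```python
import queue

# A bounded queue with room for a single item.
q = queue.Queue(maxsize=1)
q.put_nowait("first")  # fits

try:
    q.put_nowait("second")  # no free slot and no waiting allowed
    overflowed = False
except queue.Full:
    overflowed = True

print("queue.Full raised:", overflowed)
```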

Try doing a git pull. I’ve just changed how this runs.

@apil.tamang, may I please know which “original paper”?

I was facing the same issue and after a lot of tries somehow reached this article.
@jamesrequa : Your trick did solve the problem. Thanks :slight_smile:
@jeremy : I am running the latest version of the code from git but the problem is still there. I have 8 GB of RAM along with the GPU, but the Python notebook takes up the whole CPU and hangs the system.

Which article?

@ecdrid I meant this forum post in which we are talking :stuck_out_tongue:
Sorry for the confusion.

I’ve experienced the same problem, after a fresh Anaconda install and with everything up to date, on Ubuntu 17.10. The trick of changing the DataLoader stopped the RAM issue, but after a while I get an error about incompatible variable types.

I’ll write here to get notified in case of a solution!