Very slow loading of the Convnet pretrained model on lesson 1

You’re not alone :upside_down:

I’ll try and look deeper into it tomorrow, following your tips.

Does decreasing num_workers when creating data also decrease your memory use?

Setting num_workers=0 generates the error max_workers must be greater than 0 in Cell #35:
ims = np.stack([get_augs() for i in range(6)])
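That assertion appears to come straight from Python's concurrent.futures, which fastai's custom DataLoader uses for its worker pool: a pool with zero workers is rejected outright. A minimal stdlib-only illustration (standalone, not fastai code):

```python
from concurrent.futures import ThreadPoolExecutor

def pool_error_for(workers):
    """Return the ValueError message for an invalid worker count, else None."""
    try:
        ThreadPoolExecutor(max_workers=workers)
    except ValueError as e:
        return str(e)
    return None

print(pool_error_for(0))  # max_workers must be greater than 0
print(pool_error_for(4))  # None
```

So num_workers=0 ("do the work in the main process") isn't supported by this DataLoader; the smallest valid value is 1.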

Trying num_workers=1: apart from being a lot slower (my CPU is an i5-4690K with 4 cores, so presumably capable of 4 workers), it still slowly but surely filled up the 16 GB of RAM before 50% of the first epoch, and used another 19 GB of swap before finishing Cell #41, learn.fit(1e-2, 3, cycle_len=1).

Once finished, the 16 GB of RAM are still fully used, and 7 GB of swap remain occupied.

Can you try downloading the latest version of Anaconda, installing it from scratch, and running conda env update there, to see if there’s some module issue going on?

azerty will try

I’m not alone anymore :slight_smile: If you’re looking for a temporary workaround so you can run the notebooks, the following minor changes worked for me:

  1. Comment out this line in dataset.py:
    #from .dataloader import DataLoader

  2. Change this line in torch_imports.py, adding DataLoader to the torch.utils.data imports:
    from torch.utils.data import DataLoader, Dataset, TensorDataset
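Put together, the patch looks roughly like this (a sketch: the pre-change import line in torch_imports.py is reconstructed from the instructions above and may differ in your checkout):

```diff
--- fastai/dataset.py
-from .dataloader import DataLoader
+# from .dataloader import DataLoader

--- fastai/torch_imports.py
-from torch.utils.data import Dataset, TensorDataset
+from torch.utils.data import DataLoader, Dataset, TensorDataset
```

The net effect is to swap fastai's custom DataLoader for the standard torch.utils.data one wherever dataset.py uses it.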


Thanks @jamesrequa :+1:

Your 2 lines did the trick, both the Lesson 1 and Lesson 2 notebooks run fine without loading over 10Gb of RAM, and no Swap.

There’s a single error message, “Widget Javascript not detected. It may not be installed or enabled properly.”, that keeps popping up at every epoch cell, but I can’t see an actual difference in display compared with the saved pre-modification notebooks.


I can relate; I tried reading some of them, and they were full of mysteries and cryptic info.

Facing Memory and BrokenProcessPool errors

Hey Everyone,

Was anybody else able to resolve their issues? I break the Jupyter notebook every time at some step because of a MemoryError or a BrokenProcessPool error, and the only thing left at that point is to start from scratch, train the models again, and repeat, which is kind of frustrating. I think the major problem behind these errors is the limited memory that comes with the g2.2xlarge instance: its GRID K520 GPU has only 4096 MiB. I can’t figure out the issue, but I hope someone here can help; the GPU gets overloaded very quickly during training.

Right now, it seems g2.2xlarge is one of the instances that Amazon is going to retire very soon, since it’s no longer mentioned anywhere in their Accelerated Computing listings; maybe not many people are using it. But for now I want to make it work.

Things I have tried

Changing num_workers from the default of 8 to 4 or 0 doesn’t work because of an AssertionError. Changing the batch size from 64 to 32 to 2 sometimes works, but it makes training extremely slow, which is not what I want right now. Nothing seems to solve the process pool errors, which I encounter at some point or other. Another thing that sometimes works is skipping straight to the step I want to perform, leaving the other steps behind, but since there’s a sequence @jeremy specifies at the end of the notebook, that’s not good enough for a state-of-the-art model.
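For what it's worth, here's a back-of-the-envelope sketch of why shrinking the batch size is the usual lever (all the MiB figures below are made-up illustrative numbers, not measurements): activation memory grows roughly linearly with batch size, so a 4096 MiB card fills up long before a larger one.

```python
def fits_in_gpu(batch_size, per_sample_mib, model_mib, gpu_mib):
    # Rough linear model: a fixed cost for weights/optimizer state plus a
    # per-sample activation cost that scales with the batch size.
    needed_mib = model_mib + batch_size * per_sample_mib
    return needed_mib <= gpu_mib

# Hypothetical costs: ~1000 MiB of weights/state, ~50 MiB of activations per image.
print(fits_in_gpu(64, 50, 1000, 4096))  # False - bs=64 overflows a 4 GiB K520
print(fits_in_gpu(2, 50, 1000, 4096))   # True  - tiny batches fit, but train slowly
```

This is why bs=2 "sometimes works" while bs=64 never does on this card, and also why the smaller batches cost so much wall-clock time.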

Detailed Error Info

I got this while running learn.fit(lr, 3, cycle_len=1, cycle_mult=2). I’m clueless about this error, but it mentions both the process pool and a full queue, so I think it’s happening because memory is full.

Note: I only reached this step because I rebooted my computer, skipped the intermediary steps, and ran just the data augmentation and this part successfully; batch sizes and other options for ImageClassifierData were left at their usual values.

A Jupyter Widget

  9%|▉         | 34/360 [01:05<10:30,  1.93s/it, loss=0.0872]

Process Process-75:
MemoryError
Traceback (most recent call last):
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/concurrent/futures/process.py", line 181, in _process_worker
    result=r))
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 341, in put
    obj = _ForkingPickler.dumps(obj)
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)

 10%|▉         | 35/360 [01:08<10:33,  1.95s/it, loss=0.0862]

Exception in thread Thread-24:
Traceback (most recent call last):
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/concurrent/futures/process.py", line 295, in _queue_management_worker
    shutdown_worker()
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/concurrent/futures/process.py", line 253, in shutdown_worker
    call_queue.put_nowait(None)
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 129, in put_nowait
    return self.put(obj, False)
  File "/home/hannan/.conda/envs/fastai/lib/python3.6/multiprocessing/queues.py", line 83, in put
    raise Full
queue.Full


Try doing a git pull. I’ve just changed how this runs.


@apil.tamang, may I please know which “original paper”?

I was facing the same issue and after a lot of tries somehow reached this article.
@jamesrequa: your trick did solve the problem. Thanks :slight_smile:
@jeremy: I am running the latest version of the code from git, but the problem is still there. I have 8 GB of RAM along with the GPU, but the Python notebook takes up the whole CPU and hangs the system.


Which article?

@ecdrid I meant this forum post in which we are talking :stuck_out_tongue:
Sorry for the confusion.

I’ve experienced the same problem, after a fresh Anaconda install with everything up to date, on Ubuntu 17.10. The trick of changing the DataLoader stopped the RAM issue, but after a while I get an error about incompatible variable types.

I’ll write here to get notified in case of a solution!

@Mirko I’m in the exact same boat as you. It looks like it happened during the prediction-on-the-validation-set phase.

Edit: I see now it is during the accuracy calculation after one cycle of training, probably something to do with the ‘metrics now require tensors’ commit.

So I have a hacky workaround for the time being. To avoid the RAM problems I swapped the DataLoaders (from @jamesrequa), and to deal with the incompatible variable types I added the following line at the top of the accuracy function in fastai.metrics:

targs = targs.cuda()
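In context, the patched function ends up looking something like the sketch below (the body reflects fastai's metrics.py of that era, but check your own checkout; the added line is the only change):

```diff
--- fastai/metrics.py
 def accuracy(preds, targs):
+    targs = targs.cuda()  # workaround: move targets to the GPU to match preds
     preds = torch.max(preds, dim=1)[1]
     return (preds == targs).float().mean()
```

This only papers over the device mismatch, and it will break on CPU-only setups, so treat it as a stopgap until the upstream fix lands.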

Hello I’m new here in 2018 –

I just got to the

learn.fit(lr, 3, cycle_len=1, cycle_mult=2)

part of the Lesson 1 notebook and jeez did it take forever to run.

Epoch 100% 7/7 [16:14<00:00, 139.15s/it]

I’m running on Windows 10 with a 1080 Ti, and while it was running I could see 100% GPU utilization (but only 50% memory usage), so I figure it’s not CPU-bound. Any idea why my it/s is so slow? This seemed like the appropriate thread to bump. Thanks!


I have the same problem, I think.
When I run the code below:

learn.precompute=False
learn.fit(1e-2, 3, cycle_len=1)

the GPU crashed and the computer rebooted. I’m using my own Titan Xp GPU on an Ubuntu 18.04 system with CUDA 9.2.

The problem was resolved by pulling the latest from git, @jeremy . Thank you!