RuntimeError in lesson 2

Hello everyone. I'm currently taking the fast.ai 2020 course and I ran into a problem in lesson 2. When I try to run the learner I receive the following error:

RuntimeError: DataLoader worker (pid 4509) is killed by signal: Killed.

Then, if I run it again anyway, I receive:

OSError: [Errno 12] Cannot allocate memory

I'm working with a Paperspace Gradient GPU on a Linux VM. Does anyone have a clue how to solve this issue?

Thanks!

It's weird because yesterday I was able to run it without any problem… I tried solving it by setting num_workers = 0, but the same problem appears: the kernel dies or it takes way too long. Same with reducing the batch size.

num_workers is the number of CPU worker processes used to load data, so if you reduce it to 0 it will obviously slow things down, but it should not trigger an out-of-memory error.
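
For reference, this is roughly where that setting goes. A hypothetical sketch, not the exact lesson notebook: 'path/to/images' is a placeholder for a folder laid out as path/<label>/<image>.

from fastai.vision.all import *

path = Path('path/to/images')  # placeholder

dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=Resize(128))

# num_workers is passed straight through to the underlying PyTorch DataLoader;
# 0 means every batch is prepared in the main process (slower, but no extra worker processes).
dls = dblock.dataloaders(path, bs=16, num_workers=0)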

Check if your GPU is working.
If you were exceeding GPU memory you would get a "CUDA out of memory" error instead of "out of memory":

import torch
torch.cuda.is_available()  # True means PyTorch can see a CUDA GPU

If the GPU is OK, try reducing the batch size even more (the bs parameter in .dataloaders(...)).
If that still doesn't help, reduce the size of the images; see the sketch below.
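
Continuing the sketch above, smaller numbers would look something like this (the values are just examples to try):

# dblock.new(...) returns a copy of the DataBlock with the given transforms swapped in
smaller = dblock.new(item_tfms=Resize(64))   # 64x64 images instead of 128x128
dls = smaller.dataloaders(path, bs=4)        # much smaller batches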

If that still doesn't work, you may want to use a memory profiler (I don't know if Gradient has one built in) to find which function causes the error.
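
For example, if the memory_profiler package is available (it may need a pip install; I'm not sure what Gradient ships with), its notebook magic reports the peak RAM used by a single statement:

%load_ext memory_profiler

# Peak resident memory while building the DataLoaders from the sketch above
%memit dblock.dataloaders(path, bs=16, num_workers=0)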

Restart the notebook to clear memory, and check whether memory is already busy before you run the first cell, which would mean that some other process has to be shut down.
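
One way to check that from Python before running anything heavy is psutil (it may need installing, but it is a common package in Jupyter images):

import psutil

# System-wide RAM: total installed vs. what is actually free right now
mem = psutil.virtual_memory()
print(f"total {mem.total / 2**30:.1f} GiB, available {mem.available / 2**30:.1f} GiB, used {mem.percent}%")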


@Kornel Thank you for your response, I appreciate it. I will try all of the above and, in case it fails, how can I check if memory is busy before running the first cell? With the memory profiler? Just in case, is it possible that the problem is caused by my not storing the images and the new notebook in the /storage folder?
And I correct myself: it doesn't say "out of memory", it says:

OSError: [Errno 12] Cannot allocate memory

I ran:

import torch
torch.cuda.is_available()

and got:

False

What does this mean?


Pretty sure I've solved it: I was renting a Free-CPU instance with 2 GB of RAM. Now I've managed to rent a GPU instance with 30 GB of RAM. Let's see…

This means that your GPU is not being used. You should check your settings.

Run this in the console:

nvidia-smi

and check if any GPU is listed.
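
A couple of extra checks from the Python side can also narrow it down, e.g. whether the installed PyTorch build has CUDA support at all:

import torch

print(torch.__version__)          # PyTorch version
print(torch.version.cuda)         # None means a CPU-only build of PyTorch
print(torch.cuda.device_count())  # 0 means no CUDA device is visible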


Solved! Thank you very much @Kornel!!!


I am having memory issues as well, attempting to train a model locally on a computer without a viable GPU. I have cut the training set down to 37 images now (~3MB each) and have code like:

from fastai.vision.all import *
from pathlib import Path

path = Path('path/to/images')
images = get_image_files(path)

def labels(name):
    return name.parts[-2]

dls = ImageDataLoaders.from_path_func(path, 
                       bs=5, fnames=images, valid_pct=0.2, seed=0, 
                       label_func=labels, item_tfms=Resize(100))

learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(1)

So, I’ve set bs to a low value (as I gather that helps) and am using the smallest resnet arch I think, as well as resizing images to 100x100 (I think that’s what Resize(100) does, at least…).

Despite this, as soon as I start running this in a jupyter notebook I see the reported memory usage balloon to ~25GB of RAM in top (I only have 8GB on this machine) before the jupyter kernel crashes (I don’t see any progress on the progress bars at all before this happens).
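
Once the DataLoaders can be built without the kernel dying, a check like this should confirm whether Resize(100) is actually being applied:

# The image tensor of one batch should have shape [bs, 3, 100, 100]
xb, yb = dls.one_batch()
print(xb.shape, yb.shape)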

I’m using:

  • python 3.8.5,
  • fastai 2.0.8,
  • notebook 6.1.3

I’ve no idea what’s causing such massive memory usage; is ~25GB expected for such a small data set and the above code? What can I do to reduce memory usage (aside from using a GPU)?

Hi Eugenio,

I ran into exactly the same problem. I am working on a Gradient free account, which gives me 5 GB of memory. Could you please explain how you solved the problem?

This also solved the problem for me.