Out-of-memory (OOM) issues on a P2 instance when running the Lesson2 notebook

When I attempt to go through the Lesson2 notebook, I run into a memory issue at the following line:

trn_data = load_array(model_path + 'train_data.bc')
In an attempt to address the issue, I have done the following:

  • stopped all other notebooks that were running
  • restarted the Lesson2 notebook kernel
  • stopped and restarted the instance

After restarting the Lesson2 notebook, I installed htop (an interactive process viewer) on my instance to look at the running processes and memory usage. Initially, memory usage was fairly low; however, by the time I got to the following lines:

save_array(model_path + 'train_data.bc', trn_data)
save_array(model_path + 'valid_data.bc', val_data)

my memory usage is at 55.2 GB out of 60 GB. No other cells in the notebook are running at the time I check the memory usage.
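For context, save_array() and load_array() in the course's utils.py are (as far as I can tell) thin wrappers around bcolz, roughly like the sketch below - so load_array() materialises the entire array back into RAM rather than memory-mapping it:

import bcolz

# rough sketch of what the course utils.py does - not the exact source
def save_array(fname, arr):
    # write the array out to a compressed bcolz directory on disk
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

def load_array(fname):
    # open the bcolz directory and pull the whole array back into memory
    return bcolz.open(fname)[:]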

Here’s some information on my instance:

  • Instance type: p2.xlarge
  • Network interfaces: eth0
  • Source/dest. check: True
  • EBS-optimized: False
  • Root device: /dev/sda1
  • Block devices: /dev/sda1
  • Availability zone: us-west-2a
  • AMI ID: fastai-dl-01
  • Virtualization: hvm

When I run df -h, here's the output I get back:

Filesystem      Size  Used  Avail  Use%  Mounted on
udev             30G     0    30G    0%  /dev
tmpfs           6.0G  8.9M   6.0G    1%  /run
/dev/xvda1      126G   18G   104G   15%  /
tmpfs            30G     0    30G    0%  /dev/shm
tmpfs           5.0M     0   5.0M    0%  /run/lock
tmpfs            30G     0    30G    0%  /sys/fs/cgroup
tmpfs           6.0G   12K   6.0G    1%  /run/user/1000
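(df -h only covers disk, of course; to keep an eye on RAM from inside the notebook instead of switching over to htop, something like this works - assuming psutil is available on the AMI, which it usually is with Anaconda:)

import psutil

# print current RAM usage; psutil reports bytes, so convert to GB
mem = psutil.virtual_memory()
print('used: %.1f GB / total: %.1f GB (%.0f%%)' % (mem.used / 1e9, mem.total / 1e9, mem.percent))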

Here’s a screenshot of the processes running by the time I get to the load_array:

I hope I’ve given enough information :grimacing: Has anyone else run into this problem? Any ideas on ways to get around the memory issue?


What are trn_data.shape and val_data.shape? What if you split the cell in two, so you first do val_data, and then do trn_data? Does saving just val_data work (since it’s quite a bit smaller)?

Hi @jeremy.

trn_data.shape = (23000, 3, 224, 224)
val_data.shape = (2000, 3, 224, 224)
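Just doing the arithmetic on those shapes (assuming the arrays are float32, i.e. 4 bytes per element - I haven't checked the exact dtype get_data() returns), the raw arrays alone are sizeable:

# rough footprint of the arrays themselves, assuming float32 (4 bytes/element)
trn_bytes = 23000 * 3 * 224 * 224 * 4   # ~13.8 GB
val_bytes = 2000 * 3 * 224 * 224 * 4    # ~1.2 GB
print(trn_bytes / 1e9, val_bytes / 1e9)

So trn_data alone is roughly 14 GB, and with everything else in the notebook still resident, another full copy from load_array() could plausibly push a 60 GB box over the edge.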

I also split

save_array(model_path + 'train_data.bc', trn_data)
save_array(model_path + 'valid_data.bc', val_data)

into two cells. What I noticed is that memory usage went up to 55.2 GB with the first save_array (the one with the train data). After the save_array(model_path + 'train_data.bc', trn_data) cell finished, memory usage stayed at that level rather than being released. So by the time I got to the cell with just trn_data = load_array(model_path + 'train_data.bc'), I immediately got a memory error.

So it does seem as if saving just val_data would work. However, I'm not sure why I'm running into this issue when apparently others are not – do other people have more memory?

I have the same issue on a P2 instance. But I'm not worried about it, since we already have the training and validation data arrays loaded into memory at this point in the notebook. There's no need to reload them (unless, of course, you restart your notebook and want to continue where you left off). If you want, you can add del trn_data and del val_data before the load_array() calls. That frees the memory occupied by those variables before loading them back from the saved data.
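In the notebook that would look roughly like this (a sketch, assuming the same model_path and file names used above):

# drop the in-memory copies first, then reload from the saved bcolz arrays
import gc

del trn_data
del val_data
gc.collect()  # optional, but nudges Python to release the memory right away

trn_data = load_array(model_path + 'train_data.bc')
val_data = load_array(model_path + 'valid_data.bc')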


Exactly, @jeff ! :slight_smile: You only want to run load_array() if you want to avoid waiting for get_data(), and you've already run get_data() and saved the results. There's no reason to run load_array() immediately after save_array() - you already have the array in memory, so there's no need to load it again.
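In other words, the intended workflow looks something like this (just a sketch - the get_data() calls are meant to be the ones from earlier in the notebook):

# first session: compute the arrays once and save them to disk
trn_data = get_data(path + 'train')
val_data = get_data(path + 'valid')
save_array(model_path + 'train_data.bc', trn_data)
save_array(model_path + 'valid_data.bc', val_data)

# any later session: skip get_data() entirely and just reload the saved arrays
trn_data = load_array(model_path + 'train_data.bc')
val_data = load_array(model_path + 'valid_data.bc')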
