How to reduce memory usage in NLP data processing on Colab


(Darren Le) #1

Hi all,

I’m trying to run the imdb notebook from Lesson 10 on Google Colab. I’m basically using the same notebook, with this setup code taken from one of the discussions here:

!pip uninstall -y fastai
!pip install Pillow==4.1.1
!pip install "fastai==0.7.0"
!pip install scipy==1.0.0
!pip install pandas==0.23.4
!pip install torchtext==0.2.3
!apt-get -qq install -y libsm6 libxext6 && pip install -q -U opencv-python
import cv2
from os import path
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
!apt update && apt install -y libsm6 libxext6

accelerator = 'cu80' if path.exists('/opt/bin/nvidia-smi') else 'cpu'
!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.3.0.post4-{platform}-linux_x86_64.whl torchvision
import torch
!pip install image

%matplotlib inline
from fastai.imports import *

Things run well until this line, where the runtime dies and restarts every time:

trn_texts,val_texts = sklearn.model_selection.train_test_split(
    np.concatenate([trn_texts,val_texts]), test_size=0.1)

It seems like I’m running out of memory on the VM, but I’m not sure how to fix it. Any ideas as to the cause of and solution for this issue?
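To confirm it really is a RAM issue rather than something else, one thing you can try (a quick sketch, assuming `psutil` is available, which it is on standard Colab VMs) is printing the available memory before and after the suspect cell:

```python
# Check how much RAM the Colab VM has left (psutil ships with Colab).
import psutil

mem = psutil.virtual_memory()
print(f"total: {mem.total / 1e9:.1f} GB, available: {mem.available / 1e9:.1f} GB")
```

If `available` is already close to zero before the split runs, the `np.concatenate` plus the split copies will push the VM over the limit.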

Thanks!


(Darren Le) #2

UPDATE:

I noticed the RAM usage actually shoots up to about 9 GB after I run this:

np.random.seed(42) # fixed seed so the shuffle is reproducible
trn_idx = np.random.permutation(len(trn_texts))
val_idx = np.random.permutation(len(val_texts))
trn_texts = trn_texts[trn_idx]
val_texts = val_texts[val_idx]
trn_labels = trn_labels[trn_idx]
val_labels = val_labels[val_idx]

Why is the RAM usage so high here compared to

trn_texts,trn_labels = get_texts(PATH/'train')
val_texts,val_labels = get_texts(PATH/'test')

#3

I faced the same problem. To reduce memory usage, I deleted nonessential variables as soon as possible. Here is my code to replace the problematic cell:

all_texts = np.concatenate([trn_texts, val_texts])
del trn_texts
del val_texts
trn_texts, val_texts = sklearn.model_selection.train_test_split(all_texts, test_size=0.1)
del all_texts
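You may also be able to skip the separate permutation cell entirely: `train_test_split` shuffles by default (`shuffle=True`), and `random_state` makes it reproducible, so the delete-early pattern and the shuffle can be combined. A sketch, using small hypothetical stand-ins for the notebook’s arrays:

```python
import numpy as np
import sklearn.model_selection

# Hypothetical stand-ins for the IMDb text arrays from the notebook.
trn_texts = np.array([f"train doc {i}" for i in range(90)])
val_texts = np.array([f"valid doc {i}" for i in range(10)])

all_texts = np.concatenate([trn_texts, val_texts])
del trn_texts, val_texts  # drop the originals before splitting

# train_test_split shuffles by default; no separate permutation pass needed.
trn_texts, val_texts = sklearn.model_selection.train_test_split(
    all_texts, test_size=0.1, random_state=42)
del all_texts

print(len(trn_texts), len(val_texts))  # → 90 10
```

This avoids the fancy-indexing copies from the earlier cell, which is where the ~9 GB spike came from.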