Lesson 10 Memory Error

I'm trying to run the notebook on over 300,000 text files. After running this code:

    CLASSES = ['l', 'r']

    def get_texts(path):
        texts,labels = [],[]
        for idx,label in enumerate(CLASSES):
            for fname in (path/label).glob('*.*'):
                texts.append(fname.open('r').read())
                labels.append(idx)
        return np.array(texts),np.array(labels)

    trn_texts,trn_labels = get_texts(PATH/'train')
    val_texts,val_labels = get_texts(PATH/'test')

I get this error:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-4-5fa80fc7c50a> in <module>()
  9     return np.array(texts),np.array(labels)
 10 
---> 11 trn_texts,trn_labels = get_texts(PATH/'train')
 12 val_texts,val_labels = get_texts(PATH/'test')

<ipython-input-4-5fa80fc7c50a> in get_texts(path)
  7             texts.append(fname.open('r').read())
  8             labels.append(idx)
----> 9     return np.array(texts),np.array(labels)
 10 
 11 trn_texts,trn_labels = get_texts(PATH/'train')

MemoryError: 

Any ideas?
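
One thing worth checking, since the traceback points at `np.array(texts)`: when NumPy converts a list of Python strings without an explicit dtype, it builds a fixed-width Unicode array in which every element is padded to the length of the longest string, at 4 bytes per character. With 300,000 files, a single very long file could make that array enormous. The sketch below uses made-up texts (not the poster's data) to show the size difference compared with `dtype=object`, which only stores references to the existing strings.

```python
import numpy as np

# Hypothetical stand-ins for the file contents: two short texts and one long one.
texts = ["short", "also short", "x" * 10_000]

# Default conversion: fixed-width Unicode, every slot padded to the longest text.
fixed = np.array(texts)

# Object array: one reference per text; the strings themselves are not copied.
as_obj = np.array(texts, dtype=object)

max_len = max(len(t) for t in texts)
print(fixed.dtype, fixed.nbytes)    # nbytes == 4 * max_len * len(texts)
print(as_obj.dtype, as_obj.nbytes)  # one pointer-sized slot per text
```

Whether this is the actual cause here is an assumption; it depends on how long the longest file is relative to the available RAM.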

Hello!
I'm not sure if it solves your issue, but have you tried saving your data in the same format as in the notebook and then loading it in chunks?

    def get_all(df, n_lbls):
        tok, labels = [], []
        for i, r in enumerate(df):  # df is an iterator of DataFrame chunks
            print(i)                # progress: index of the current chunk
            tok_, labels_ = get_texts(r, n_lbls)
            tok += tok_
            labels += labels_
        return tok, labels

    df_trn = pd.read_csv(LM_PATH/'train.csv', header=None, chunksize=chunksize)
    df_val = pd.read_csv(LM_PATH/'test.csv', header=None, chunksize=chunksize)
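
For anyone unsure what `chunksize` changes: with that argument, `pd.read_csv` returns an iterator instead of a single DataFrame, and each iteration yields a DataFrame of at most `chunksize` rows. Here is a minimal sketch using an in-memory CSV; the data and the tiny `chunksize` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: label in column 0, text in column 1, no header.
csv_data = io.StringIO("\n".join(f"{i % 2},text number {i}" for i in range(10)))

chunksize = 4  # illustrative value only
reader = pd.read_csv(csv_data, header=None, chunksize=chunksize)

chunk_sizes = []
for chunk in reader:  # each chunk is an ordinary DataFrame with at most `chunksize` rows
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # [4, 4, 2]
```

Note that `get_all` still appends every chunk's tokens to one list, so chunking limits how much raw CSV is in memory at a time, not the size of the final result.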

Not sure I understand this part:

Are you suggesting running the get_all cell first?

UPDATE: Forgive my ignorance. I've re-listened to the lecture, concatenated the files into CSVs, and am now waiting for the load process to finish. It has been running for about 20 minutes and I'm worried it's stuck: for the first few minutes all 4 CPU cores were at 100% load, but now they are almost idle. I also got this error:


Hope this has nothing to do with the process…

I rebooted the machine and get_all passed. However, I still got the memory error when trying to save tok_trn to a file.
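
A possible workaround for the save step, assuming the save goes through a single call that converts or serializes the whole token list at once: write the tokens out in smaller shards and read them back shard by shard. This is only a sketch; `save_in_shards`, `load_shards`, the file naming, and the sample data are all hypothetical and not part of the notebook.

```python
import pickle
import tempfile
from pathlib import Path

def save_in_shards(items, out_dir, shard_size):
    """Write items to out_dir as numbered pickle files of at most shard_size items each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(items), shard_size):
        path = out_dir / f"tok_{start // shard_size:05d}.pkl"
        with path.open("wb") as f:
            pickle.dump(items[start:start + shard_size], f)
        paths.append(path)
    return paths

def load_shards(paths):
    """Yield items back one shard at a time, in the original order."""
    for path in paths:
        with Path(path).open("rb") as f:
            yield from pickle.load(f)

# Hypothetical stand-in for tok_trn: a list of token lists.
tok_trn = [["xbos", "doc", str(i)] for i in range(10)]

with tempfile.TemporaryDirectory() as tmp:
    shard_paths = save_in_shards(tok_trn, tmp, shard_size=4)
    restored = list(load_shards(shard_paths))

print(len(shard_paths), restored == tok_trn)
```

Only one shard is serialized at a time, so peak memory during the save is tied to `shard_size` rather than to the full dataset.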