Wordvectors preprocessing for very large (>1GB) GloVe pretrained vectors

Hi everyone,

The wordvectors notebook is used to process pretrained GloVe weights, and the following code was used.

import pickle
import numpy as np

def get_glove(name):
    # path, res_path and save_array come from the notebook's setup
    with open(path + 'glove.' + name + '.txt', 'r', encoding='utf8') as f:
        lines = [line.split() for line in f]          # whole file is read into memory here
    words = [d[0] for d in lines]
    vecs = np.stack([np.array(d[1:], dtype=np.float32) for d in lines])  # list, not a generator
    wordidx = {o: i for i, o in enumerate(words)}
    save_array(res_path + name + '.dat', vecs)
    pickle.dump(words, open(res_path + name + '_words.pkl', 'wb'))
    pickle.dump(wordidx, open(res_path + name + '_idx.pkl', 'wb'))

When I tried to process very large pretrained files (e.g. glove.42B.300d), I ran into RAM limitations since I only have 16GB. Is there a way to modify the code so that the pretrained file is read and processed line by line, instead of loading everything into RAM at once?
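Something along these lines is roughly what I have in mind, but it's an untested sketch and I'm not sure it's the right approach. The 42B file name and the 300-dim size are just the example I'm working with, and path/res_path come from the notebook's setup:

import numpy as np

src = path + 'glove.42B.300d.txt'
dim = 300

# First pass: count the rows so the on-disk array can be preallocated.
with open(src, encoding='utf8') as f:
    n_rows = sum(1 for _ in f)

# Second pass: stream the file and write each vector straight into a
# memory-mapped .npy file, so only one line is held in RAM at a time.
words = []
vecs = np.lib.format.open_memmap(res_path + '42B.300d.npy', mode='w+',
                                 dtype=np.float32, shape=(n_rows, dim))
with open(src, encoding='utf8') as f:
    for i, line in enumerate(f):
        parts = line.rstrip().split(' ')
        words.append(parts[0])
        vecs[i] = np.array(parts[1:], dtype=np.float32)
vecs.flush()

The resulting .npy could then be loaded lazily later with np.load(..., mmap_mode='r'), and the words/wordidx pickles written as before. Does that sound reasonable?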

Thank you.

Just to share my workaround for this.

I split the pretrained file into two txt files: one for the words and one for the vectors.

Then I read the vectors txt into numpy and saved the result to an h5py file.

It's slower, but it doesn't eat up so much RAM at once.

Anyway, you only need to do this once, so the extra time isn't really a problem.
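In case it helps, here's a rough sketch of what that looked like. The file names, the dataset name 'vecs', and the chunk size are just placeholders from my run:

import h5py
import numpy as np
from itertools import islice

src = path + 'glove.42B.300d.txt'           # original GloVe file
words_txt = res_path + '42B.300d_words.txt'
vecs_txt = res_path + '42B.300d_vecs.txt'

# 1. Split the pretrained file into a words file and a vectors file, line by line.
n_rows = 0
with open(src, encoding='utf8') as f_in, \
     open(words_txt, 'w', encoding='utf8') as f_w, \
     open(vecs_txt, 'w') as f_v:
    for line in f_in:
        word, vec = line.rstrip().split(' ', 1)
        f_w.write(word + '\n')
        f_v.write(vec + '\n')
        n_rows += 1

# 2. Read the vectors file into numpy a chunk at a time and write each chunk
#    into an h5py dataset, so the full matrix never sits in RAM.
dim = 300
chunk = 100000                              # rows per chunk; just an example
with open(vecs_txt) as f_v, h5py.File(res_path + '42B.300d.h5', 'w') as f_h5:
    dset = f_h5.create_dataset('vecs', shape=(n_rows, dim), dtype='float32')
    row = 0
    while True:
        block = list(islice(f_v, chunk))
        if not block:
            break
        dset[row:row + len(block)] = np.loadtxt(block, dtype=np.float32, ndmin=2)
        row += len(block)

The words file can then be read back and pickled into words/wordidx the same way as in get_glove.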

Please let me know if you have a better way of doing this.