Wordvectors preprocessing for very large (>1GB) GloVe pretrained vectors

Hi everyone,

The wordvectors notebook is used to process pretrained GloVe weights, and the following code was used.

import pickle
import numpy as np

def get_glove(name):
    # path, res_path and save_array come from the notebook's setup
    with open(path + 'glove.' + name + '.txt', 'r', encoding='utf8') as f:
        lines = [line.split() for line in f]          # whole file is read into memory here
    words = [d[0] for d in lines]
    vecs = np.stack([np.array(d[1:], dtype=np.float32) for d in lines])  # list, not a generator
    wordidx = {o: i for i, o in enumerate(words)}
    save_array(res_path + name + '.dat', vecs)
    pickle.dump(words, open(res_path + name + '_words.pkl', 'wb'))
    pickle.dump(wordidx, open(res_path + name + '_idx.pkl', 'wb'))

When I tried to process very large pretrained files (e.g. glove.42B.300d), I ran into RAM limitations since I only have 16GB. Is there a way to modify the code so that the pretrained file is read and processed line by line, instead of loading everything into RAM at once?
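Something along these lines is roughly what I have in mind, but it's an untested sketch and I'm not sure it's the right approach. The 42B file name and the 300-dim size are just the example I'm working with, and path/res_path come from the notebook's setup:

import numpy as np

src = path + 'glove.42B.300d.txt'
dim = 300

# First pass: count the rows so the on-disk array can be preallocated.
with open(src, encoding='utf8') as f:
    n_rows = sum(1 for _ in f)

# Second pass: stream the file and write each vector straight into a
# memory-mapped .npy file, so only one line is held in RAM at a time.
words = []
vecs = np.lib.format.open_memmap(res_path + '42B.300d.npy', mode='w+',
                                 dtype=np.float32, shape=(n_rows, dim))
with open(src, encoding='utf8') as f:
    for i, line in enumerate(f):
        parts = line.rstrip().split(' ')
        words.append(parts[0])
        vecs[i] = np.array(parts[1:], dtype=np.float32)
vecs.flush()

The resulting .npy could then be loaded lazily later with np.load(..., mmap_mode='r'), and the words/wordidx pickles written as before. Does that sound reasonable?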

Thank you.

Just to share my workaround for this.

I split the pretrained file into two txt files: one for the words and one for the vectors.

Then I read the vectors txt into numpy and saved the result to an h5py file.

It's slower, but it doesn't eat up so much RAM at once.

Anyway, you only need to do this once, so the extra time isn't really a problem.
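In case it helps, here's a rough sketch of what that looked like. The file names, the dataset name 'vecs', and the chunk size are just placeholders from my run:

import h5py
import numpy as np
from itertools import islice

src = path + 'glove.42B.300d.txt'           # original GloVe file
words_txt = res_path + '42B.300d_words.txt'
vecs_txt = res_path + '42B.300d_vecs.txt'

# 1. Split the pretrained file into a words file and a vectors file, line by line.
n_rows = 0
with open(src, encoding='utf8') as f_in, \
     open(words_txt, 'w', encoding='utf8') as f_w, \
     open(vecs_txt, 'w') as f_v:
    for line in f_in:
        word, vec = line.rstrip().split(' ', 1)
        f_w.write(word + '\n')
        f_v.write(vec + '\n')
        n_rows += 1

# 2. Read the vectors file into numpy a chunk at a time and write each chunk
#    into an h5py dataset, so the full matrix never sits in RAM.
dim = 300
chunk = 100000                              # rows per chunk; just an example
with open(vecs_txt) as f_v, h5py.File(res_path + '42B.300d.h5', 'w') as f_h5:
    dset = f_h5.create_dataset('vecs', shape=(n_rows, dim), dtype='float32')
    row = 0
    while True:
        block = list(islice(f_v, chunk))
        if not block:
            break
        dset[row:row + len(block)] = np.loadtxt(block, dtype=np.float32, ndmin=2)
        row += len(block)

The words file can then be read back and pickled into words/wordidx the same way as in get_glove.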

Please let me know if you have a better way of doing this.