Hello!
I’m trying to train a text generation LSTM model on a large newspaper dataset (9 GB).
But every time I try to run it, I run out of memory.
I have been trying to implement generators and iterators, but I am having difficulty incorporating them into the model I have already implemented.
I would really appreciate it if you could go over the code and suggest how I can change it so that it does not crash from running out of memory.
Below, I have also marked the parts of the code that cause the program to run out of memory.
Thank you so much for your help!
# df is a 9 GB news article dataset
df = pd.read_json('Datasets/data/data.json')['content']
Is there any way I can simplify the above line so as to fit memory? Any reference please?
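For the line above, one direction I am considering is reading the articles one at a time instead of loading the whole file. This is only a minimal sketch, and it assumes the dataset is stored as JSON Lines (one article object per line), which may not be true of my file without converting it first. The function name `iter_articles` and its `field` parameter are hypothetical.

```python
import json

def iter_articles(path, field="content"):
    """Yield one article's text at a time from a JSON Lines file.

    Assumes one JSON object per line; a single large JSON array
    would need converting to this layout first.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Only the current article is held in memory.
            yield json.loads(line)[field]
```

With this, `for text in iter_articles('Datasets/data/data.json'):` would replace the indexing over `df`. As far as I understand, `pd.read_json(path, lines=True, chunksize=...)` offers a similar chunked read for JSON Lines files.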
wordlist = []
for i in tqdm(range(len(df))):
    # wl = create_wordlist(df.iloc[i]['content'])
    wl = create_wordlist(df.iloc[i])
    wordlist = wordlist + wl
def str_2d_list(wordlist):
    return ' '.join(str(item) for innerlist in wordlist for item in innerlist)
wordlist = str_2d_list(wordlist)
Is there any way I can simplify the above block so as to fit memory? Any reference please?
# count the number of words
word_counts = collections.Counter(wordlist)
Is there any way I can simplify the above line so as to fit memory? Any reference please?
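For the counting step, a possible alternative is to update the counter one article at a time, so the full word list never has to exist in memory. This is a rough sketch: `count_words` is a hypothetical name, and `tokenize` stands in for my `create_wordlist`, which I am assuming returns a flat list of words for a single article.

```python
import collections

def count_words(texts, tokenize=str.split):
    """Accumulate word counts one text at a time.

    `texts` can be any iterable, including a generator, so only
    one article's tokens are in memory at once.
    """
    counts = collections.Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

# Tiny usage example with in-memory strings.
word_counts = count_words(["a b a", "b c"])
print(len(word_counts))  # number of distinct words
```

Only the counter itself grows here, and its size depends on the vocabulary rather than on the total number of words in the dataset.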
# Mapping from index to word: that's the vocabulary
vocabulary_inv = [x[0] for x in word_counts.most_common()]
vocabulary_inv = list(sorted(vocabulary_inv))
Is there any way I can simplify the above block so as to fit memory? Any reference please?
# Mapping from word to index
vocab = {x: i for i, x in enumerate(vocabulary_inv)}
words = [x[0] for x in word_counts.most_common()]
Is there any way I can simplify the above block so as to fit memory? Any reference please?
# size of the vocabulary
vocab_size = len(words)
print("vocab size: ", vocab_size)
# create sequences
seq_length = 30
sequences_step = 1  # step to create sequences
sequences = []
next_words = []
for i in tqdm(range(0, len(wordlist) - seq_length, sequences_step)):  # added tqdm() here
    sequences.append(wordlist[i: i + seq_length])
    next_words.append(wordlist[i + seq_length])
Is there any way I can simplify the above block so as to fit memory? Any reference please?
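With a step of 1, the two lists above hold roughly `seq_length` copies of every word. A lazy version could yield each window only when it is needed. This is a sketch under the assumption that the tokens are available as a list (or any sequence that supports slicing); `iter_windows` is a hypothetical name.

```python
def iter_windows(tokens, seq_length=30, step=1):
    """Lazily yield (sequence, next_word) pairs from a token sequence.

    Uses the same index range as the loop that fills `sequences`
    and `next_words`, but keeps only one window alive at a time.
    """
    for i in range(0, len(tokens) - seq_length, step):
        yield tokens[i:i + seq_length], tokens[i + seq_length]

# Tiny usage example.
for seq, nxt in iter_windows(["a", "b", "c", "d", "e"], seq_length=3):
    print(seq, "->", nxt)
```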
# builtin bool is used because the np.bool alias is not available in every NumPy release
X = np.zeros((len(sequences), seq_length, vocab_size), dtype=bool)
y = np.zeros((len(sequences), vocab_size), dtype=bool)
for i, sentence in enumerate(sequences):
    for t, word in enumerate(sentence):
        X[i, t, vocab[word]] = 1
    y[i, vocab[next_words[i]]] = 1
Is there any way I can simplify the above block so as to fit memory? Any reference please?
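The dense `X` array needs one byte per entry, so it grows as number of sequences × `seq_length` × `vocab_size`, which I suspect is the largest allocation in the script. A generator could build the one-hot arrays for a single batch at a time instead. This is a sketch, not tested against my model: `batch_generator` is a hypothetical name, and it assumes the text has already been mapped to a list of integer word indices through `vocab`.

```python
import numpy as np

def batch_generator(token_ids, seq_length, vocab_size, batch_size):
    """Yield (X, y) one-hot batches forever, built one batch at a time.

    `token_ids` is assumed to be a list of integer word indices.
    """
    n_windows = len(token_ids) - seq_length
    positions = np.arange(seq_length)
    while True:  # loop indefinitely so it can serve several epochs
        for start in range(0, n_windows, batch_size):
            stop = min(start + batch_size, n_windows)
            size = stop - start
            X = np.zeros((size, seq_length, vocab_size), dtype=bool)
            y = np.zeros((size, vocab_size), dtype=bool)
            for row, i in enumerate(range(start, stop)):
                # One-hot encode the window and the word that follows it.
                X[row, positions, token_ids[i:i + seq_length]] = True
                y[row, token_ids[i + seq_length]] = True
            yield X, y
```

Each batch then costs only `batch_size` × `seq_length` × `vocab_size` bytes. Note that this sketch walks the windows in order and does not shuffle them.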
def bidirectional_lstm_model(seq_length, vocab_size):
    print('Build LSTM model.')
    model = Sequential()
    model.add(Bidirectional(LSTM(rnn_size, activation="relu"), input_shape=(seq_length, vocab_size)))
    model.add(Dropout(0.6))
    model.add(Dense(vocab_size))
    model.add(Activation('softmax'))
    optimizer = Adam(learning_rate=learning_rate)  # lr renamed to learning_rate
    callbacks = [EarlyStopping(patience=2, monitor='val_loss')]
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=[categorical_accuracy])
    print("model built!")
    return model
rnn_size = 256 # size of RNN
seq_length = 30 # sequence length
learning_rate = 0.001 #learning rate
model = bidirectional_lstm_model(seq_length, vocab_size)
model.summary()
batch_size = 32 # minibatch size
num_epochs = 50 # number of epochs
# switching period to save_freq causes trouble: cannot save / find val_loss
callbacks = [EarlyStopping(patience=4, monitor='val_loss'),
             ModelCheckpoint(filepath='my_model_gen_sentences.{epoch:02d}-{val_loss:.2f}.hdf5',
                             monitor='val_loss', verbose=0, mode='auto', period=2)]
#fit the model
history = model.fit(X, y,
                    batch_size=batch_size,
                    shuffle=True,
                    epochs=num_epochs,
                    callbacks=callbacks,
                    validation_split=0.1)
How should I call the memory efficient implementation here?
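My current understanding, which may be wrong, is that `validation_split` cannot be used when `model.fit` is given a generator, so the validation data would have to be split off by hand and passed through `validation_data`, with the number of batches per epoch given explicitly. The sketch below only shows that bookkeeping. `split_windows` and `steps_for` are hypothetical helpers, and `train_gen` / `val_gen` in the commented call are assumed generators that yield `(X, y)` batches.

```python
import math

def split_windows(n_windows, val_fraction=0.1):
    """Split window start indices into train and validation ranges.

    Holds out the last fraction, similar to validation_split.
    """
    n_val = int(n_windows * val_fraction)
    n_train = n_windows - n_val
    return range(0, n_train), range(n_train, n_windows)

def steps_for(n_samples, batch_size):
    """Number of batches needed to cover n_samples once."""
    return math.ceil(n_samples / batch_size)

train_idx, val_idx = split_windows(n_windows=1000, val_fraction=0.1)
train_steps = steps_for(len(train_idx), batch_size=32)
val_steps = steps_for(len(val_idx), batch_size=32)

# Assumed call, with hypothetical generators train_gen and val_gen:
# history = model.fit(train_gen,
#                     steps_per_epoch=train_steps,
#                     epochs=num_epochs,
#                     callbacks=callbacks,
#                     validation_data=val_gen,
#                     validation_steps=val_steps)
```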
print('\a')
print('\a')
print('\a')
#Epoch 12/50
#212423/212423 [==============================] - 335s 2ms/step - loss: 331404043.6087 - categorical_accuracy: 0.0585 - val_loss: 8311385.3984 - val_categorical_accuracy: 0.1540
#Epoch 14/50
#6639/6639 [==============================] - 345s 52ms/step - loss: 617827456.0000 - categorical_accuracy: 0.0562 - val_loss: 145312592.0000 - val_categorical_accuracy: 0.1300
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
words_number = 30  # number of words to generate
seed_sentences = "বৃহস্পতিবার রাতের টিফিন খেয়ে একট"  # seed sentence to start the generation
# initialize the sentence
generated = ''
sentence = []
# we shape the seed according to the neural network's needs:
for i in range(seq_length):
    sentence.append("অভিযোগে")
seed = seed_sentences.split()
for i in range(len(seed)):
    sentence[seq_length-i-1] = seed[len(seed)-i-1]
Is there any way I can simplify the above block so as to fit memory? Any reference please?
generated += ' '.join(sentence)
# then, we generate the text
for i in range(words_number):
    # create the vector
    x = np.zeros((1, seq_length, vocab_size))
    for t, word in enumerate(sentence):
        x[0, t, vocab[word]] = 1.
Is there any way I can simplify the above block so as to fit memory? Any reference please?
    # calculate the next word
    preds = model.predict(x, verbose=0)[0]
    next_index = sample(preds, 0.33)
    next_word = vocabulary_inv[next_index]
    # add the next word to the text
    generated += " " + next_word
    # shift the sentence by one, and add the next word at its end
    sentence = sentence[1:] + [next_word]
Is there any way I can simplify the above block so as to fit memory? Any reference please?
#print the whole text
print(generated)