How can I solve memory issues while training an LSTM model in Keras (Python) on a large dataset for text generation?

Hello!
I'm trying to train a text-generation LSTM model on a large newspaper dataset (9 GB).
But every time I try to run it, I run out of memory.
I have been trying to implement generators and iterators, but I am having difficulty incorporating them into the model I have already implemented.

I would really appreciate it if you could go over the code and suggest how I can alter it so that it does not crash from running out of memory.

I have marked below the parts of the code that run out of memory.

Thank you so much for your help!

df is a 9 GB news article dataset

df = pd.read_json('Datasets/data/data.json')['content']

Is there any way I can simplify the above line so that it fits in memory? Any references, please?
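One option I have been looking at is streaming the file instead of loading all 9 GB at once. A rough sketch of what I mean (this assumes the file is, or can be converted to, JSON Lines format, since pandas only supports `chunksize` with `lines=True`; the chunk size of 10,000 is arbitrary):

```python
import pandas as pd

# Stream the file in chunks instead of loading all 9 GB at once.
# NOTE: chunksize requires lines=True (one JSON record per line); a plain
# JSON array would need converting first, e.g. with a streaming parser
# such as ijson.
reader = pd.read_json('Datasets/data/data.json', lines=True, chunksize=10_000)
for chunk in reader:
    for text in chunk['content']:
        ...  # process one article at a time
```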

wordlist = []

for i in tqdm(range(len(df))):
    # wl = create_wordlist(df.iloc[i]['content'])
    wl = create_wordlist(df.iloc[i])
    wordlist = wordlist + wl

def str_2d_list(wordlist):
    return ' '.join(str(item) for innerlist in wordlist for item in innerlist)

wordlist = str_2d_list(wordlist)

Is there any way I can simplify the above block so that it fits in memory? Any references, please?
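One thing I noticed is that `wordlist = wordlist + wl` copies the whole list on every iteration, which is quadratic in both time and peak memory. A sketch of the in-place version (assuming `create_wordlist` returns a flat list of words, in which case the `str_2d_list` join step could be dropped entirely):

```python
wordlist = []
for i in tqdm(range(len(df))):
    # extend() appends in place; `wordlist + wl` copies everything each time
    wordlist.extend(create_wordlist(df.iloc[i]))
```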

# count the number of words

word_counts = collections.Counter(wordlist)

Is there any way I can simplify the above line so that it fits in memory? Any references, please?
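The Counter could also be built incrementally, so the full word list never has to exist at once. A sketch combining it with the chunked reader from the first snippet (same JSON Lines assumption):

```python
import collections
import pandas as pd

word_counts = collections.Counter()
for chunk in pd.read_json('Datasets/data/data.json', lines=True, chunksize=10_000):
    for text in chunk['content']:
        # update() adds counts in place; only one chunk is in memory at a time
        word_counts.update(create_wordlist(text))
```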

# Mapping from index to word: that's the vocabulary

vocabulary_inv = [x[0] for x in word_counts.most_common()]
vocabulary_inv = list(sorted(vocabulary_inv))

Is there any way I can simplify the above block so that it fits in memory? Any references, please?
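Something I realized while debugging: `vocab_size` multiplies the size of every one-hot tensor below, so capping the vocabulary to the most frequent words shrinks memory everywhere downstream. A sketch (the 20,000 cap is an arbitrary number I picked; rare words then need an `<unk>` bucket, shown in the next snippet):

```python
max_vocab = 20_000  # arbitrary cap; every one-hot tensor scales with this
vocabulary_inv = sorted(w for w, _ in word_counts.most_common(max_vocab))
```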

# Mapping from word to index

vocab = {x: i for i, x in enumerate(vocabulary_inv)}
words = [x[0] for x in word_counts.most_common()]

Is there any way I can simplify the above block so that it fits in memory? Any references, please?
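With a capped vocabulary, the word-to-index mapping needs a fallback for out-of-vocabulary words. A sketch reserving index 0 for `<unk>` (the `word_index` helper is my own name, reused in later snippets):

```python
UNK = 0  # reserved index for out-of-vocabulary words
vocab = {w: i + 1 for i, w in enumerate(vocabulary_inv)}
vocab_size = len(vocab) + 1  # +1 for the <unk> slot

def word_index(w):
    # map rare/unseen words to the reserved <unk> index
    return vocab.get(w, UNK)
```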

#size of the vocabulary
vocab_size = len(words)
print("vocab size: ", vocab_size)

# create sequences
seq_length = 30
sequences_step = 1  # step to create sequences

sequences = []
next_words = []
for i in tqdm(range(0, len(wordlist) - seq_length, sequences_step)):  # added tqdm() here
    sequences.append(wordlist[i: i + seq_length])
    next_words.append(wordlist[i + seq_length])

Is there any way I can simplify the above block so that it fits in memory? Any references, please?
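Storing every 30-word window as a list of Python strings is expensive. If the corpus is encoded once as an int32 array, each window is just a slice, so no windows need to be materialized up front. A sketch (uses the `word_index` helper from above; assumes `wordlist` is a flat list of words):

```python
import numpy as np

# one int32 per word instead of one Python string object per word
encoded = np.fromiter((word_index(w) for w in wordlist), dtype=np.int32)

# window i is simply encoded[i : i + seq_length] and its target is
# encoded[i + seq_length]; the generator sketched below slices on demand
```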

X = np.zeros((len(sequences), seq_length, vocab_size), dtype=bool)  # np.bool is deprecated; plain bool works
y = np.zeros((len(sequences), vocab_size), dtype=bool)
for i, sentence in enumerate(sequences):
    for t, word in enumerate(sentence):
        X[i, t, vocab[word]] = 1
    y[i, vocab[next_words[i]]] = 1

Is there any way I can simplify the above block so that it fits in memory? Any references, please? This block is the worst offender.
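As far as I can tell, this is the main culprit: the dense one-hot `X` needs `len(sequences) × seq_length × vocab_size` bytes, which for millions of sequences and a large vocabulary is far more than any machine's RAM. The usual fix I have seen is to one-hot encode only one batch at a time inside a `keras.utils.Sequence`. A sketch built on the `encoded` array from the previous snippet (the class name is my own):

```python
import numpy as np
from tensorflow.keras.utils import Sequence

class OneHotBatches(Sequence):
    """One-hot encodes only batch_size windows at a time."""

    def __init__(self, encoded, seq_length, vocab_size, batch_size=32):
        self.encoded = encoded
        self.seq_length = seq_length
        self.vocab_size = vocab_size
        self.batch_size = batch_size
        self.n = len(encoded) - seq_length  # number of windows

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(self.n / self.batch_size))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        stop = min(start + self.batch_size, self.n)
        X = np.zeros((stop - start, self.seq_length, self.vocab_size), dtype=bool)
        y = np.zeros((stop - start, self.vocab_size), dtype=bool)
        for b, i in enumerate(range(start, stop)):
            window = self.encoded[i:i + self.seq_length]
            X[b, np.arange(self.seq_length), window] = 1  # one-hot each position
            y[b, self.encoded[i + self.seq_length]] = 1   # one-hot the next word
        return X, y
```

An alternative that avoids one-hot input entirely would be an `Embedding` first layer with integer inputs and `sparse_categorical_crossentropy` as the loss, but the generator above keeps my existing model unchanged.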

def bidirectional_lstm_model(seq_length, vocab_size):
    print('Build LSTM model.')
    model = Sequential()
    model.add(Bidirectional(LSTM(rnn_size, activation="relu"), input_shape=(seq_length, vocab_size)))
    model.add(Dropout(0.6))
    model.add(Dense(vocab_size))
    model.add(Activation('softmax'))

    optimizer = Adam(learning_rate=learning_rate)  # lr to learning_rate
    callbacks = [EarlyStopping(patience=2, monitor='val_loss')]
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=[categorical_accuracy])
    print("model built!")
    return model

rnn_size = 256 # size of RNN
seq_length = 30 # sequence length
learning_rate = 0.001 #learning rate

model = bidirectional_lstm_model(seq_length, vocab_size)
model.summary()

batch_size = 32 # minibatch size
num_epochs = 50 # number of epochs

callbacks = [EarlyStopping(patience=4, monitor='val_loss'),
             ModelCheckpoint(filepath='my_model_gen_sentences.{epoch:02d}-{val_loss:.2f}.hdf5',
                             monitor='val_loss', verbose=0, mode='auto', period=2)]  # changing `period` to `save_freq` causes trouble: it cannot save / find val_loss
# fit the model
history = model.fit(X, y,
                    batch_size=batch_size,
                    shuffle=True,
                    epochs=num_epochs,
                    callbacks=callbacks,
                    validation_split=0.1)

How should I call the memory-efficient implementation here?
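From what I have read, `validation_split` does not work with generators, so the data would be split into two `Sequence` instances and passed to `fit` directly. A sketch, continuing with `OneHotBatches` and `encoded` from the snippets above:

```python
# hold out the last 10% of the corpus for validation
split = int(0.9 * len(encoded))
train_gen = OneHotBatches(encoded[:split], seq_length, vocab_size, batch_size)
val_gen = OneHotBatches(encoded[split:], seq_length, vocab_size, batch_size)

history = model.fit(train_gen,
                    validation_data=val_gen,
                    epochs=num_epochs,
                    callbacks=callbacks,
                    shuffle=True)  # with a Sequence, this shuffles batch order
```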

print('\a')
print('\a')
print('\a')

# Epoch 12/50: 212423/212423 [==============================] - 335s 2ms/step
# loss: 331404043.6087 - categorical_accuracy: 0.0585 - val_loss: 8311385.3984 - val_categorical_accuracy: 0.1540

# Epoch 14/50: 6639/6639 [==============================] - 345s 52ms/step - loss: 617827456.0000
# categorical_accuracy: 0.0562 - val_loss: 145312592.0000 - val_categorical_accuracy: 0.1300

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

words_number = 30  # number of words to generate
seed_sentences = "বৃহস্পতিবার রাতের টিফিন খেয়ে একট"  # seed sentence to start the generating

# initiate sentences
generated = ''
sentence = []

# we shape the seed according to what the neural network needs:
for i in range(seq_length):
    sentence.append("অভিযোগে")

seed = seed_sentences.split()

for i in range(len(seed)):
    sentence[seq_length - i - 1] = seed[len(seed) - i - 1]

Is there any way I can simplify the above block? Any references, please?
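Not really a memory issue here, but I think the two loops can be collapsed into one slice assignment (assuming the seed is no longer than `seq_length`):

```python
sentence = ["অভিযোগে"] * seq_length      # pad with a filler word
seed = seed_sentences.split()
sentence[seq_length - len(seed):] = seed  # overlay the seed at the end
```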

generated += ' '.join(sentence)

# then, we generate the text
for i in range(words_number):
    # create the vector
    x = np.zeros((1, seq_length, vocab_size))
    for t, word in enumerate(sentence):
        x[0, t, vocab[word]] = 1.

Is there any way I can simplify the above block so that it fits in memory? Any references, please?
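This vector is only 1 × seq_length × vocab_size, so memory should be fine, but the inner loop can still be vectorized (uses the `word_index` helper from earlier so unseen seed words map to `<unk>`):

```python
x = np.zeros((1, seq_length, vocab_size), dtype=bool)
x[0, np.arange(seq_length), [word_index(w) for w in sentence]] = 1
```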

    # calculate next word
    preds = model.predict(x, verbose=0)[0]
    next_index = sample(preds, 0.33)
    next_word = vocabulary_inv[next_index]

    # add the next word to the text
    generated += " " + next_word
    # shift the sentence by one, and add the next word at its end
    sentence = sentence[1:] + [next_word]

Is there any way I can simplify the above block? Any references, please?
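Memory is constant here since the window stays `seq_length` words long, but a `collections.deque` with `maxlen` would express the sliding window without re-slicing a list each step:

```python
from collections import deque

window = deque(sentence, maxlen=seq_length)
window.append(next_word)  # the oldest word falls off the left automatically
```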

#print the whole text
print(generated)

Hi Scarlet,

Keras is part of TensorFlow, and this is mainly a fastai/PyTorch forum. Have you tried the Keras forums, such as Keras - TensorFlow Forum?

Regards, Conwyn