Training a language model on 1.2 billion keywords in chunks

I have the following code that trains a language model on 1.2 billion keywords from my corpus.
My 64 GB RAM machine does not allow me to load the whole thing at once,
so I'm trying to train in chunks and reuse whatever was already learned when continuing with the next 100K keywords.

The issue is this part:
else:
    fname = f'tokenized_single_model_part{index-1}'
    learn = language_model_learner(data, AWD_LSTM, drop_mult=0.5, pretrained_fnames=fname)
It's ignoring the fname that I'm providing.
What do I need to do to save the model after each chunk and then load it for the next cycle? (A sketch of what I assumed would work is below the full loop.)

import numpy as np
import pandas as pd
from fastai.text import *

index = 0
for df in pd.read_csv(OUT_PATH/'model/tokenized_single.txt.gz', header=None, chunksize=100000):
    print(f'Started {index}')
    df.dropna(inplace=True)

    # Shuffle the chunk and hold out 20% for validation
    valid_pct = 0.2
    df = df.iloc[np.random.permutation(len(df))]
    cut = int(valid_pct * len(df)) + 1
    print(len(df[cut:]), len(df[:cut]))

    data = TextLMDataBunch.from_df(OUT_PATH, train_df=df[cut:], valid_df=df[:cut], text_cols=0, bs=300)

    if index == 0:
        # First chunk: start from a fresh AWD_LSTM
        learn = language_model_learner(data, AWD_LSTM, drop_mult=0.5)
    else:
        # Later chunks: try to continue from the weights saved for the previous chunk
        fname = f'tokenized_single_model_part{index-1}'
        learn = language_model_learner(data, AWD_LSTM, drop_mult=0.5, pretrained_fnames=fname)

    learn.unfreeze()
    learn.fit_one_cycle(2, 1e-2)

    # Save the full model and the encoder for this chunk
    learn.save(f'tokenized_single_model_part{index}')
    learn.save_encoder(f'tokenized_single_model_enc_part{index}')
    print(f'Finished {index}')
    index += 1
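
In case it clarifies what I'm aiming for, here is a minimal sketch of the save/load cycle I assumed should work. This is fastai v1; using learn.save/learn.load instead of pretrained_fnames and pinning the vocab from the first chunk are my assumptions, not something I've verified at scale:

import numpy as np
import pandas as pd
from fastai.text import *

vocab = None
index = 0
for df in pd.read_csv(OUT_PATH/'model/tokenized_single.txt.gz', header=None, chunksize=100000):
    df.dropna(inplace=True)
    df = df.iloc[np.random.permutation(len(df))]
    cut = int(0.2 * len(df)) + 1

    # Reuse the vocab built from the first chunk so numericalization (and hence
    # the embedding size) stays the same across chunks -- my assumption is that
    # this is needed for reloading saved weights to make sense.
    data = TextLMDataBunch.from_df(OUT_PATH, train_df=df[cut:], valid_df=df[:cut],
                                   text_cols=0, bs=300, vocab=vocab)
    if vocab is None:
        vocab = data.vocab

    learn = language_model_learner(data, AWD_LSTM, drop_mult=0.5)
    if index > 0:
        # Restore the weights saved after the previous chunk
        learn.load(f'tokenized_single_model_part{index-1}')

    learn.unfreeze()
    learn.fit_one_cycle(2, 1e-2)
    learn.save(f'tokenized_single_model_part{index}')
    learn.save_encoder(f'tokenized_single_model_enc_part{index}')
    index += 1

The idea behind the sketch is that keeping one fixed vocab is what makes the saved weights compatible from chunk to chunk, but I'm not sure whether that's the intended pattern or whether pretrained_fnames is supposed to handle this instead.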