I was able to successfully fine-tune an LM on a custom dataset using the pre-trained model with the datablock API. I'll highlight the (small number of) steps here for documentation:
Assuming our data is in a pandas DataFrame with just the different fields that need to be combined into the text.
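For concreteness, here is a minimal sketch of what texts and PATH could look like in my setup (the CSV filename is a placeholder; any DataFrame with text columns works):

import pandas as pd
from pathlib import Path

PATH = Path('data/lm')  # where fastai will save its intermediate files
# placeholder filename; substitute your own source of rows
texts = pd.read_csv(PATH/'products.csv', usecols=['name', 'item_description'])

The datablock pipeline is then: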
from fastai.text import *  # fastai v1

# my dataset consists of name and item_description
data_lm = (TextList.from_df(texts, PATH, cols=['name', 'item_description'])
           .random_split_by_pct(0.1)  # hold out 10% of the rows for validation
           .label_for_lm()            # this does the tokenization and numericalization
           .databunch())
data_lm.save('lm-tokens')
# reload the processed data (this can also be reused later to avoid reprocessing)
data_lm = TextLMDataBunch.load(PATH, 'lm-tokens')
data_lm.show_batch() # take a look at the batch fed into the GPU
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.5, callback_fns=ShowGraph)
learn.lr_find()  # find a good learning rate for the frozen model
learn.recorder.plot()
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))  # train just the new head first
learn.recorder.plot_losses()
learn.save('fit-head')
learn.load('fit-head')
learn.unfreeze()  # unfreeze the whole model for fine-tuning
learn.lr_find()   # repeat the learning-rate search for the unfrozen model
learn.recorder.plot()
learn.fit_one_cycle(11, 1e-3, moms=(0.8,0.7))
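Not part of the run above, but the usual ULMFiT next steps are to save the fine-tuned encoder for a downstream classifier and to sanity-check the LM by generating some text. A minimal sketch (the names 'fine-tuned' and 'fine-tuned-enc' are just my choices, and the prompt is made up):

learn.save('fine-tuned')              # full LM weights
learn.save_encoder('fine-tuned-enc')  # encoder only, to load into a text classifier later
# quick sanity check: let the LM continue a prompt
print(learn.predict('This vintage jacket is', n_words=20, temperature=0.75))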
With a dataset of 5,635,745 rows, this took 21 hours, 22 minutes, and 6 seconds to run on a V100, with a final training loss of 2.697805, validation loss of 2.571279, and accuracy of 0.524987.