Training a language model from scratch with pre-tokenized text

jodiak · December 20, 2020, 10:05pm

Hello everyone,
I’ve been playing around with fastai for text applications and wanted to try training a language model from scratch using my own tokenized text. I have an LMDataLoader that I’ve instantiated from my list of numericalized texts, but when I attempt to create a Learner I get the following error:

learn = language_model_learner(dl, AWD_LSTM)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/fastai/text/learner.py", line 194, in language_model_learner
    vocab = _get_text_vocab(dls)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/fastai/text/learner.py", line 186, in _get_text_vocab
    vocab = dls.vocab
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/fastcore/basics.py", line 378, in __getattr__
    if attr is not None: return getattr(attr,k)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/fastcore/basics.py", line 378, in __getattr__
    if attr is not None: return getattr(attr,k)
AttributeError: 'list' object has no attribute 'vocab'

from fastai.text.all import *

# ex: "words" is a list of numericalized texts:
# [[287, 85, 66, 1, 287, 36...], [ 72, 287, 152, 46, 6...],...]
bs,sl = 4,50
ints = L(*words).map(tensor)  # one tensor of token ids per text

dl = LMDataLoader(ints, bs=bs, seq_len=sl, shuffle=True)
learn = language_model_learner(dl, AWD_LSTM) # error occurs on creation

Do I need to manually pass a vocabulary to a Learner or DataLoader? Apologies if I’ve missed something obvious; I am grokking much of this library as I go along with the course.
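
For what it's worth, here is a minimal sketch of what the two `__getattr__` frames in the traceback suggest is happening: the lookup of `vocab` seems to be forwarded from the loader to whatever it wraps, and ends up on a plain list. The `Delegating` class below is a hypothetical stand-in written in pure Python, not fastai's or fastcore's actual code, and the two-level wrapping is my assumption based on the traceback.

```python
class Delegating:
    """Hypothetical stand-in: forwards unknown attributes to a wrapped object."""

    def __init__(self, inner):
        self.inner = inner

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup has failed.
        if name == "inner":
            # Guard against infinite recursion if `inner` was never set.
            raise AttributeError(name)
        return getattr(self.inner, name)


# Two levels of delegation, mirroring the two __getattr__ frames above
# (assumption: loader -> item collection -> plain list).
items = Delegating([[287, 85, 66], [72, 287, 152]])
loader = Delegating(items)

try:
    loader.vocab
    message = None
except AttributeError as e:
    message = str(e)

print(message)
```

If that reading is right, the error only says that nothing in the chain defines `vocab`, which is why I suspect the vocabulary has to be supplied somewhere.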