Would love to learn more about what’s going on when these statements are executed. Apologies in advance for the deluge of questions!
    dls_lm = DataBlock(
        blocks=TextBlock.from_folder(path, is_lm=True),
        get_items=get_imdb, splitter=RandomSplitter(0.1)
    ).dataloaders(path, path=path, bs=128, seq_len=80)
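For reference, here is roughly how I understand get_imdb to be defined earlier in the notebook (reconstructed from memory, so the exact folders list is my assumption):

    from functools import partial
    from fastai.text.all import get_text_files

    # my understanding: get_imdb just lists the raw .txt review files
    # under the train/test/unsup subfolders of the IMDb dataset
    get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])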
-
The notebook text says TextBlock.from_folder tokenizes the data under path, and I can see the tokenized data saved under .fastai/data/imdb_tok/… Is it correct to assume get_imdb fetches items from imdb_tok/…?
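To try to answer this myself, the check I had in mind was simply to look at what the getter returns (assuming get_imdb hands back a list of file paths):

    # if the parents of these paths are .../imdb/train etc., the items come from
    # the raw folder; if they were under .../imdb_tok, they'd be the tokenized copies
    files = get_imdb(path)
    print(len(files))
    print(files[:3])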
-
If that’s the case, why do we need dataloaders(path, path=path, ...) instead of something like path=path_tok, pointing at the imdb_tok folder?
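In other words, I would naively have expected something along these lines (purely hypothetical, I haven't tried it, and the location of the tokenized cache is my guess):

    # hypothetical: what I naively expected, pointing the dataloaders at the
    # tokenized copy instead of the raw folder
    path_tok = path.parent/'imdb_tok'   # my guess at where the tokenized cache lives
    dls_lm = DataBlock(
        blocks=TextBlock.from_folder(path, is_lm=True),
        get_items=get_imdb, splitter=RandomSplitter(0.1)
    ).dataloaders(path_tok, path=path_tok, bs=128, seq_len=80)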
    learn = language_model_learner(
        dls_lm, AWD_LSTM, drop_mult=0.3,
        metrics=[accuracy, Perplexity()]
    ).to_fp16()
-
Does this statement download the pretrained model?
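Related to this: I noticed language_model_learner takes a pretrained argument, so I'm assuming the call above is equivalent to passing pretrained=True explicitly, and that this is the point where the WikiText-103 AWD-LSTM weights get downloaded (that last part is my assumption):

    # my assumption: pretrained defaults to True, so this should be equivalent to
    # the call above, and constructing the learner is what triggers the download
    # of the pretrained AWD-LSTM weights (rather than, say, fit/fine_tune)
    learn = language_model_learner(
        dls_lm, AWD_LSTM, pretrained=True,
        drop_mult=0.3, metrics=[accuracy, Perplexity()]
    ).to_fp16()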
-
When does numericalization happen: during creation of the learner, or on the fly as training proceeds?
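The check I had in mind here (assuming dls_lm.one_batch() and dls_lm.vocab work the way I think they do) is to look at a single batch directly:

    # grab one batch and inspect what the model actually receives; if x is already
    # a tensor of integer token ids, numericalization must be happening in the
    # data pipeline when batches are assembled, not inside the learner
    x, y = dls_lm.one_batch()
    print(x.shape, x.dtype)   # I'd expect something like (128, 80) and int64
    print([dls_lm.vocab[i] for i in x[0][:10].tolist()])   # map a few ids back to tokens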
-
Why is the dependent variable the independent variable offset by one token? Why can’t it just be the last token, if that is what we are training the model to predict?
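Just to make sure I’m asking the right thing, here is a toy illustration (plain Python, nothing fastai-specific) of what I understand the offset-by-one arrangement to look like:

    # a toy token stream
    tokens = ['xxbos', 'this', 'movie', 'was', 'not', 'bad', 'at', 'all']

    x = tokens[:-1]   # independent variable: every token except the last
    y = tokens[1:]    # dependent variable: the same stream shifted left by one

    # arranged like this, there is a prediction target at every position,
    # not just one target for the final token of the sequence
    for xi, yi in zip(x, y):
        print(f'after seeing ...{xi!r}, predict {yi!r}')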
-
The notebook text says: “It is important to maintain order within and across these subarrays, because we will use a model that maintains a state”. With a batch size of b, does this mean the model is tracking b copies of this state during training? Otherwise, wouldn’t we be mixing state from different data across the b mini-streams?
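For context, this is the mental model I have of the batching, sketched in plain Python (my rough picture, not the actual LMDataLoader implementation):

    # rough sketch: concatenate all documents into one long stream, cut it into
    # bs parallel mini-streams, then serve consecutive seq_len-wide slices
    bs, seq_len = 4, 5
    stream = list(range(40))        # stand-in for the numericalized token stream

    stream_len = len(stream) // bs  # length of each mini-stream
    rows = [stream[i*stream_len:(i+1)*stream_len] for i in range(bs)]

    batches = [[row[start:start+seq_len] for row in rows]
               for start in range(0, stream_len, seq_len)]

    # row i of batch n+1 continues exactly where row i of batch n left off,
    # which is presumably why the hidden state can be carried across batches;
    # my question is whether that state therefore has a batch dimension,
    # i.e. one state per mini-stream
    for b in batches[:2]:
        print(b)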