Would love to learn more about what’s going on when these statements are executed. Apologies in advance for the deluge of questions!
    dls_lm = DataBlock(
        blocks=TextBlock.from_folder(path, is_lm=True),
        get_items=get_imdb, splitter=RandomSplitter(0.1)
    ).dataloaders(path, path=path, bs=128, seq_len=80)
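For reference, here is roughly how I understand get_imdb to be defined earlier in the notebook (reconstructed from memory, so the exact folders list is my assumption):

    from functools import partial
    from fastai.text.all import get_text_files

    # my understanding: get_imdb just lists the raw .txt review files
    # under the train/test/unsup subfolders of the IMDb dataset
    get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])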
-
The notebook text says TextBlock.from_folder tokenizes the data under path, and I can see the tokenized data saved under .fastai/data/imdb_tok/… Is it correct to assume get_imdb fetches items from imdb_tok/…?
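To try to answer this myself, the check I had in mind was simply to look at what the getter returns (assuming get_imdb hands back a list of file paths):

    # if the parents of these paths are .../imdb/train etc., the items come from
    # the raw folder; if they were under .../imdb_tok, they'd be the tokenized copies
    files = get_imdb(path)
    print(len(files))
    print(files[:3])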
-
If that’s the case, why do we need dataloaders(path, path=path, ...) instead of something like path=path_tok, pointing at the imdb_tok folder?
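In other words, I would naively have expected something along these lines (purely hypothetical, I haven't tried it, and the location of the tokenized cache is my guess):

    # hypothetical: what I naively expected, pointing the dataloaders at the
    # tokenized copy instead of the raw folder
    path_tok = path.parent/'imdb_tok'   # my guess at where the tokenized cache lives
    dls_lm = DataBlock(
        blocks=TextBlock.from_folder(path, is_lm=True),
        get_items=get_imdb, splitter=RandomSplitter(0.1)
    ).dataloaders(path_tok, path=path_tok, bs=128, seq_len=80)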
    learn = language_model_learner(
        dls_lm, AWD_LSTM, drop_mult=0.3,
        metrics=[accuracy, Perplexity()]
    ).to_fp16()
-
Does this statement download the pretrained model?
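Related to this: I noticed language_model_learner takes a pretrained argument, so I'm assuming the call above is equivalent to passing pretrained=True explicitly, and that this is the point where the WikiText-103 AWD-LSTM weights get downloaded (that last part is my assumption):

    # my assumption: pretrained defaults to True, so this should be equivalent to
    # the call above, and constructing the learner is what triggers the download
    # of the pretrained AWD-LSTM weights (rather than, say, fit/fine_tune)
    learn = language_model_learner(
        dls_lm, AWD_LSTM, pretrained=True,
        drop_mult=0.3, metrics=[accuracy, Perplexity()]
    ).to_fp16()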
-
When does numericalization happen: during creation of the learner, or on the fly as training proceeds?
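The check I had in mind here (assuming dls_lm.one_batch() and dls_lm.vocab work the way I think they do) is to look at a single batch directly:

    # grab one batch and inspect what the model actually receives; if x is already
    # a tensor of integer token ids, numericalization must be happening in the
    # data pipeline when batches are assembled, not inside the learner
    x, y = dls_lm.one_batch()
    print(x.shape, x.dtype)   # I'd expect something like (128, 80) and int64
    print([dls_lm.vocab[i] for i in x[0][:10].tolist()])   # map a few ids back to tokens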
-
Why is the dependent variable the independent variable offset by one token? Why can’t it just be the last token, if that is what we are training the model to predict?
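Just to make sure I’m asking the right thing, here is a toy illustration (plain Python, nothing fastai-specific) of what I understand the offset-by-one arrangement to look like:

    # a toy token stream
    tokens = ['xxbos', 'this', 'movie', 'was', 'not', 'bad', 'at', 'all']

    x = tokens[:-1]   # independent variable: every token except the last
    y = tokens[1:]    # dependent variable: the same stream shifted left by one

    # arranged like this, there is a prediction target at every position,
    # not just one target for the final token of the sequence
    for xi, yi in zip(x, y):
        print(f'after seeing ...{xi!r}, predict {yi!r}')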
-
The notebook text says: “It is important to maintain order within and across these subarrays, because we will use a model that maintains a state”. With a batch size of b, does this mean the model is tracking b copies of this state during training? Otherwise, wouldn’t we be mixing state from different data across the b mini-streams?
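For context, this is the mental model I have of the batching, sketched in plain Python (my rough picture, not the actual LMDataLoader implementation):

    # rough sketch: concatenate all documents into one long stream, cut it into
    # bs parallel mini-streams, then serve consecutive seq_len-wide slices
    bs, seq_len = 4, 5
    stream = list(range(40))        # stand-in for the numericalized token stream

    stream_len = len(stream) // bs  # length of each mini-stream
    rows = [stream[i*stream_len:(i+1)*stream_len] for i in range(bs)]

    batches = [[row[start:start+seq_len] for row in rows]
               for start in range(0, stream_len, seq_len)]

    # row i of batch n+1 continues exactly where row i of batch n left off,
    # which is presumably why the hidden state can be carried across batches;
    # my question is whether that state therefore has a batch dimension,
    # i.e. one state per mini-stream
    for b in batches[:2]:
        print(b)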