If you want to try to figure it out and summarize your best understanding, I’d be happy to fill in any missing pieces for you. If you’re not familiar with the CS concept of ‘reduce’ you may want to google that…
FYI this is called a “fold” or “reduce” operation. You can learn more about them, including why you need to specify the initial [] starting point, here: Fold (higher-order function) - Wikipedia
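For a concrete picture, here is a toy Python example of a reduce/fold with an explicit [] starting value (the data here is made up, it's not from the notebook):

```python
from functools import reduce

# Toy example: concatenate several token lists into one flat list.
# The third argument ([]) is the initial accumulator. Without it,
# reduce would use the first element of the sequence as the starting
# value, which only works if that element already has the type you
# want to accumulate into.
token_lists = [['xbos', 'xfld', '1'], ['this', 'movie'], ['was', 'great']]

flat = reduce(lambda acc, toks: acc + toks, token_lists, [])
print(flat)  # ['xbos', 'xfld', '1', 'this', 'movie', 'was', 'great']
```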
Why are you making all the labels = 0 in the training/validation dataframes for the language model dataset? Given that these are ignored in language modeling, I don’t understand why we don’t just use the labels as is.
In def get_texts(df, n_lbls=1): you add a \nxbos xfld 1 to the beginning of each document, but why? And is there a reason you don't include an EOS tag?
I think Jeremy mentioned in the lesson that these tags signal to the network that a new text block or field has started, so it can (learn to) reset its internal state.
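Roughly, the tag-prepending step looks like this. This is my own simplified sketch of the idea (the helper name prepend_tags and the toy data are mine, not the notebook's exact code):

```python
import pandas as pd

BOS = 'xbos'   # marks the beginning of a document
FLD = 'xfld'   # marks the start of a field within a document

def prepend_tags(df, n_lbls=1):
    """Simplified sketch: join the text columns of df into one string per
    row, inserting xbos/xfld markers so the model can learn that a new
    document or field has started."""
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls + 1, len(df.columns)):
        texts += f' {FLD} {i - n_lbls + 1} ' + df[i].astype(str)
    return texts.values

# toy data: one label column followed by two text fields
df = pd.DataFrame([[0, 'great movie', 'would watch again'],
                   [1, 'terrible', 'fell asleep']])
print(repr(prepend_tags(df)[0]))
# '\nxbos xfld 1 great movie xfld 2 would watch again'
```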
I was also wondering about the labels = 0 step, but I don't have an answer either. Maybe the labels are not ignored during LM training and therefore must all be set to the same value?
I remember Jeremy mentioning somewhere that since the language model doesn't need a dependent category variable y, we just set them all to 0. Hopefully this helps.
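In other words, the label column is just a placeholder so the dataframes keep the (label, text) shape the rest of the code expects; the LM's real target at each position is simply the next token. Something like this illustrative sketch (the variable names and toy texts are mine, not the notebook's exact code):

```python
import numpy as np
import pandas as pd

# Toy texts standing in for the train/validation reviews. For the
# language model we can pool them, because the LM never uses the
# sentiment labels -- it only predicts the next token of each text.
trn_texts = np.array(['great movie', 'terrible film'])
val_texts = np.array(['would watch again'])

all_texts = np.concatenate([trn_texts, val_texts])

# Keep the (labels, text) column layout the downstream code expects,
# but fill the label column with a constant 0 placeholder.
df_lm = pd.DataFrame({'labels': [0] * len(all_texts), 'text': all_texts},
                     columns=['labels', 'text'])
print(df_lm)
```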