Fastai.text language modeling example

So I’m probably doing a lot of things wrong, but here it is … a working example of using the fastai.text package to do language modeling.

Would love and appreciate any and all feedback! What am I doing wrong, and what can be improved?

If you have questions, let me know. I’m planning to make this a blog post after getting feedback here.

Thanks - wg


I’m impressed :slight_smile:

When I was reading through last night, I was totally puzzled by how a vocabulary or a map that numericalizes texts would get built.

After looking at your code and how you created trn_docs_pp by calling Tokenizer.proc_all, you might still be able to use torchtext:

from torchtext import data
TEXT = data.Field()
TEXT.build_vocab(trn_docs_pp, val_docs_pp)

I might be using it wrong, but a quick sanity check seems to return what I would expect:

And I was thinking maybe we can send something like above nums to LanguageModelLoader constructor?

By the way, I stole the line text = sum(trn_docs_pp, []) and still can’t figure out why it flattens the list… Kind of a neat trick :blush:


Nice insight there @hiromi.

Looks like torchtext can still be helpful in building the vocab in fastai.text. Will be interesting to see how Jeremy sets things up.

It’s designed to work without torchtext. I’ll start working on a notebook now :slight_smile:


If you want to try to figure it out and summarize your best understanding, I’d be happy to fill in any missing pieces for you. :slight_smile: If you’re not familiar with the CS concept of ‘reduce’ you may want to google that…


Here is my understanding in gist :slight_smile:


Nice Explanation…


Can someone recommend where to start studying NLP?
Preferably with notebooks

As promised, here’s a notebook showing how to use fastai.text without torchtext.
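For anyone curious before opening the notebook: I’m not sure exactly how it’s done there, but a minimal sketch of building a vocab and numericalizing texts without torchtext might look roughly like this (the stoi/itos names and the _unk_ token are assumptions borrowed from torchtext conventions, not necessarily what the notebook uses):

```python
from collections import Counter

# toy stand-ins for the tokenized training docs (lists of tokens)
trn_docs_pp = [['the', 'cat', 'sat'], ['the', 'dog', 'sat']]

# count token frequencies across all docs
freq = Counter(tok for doc in trn_docs_pp for tok in doc)

# itos: index -> string, most frequent tokens first, with an unknown
# token reserved at index 0
itos = ['_unk_'] + [tok for tok, _ in freq.most_common()]

# stoi: string -> index; tokens not in the vocab map to 0 (_unk_)
stoi = {tok: i for i, tok in enumerate(itos)}

# numericalize: turn each doc into a list of ints
nums = [[stoi.get(tok, 0) for tok in doc] for doc in trn_docs_pp]
print(nums)
```

Mapping the numbers back through itos should recover the original tokens, which is a quick way to sanity-check the round trip.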


Looks like you didn’t need me after all :slight_smile:

FYI this is called a “fold” or “reduce” operation. You can learn more about them, including why you need to specify the initial [] starting point, here:
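To spell out the trick: sum(lst, []) is exactly that fold, starting from the initial value [] and repeatedly applying +, which for lists means concatenation. A small demo (the doc contents are made up for illustration):

```python
from functools import reduce
import operator

# stand-in for tokenized docs: a list of lists of tokens
trn_docs_pp = [[1, 2], [3], [4, 5]]

# sum() folds + over the list, starting from []:
# (([] + [1, 2]) + [3]) + [4, 5]  ->  one flat list
flat_sum = sum(trn_docs_pp, [])

# the same thing written as an explicit reduce, with []
# as the required starting point
flat_reduce = reduce(operator.add, trn_docs_pp, [])

print(flat_sum)
print(flat_sum == flat_reduce)
```

Without the [] starting point, sum would try 0 + [1, 2] and raise a TypeError, which is why the initial value has to be specified.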


I have a great teacher :slight_smile:

Thank you for the reference and the notebook! Now the mystery of “what would Jeremy do” with fastai.text has been solved :tada:


Going through gist and I have a few questions:

  1. Why are you making all the labels = 0 in the training/validation dataframes for the language model dataset? Given that these are ignored in language modeling, I don’t understand why we don’t just use the labels as is.

  2. In def get_texts(df, n_lbls=1): you add a \nxbos xfld 1 to the beginning of each document, but why? And is there a reason you don’t include an EOS tag?

I guess Jeremy mentioned in the lesson that these tags signal to the network that a new text block or field has started, so it can (learn to) reset its internal state.
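If it helps, here’s a rough sketch of what that tagging step might look like (the exact marker strings and the tag_texts helper are my assumptions based on the xbos / xfld strings quoted above, not the gist’s actual code):

```python
# hypothetical markers, matching the xbos / xfld strings in the gist
BOS = 'xbos'   # beginning-of-document marker
FLD = 'xfld'   # field marker, numbered per text column

def tag_texts(texts):
    # prepend a BOS tag and a field-1 tag to each document, so the
    # model sees an explicit "new document starts here" token
    return [f'\n{BOS} {FLD} 1 ' + t for t in texts]

print(tag_texts(['first doc', 'second doc']))
```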

I was also wondering about the labels = 0 step, but I don’t have an answer either. Maybe the labels are not ignored during LM training and therefore must all be set to the same value?

Best regards

I remember Jeremy mentioning somewhere that since we don’t need a dependent category variable y in the language model, we just set the labels all to 0. Hopefully this helps :slight_smile:
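So the dataframe prep presumably looks something like this (a toy sketch, with made-up texts, just to show the all-zeros label column):

```python
import pandas as pd

# toy docs; for the language model the real labels don't matter
texts = ['first doc', 'second doc']

# set every label to 0 as a placeholder, since the LM only predicts
# the next token and never reads the label column
df_trn = pd.DataFrame({'labels': [0] * len(texts), 'text': texts})
print(df_trn)
```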
