If you want to try to figure it out and summarize your best understanding, I’d be happy to fill in any missing pieces for you. If you’re not familiar with the CS concept of ‘reduce’ you may want to google that…
FYI this is called a “fold” or “reduce” operation. You can learn more about them, including why you need to specify the initial [] starting point, here: Fold (higher-order function) - Wikipedia
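For a concrete picture, here is a toy Python example of a reduce/fold with an explicit [] starting value (the data here is made up, it's not from the notebook):

```python
from functools import reduce

# Toy example: concatenate several token lists into one flat list.
# The third argument ([]) is the initial accumulator. Without it,
# reduce would use the first element of the sequence as the starting
# value, which only works if that element already has the type you
# want to accumulate into.
token_lists = [['xbos', 'xfld', '1'], ['this', 'movie'], ['was', 'great']]

flat = reduce(lambda acc, toks: acc + toks, token_lists, [])
print(flat)  # ['xbos', 'xfld', '1', 'this', 'movie', 'was', 'great']
```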
Why are you making all the labels = 0 in the training/validation dataframes for the language model dataset? Given that these are ignored in language modeling, I don’t understand why we don’t just use the labels as is.
In def get_texts(df, n_lbls=1): you add a \nxbos xfld 1 to the beginning of each document, but why? And is there a reason you don't include an EOS tag?
I think Jeremy mentioned in the lesson that these tags signal to the network that a new text block or field has started, so it can (learn to) reset its internal state.
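Roughly, the tag-prepending step looks like this. This is my own simplified sketch of the idea (the helper name prepend_tags and the toy data are mine, not the notebook's exact code):

```python
import pandas as pd

BOS = 'xbos'   # marks the beginning of a document
FLD = 'xfld'   # marks the start of a field within a document

def prepend_tags(df, n_lbls=1):
    """Simplified sketch: join the text columns of df into one string per
    row, inserting xbos/xfld markers so the model can learn that a new
    document or field has started."""
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls + 1, len(df.columns)):
        texts += f' {FLD} {i - n_lbls + 1} ' + df[i].astype(str)
    return texts.values

# toy data: one label column followed by two text fields
df = pd.DataFrame([[0, 'great movie', 'would watch again'],
                   [1, 'terrible', 'fell asleep']])
print(repr(prepend_tags(df)[0]))
# '\nxbos xfld 1 great movie xfld 2 would watch again'
```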
I was also wondering about the labels = 0 step, but I don't have an answer either. Maybe the labels are not ignored during LM training and therefore must all be set to the same value?
I remember Jeremy mentioning somewhere that since the language model doesn't need a dependent category variable y, we just set them all to 0. Hopefully this helps.
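In other words, the label column is just a placeholder so the dataframes keep the (label, text) shape the rest of the code expects; the LM's real target at each position is simply the next token. Something like this illustrative sketch (the variable names and toy texts are mine, not the notebook's exact code):

```python
import numpy as np
import pandas as pd

# Toy texts standing in for the train/validation reviews. For the
# language model we can pool them, because the LM never uses the
# sentiment labels -- it only predicts the next token of each text.
trn_texts = np.array(['great movie', 'terrible film'])
val_texts = np.array(['would watch again'])

all_texts = np.concatenate([trn_texts, val_texts])

# Keep the (labels, text) column layout the downstream code expects,
# but fill the label column with a constant 0 placeholder.
df_lm = pd.DataFrame({'labels': [0] * len(all_texts), 'text': all_texts},
                     columns=['labels', 'text'])
print(df_lm)
```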