Part 2 Lesson 10 wiki


(Jeremy Howard) #256

Good question! I discovered that the test set of the TREC-6 dataset is so small that nearly all reported differences in the literature are statistically meaningless. I think it’s odd that people didn’t report on this - although in the end our paper didn’t mention it either due to space constraints!

However nearly all modern datasets are big enough that confidence intervals are so tiny as to not be an issue.


(Jeremy Howard) #257

Yeah it’s basically learning all the details of how the stuff we briefly saw in lesson 4 actually works, along with learning how to do that on larger datasets, faster, using the new fastai.text library (that didn’t exist in part 1). Along with transfer learning on wt103 of course.


(Jeremy Howard) #258

That’s why I’m recommending running VNC on the server: that way you don’t have to log output and you can’t lose data.


(Jeremy Howard) #259

Wiki edits don’t require approval from deities - go ahead and edit!


(Jeremy Howard) #260

Excellent summary.


(Jeremy Howard) #261

It’s currently going through review, so I can’t share it yet. But the results are all in the ppt.


(chunduri) #262

Thanks for the great video, which explains perplexity very clearly.
One question: why is the perplexity of unigrams > bigrams > trigrams?


(Arnav) #264

@chunduri An n-gram model looks at the previous n-1 words to predict the next one, so a unigram model uses no context at all (just overall word frequencies), a bigram model looks at the previous word, a trigram at the previous two, and so on. The number of plausible choices for the next word shrinks significantly as we condition on more previous words, e.g. 'a ___' vs. 'drink a ___': the second blank can be filled with far fewer things than the first. So if we look at perplexity as a branching factor, as explained in the video, the number of branches goes down as n grows, and hence the perplexity is lower.
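To make the branching-factor picture concrete, here is a tiny sketch with made-up toy probabilities (not from any real model): perplexity is just the exponential of the average negative log-probability the model assigns to each next word, so the more certain the model is at each step, the lower it gets.

```python
import math

def perplexity(next_word_probs):
    """exp of the average negative log-probability per token --
    roughly the average number of equally likely 'branches' per step."""
    return math.exp(-sum(math.log(p) for p in next_word_probs) / len(next_word_probs))

# Toy numbers: a unigram model ignores context, so it spreads probability over
# many words; a trigram model that has seen 'drink a' is far more certain.
print(perplexity([0.001, 0.002, 0.001]))  # ~794 -- unigram-like uncertainty
print(perplexity([0.30, 0.40, 0.25]))     # ~3.2 -- trigram-like certainty
```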


(Even Oldridge) #265

You only need a pid if you have multiple screens. At least that’s how it works on the Ubuntu system I’m connecting to.


(Even Oldridge) #266

nohup works too afaik but I haven’t used it.


(Arvind Nagaraj) #267

I use it all the time. I leave notebooks running on a remote server as a nohup process for days; I just refresh my browser page and pick up from where I left off, and all the variables are still there.


(Vesk D.) #268

df_trn = pd.DataFrame({'text':trn_texts, 'labels':[0]*len(trn_texts)}, columns=col_names)
The above line in the imdb notebook seems to make all the labels equal to 0 in the data frame. Is that a bug or am I missing something here?
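For reference, here is a minimal toy version of what that line produces (the texts and col_names here are made-up stand-ins for the notebook’s variables):

```python
import pandas as pd

# Toy stand-ins for the notebook's variables
trn_texts = ['great movie', 'terrible plot', 'loved it']
col_names = ['labels', 'text']

df_trn = pd.DataFrame({'text': trn_texts, 'labels': [0] * len(trn_texts)},
                      columns=col_names)
print(df_trn['labels'].unique())   # [0] -- every row gets the same 0 label, as observed above
```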


(James Requa) #269

@sermakarevich you should try using a LM with your toxic comment solution and see how much it (hopefully) improves :slight_smile:


(Jeremy Howard) #270

Sigh… after finishing the paper I tried the LM approach on the toxic comment comp, and found it was the first dataset where it didn’t help :frowning:

The issue, I’m guessing, is that many of the labels are extremely rare. But I didn’t have time to study it closely.


(Jeremy Howard) #271

I’ve posted the video to the top post now.


(James Requa) #272

I wonder if it would help the language model at all to include some attempt at representing the etymology of words, i.e. Latin, Greek, etc. Or is that just completely crazy?


(Vesk D.) #273

In the imdb notebook inside get_texts(df, n_lbls=1)
the following line:
for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)
I feel it should be changed to:
for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls+1} ' + df[i].astype(str)
Otherwise we will end up with 2 fields that have xfld=1
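To double-check, here is a self-contained toy version of that loop (assuming the notebook’s markers BOS = 'xbos' and FLD = 'xfld', and a frame with one label column followed by two text columns), which shows why the +1 is needed:

```python
import pandas as pd

BOS, FLD = 'xbos', 'xfld'   # begin-of-text and field markers as in the notebook

# Toy frame: one label column followed by two text columns,
# mimicking the layout after read_csv(..., header=None)
df = pd.DataFrame({0: [1], 1: ['first field'], 2: ['second field']})
n_lbls = 1

texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)   # first text column -> 'xfld 1'
for i in range(n_lbls + 1, len(df.columns)):
    # With the original {i - n_lbls} this would emit 'xfld 1' again;
    # the +1 numbers the remaining columns 'xfld 2', 'xfld 3', ...
    texts += f' {FLD} {i - n_lbls + 1} ' + df[i].astype(str)

print(texts[0])   # xbos xfld 1 first field xfld 2 second field
```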


(chunduri) #274

You mean root words, which could be common to different language groups? That sounds like a great idea.
Jeremy was talking about sub-words in class, which divide each word into its roots; I think that is close to this idea.


(Christine) #275

I’m struggling to keep Focal Loss from running out of memory (I’m trying to rewrite it, since there are so many target classes here). I’m running the hinge loss version now (that was easier, since there’s already a version in PyTorch).
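For reference, a minimal sketch of a plain binary focal loss in PyTorch (gamma=2 is the default suggested in the focal loss paper; this is just the generic formulation, not the rewrite I’m working on):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Binary focal loss: down-weight easy examples by (1 - p_t)^gamma.
    # bce is -log(p_t), so exp(-bce) recovers p_t without a separate sigmoid/log.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = torch.exp(-bce)
    return ((1 - p_t) ** gamma * bce).mean()

# Usage: logits and multi-hot targets of shape (batch, n_classes)
logits = torch.randn(4, 6)
targets = torch.randint(0, 2, (4, 6)).float()
print(focal_loss(logits, targets))
```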


(Miguel Perez Michaus) #276

Actually, I needed this lesson: the emphasis on the conceptual difference between language models and custom embeddings.

Somehow I didn’t get such a clear picture after part 1. My mental summary after lesson 4 of part 1 was “ok, custom embeddings”. So wrong! (My bad; I’ve rewatched the lesson and it was all already there, crystal clear.)

But now, finally, after this lesson I think I got the “crux” of the language model approach to transfer learning. I usually figure that if I can’t summarize an idea in a few simple sentences, I probably don’t really have the idea, so I would tentatively summarize it like this:

- It is, but not so much, about custom embeddings “initialized” by learning the structure of English.
- It is, but not so much, about letting custom embeddings learn the classification task.

- It is, much more, about both tasks sharing the architecture.

I will probably reconsider this summary after a couple more rewatches of the lesson, but as I said, it was really useful all the times both Rachel and Jeremy emphasized “we are not using embeddings, but a language model”. After hearing it four or five times, the “heads up” worked. :grinning: