Part 2 Lesson 10 wiki

df_trn = pd.DataFrame({'text':trn_texts, 'labels':[0]*len(trn_texts)}, columns=col_names)
The above line in the imdb notebook seems to make all the labels equal to 0 in the data frame. Is that a bug or am I missing something here?
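For reference, the surrounding cell looks roughly like this (a sketch from memory; col_names and the val_texts counterpart are assumptions based on the notebook):

import pandas as pd

trn_texts = ['a great movie', 'a terrible movie']   # stand-ins for the real review texts
val_texts = ['an ok movie']
col_names = ['labels', 'text']

# every row gets the literal label 0 at this point
df_trn = pd.DataFrame({'text': trn_texts, 'labels': [0]*len(trn_texts)}, columns=col_names)
df_val = pd.DataFrame({'text': val_texts, 'labels': [0]*len(val_texts)}, columns=col_names)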

@sermakarevich you should try using a LM with your toxic comment solution and see how much it (hopefully) improves :slight_smile:

1 Like

Sigh... after finishing the paper I tried the LM approach on the toxic comment comp, and found it was the first dataset where it didn't help :frowning:

The issue, I’m guessing, is that many of the labels are extremely rare. But I didn’t have time to study it closely.

2 Likes

I’ve posted the video to the top post now.

1 Like

I wonder if it would help the language model at all to include some attempt at representing the etymology of words, i.e. Latin, Greek, etc. Or is that just completely crazy?

In the imdb notebook, inside get_texts(df, n_lbls=1),
the following line:
for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)
I feel should be changed to:
for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls+1} ' + df[i].astype(str)
Otherwise we end up with two fields that have xfld=1.
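To illustrate, here is a small sketch of the loop with that change applied (toy DataFrame; BOS/FLD markers as in the notebook):

import pandas as pd

BOS, FLD = 'xbos', 'xfld'   # notebook's beginning-of-text and field markers
n_lbls = 1
# toy frame: column 0 is the label, columns 1 and 2 are two text fields
df = pd.DataFrame([[0, 'first field', 'second field']])

# the first text column is tagged as field 1
texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
# the remaining columns get fields 2, 3, ... instead of repeating field 1
for i in range(n_lbls+1, len(df.columns)):
    texts += f' {FLD} {i-n_lbls+1} ' + df[i].astype(str)

print(texts[0])   # '\nxbos xfld 1 first field xfld 2 second field'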

You mean root words, which could be common to different language groups? That sounds like a great idea.
Jeremy was talking about sub-words in class, which divide each word into its roots; I think that is close to this idea.

1 Like

I'm struggling to keep Focal Loss from running out of memory (I'm trying to rewrite it, since there are so many target classes here). I'm running the hinge loss version now (that was easier since there's already a version in PyTorch).

Actually, I needed this lesson, specifically its emphasis on the conceptual difference: language models vs. custom embeddings.

Somehow I didn't get such a clear picture after part 1. My mental summary after lesson 4 of part 1 was "ok, custom embeddings". So wrong! (My bad: I've rewatched the lesson and it was all already there, crystal clear.)

But now, finally, after this lesson I think I've got the "crux" of the language-model approach to transfer learning. I usually reckon that if I can't summarize an idea in a few simple sentences, I probably don't really have the idea, so I'll tentatively try to summarize:

- It is, but not so much, about custom embeddings "initialized" by learning the structure of English.
- It is, but not so much, about letting custom embeddings learn the classification task.

- It is, much more, about both tasks sharing the architecture.

I will probably reconsider this summary after a couple more rewatches of the lesson, but as I said, it was really useful all the times Rachel and Jeremy emphasized "we are not using embeddings, but a language model". After four or five times of hearing it, the "heads up" worked. :grinning:

1 Like

You could just decrease the vocab size, at least for testing purposes. A vocab size of about 30,000 should still work OK.
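Something along these lines, roughly (a sketch of the vocab-capping step in the spirit of the notebook; tok_trn is a toy stand-in for the tokenized training texts):

from collections import Counter

max_vocab, min_freq = 30000, 2
tok_trn = [['this', 'movie', 'was', 'great'], ['this', 'movie', 'was', 'bad']]   # toy stand-in

freq = Counter(tok for sent in tok_trn for tok in sent)
# keep only the max_vocab most frequent tokens seen at least min_freq times
itos = [tok for tok, c in freq.most_common(max_vocab) if c >= min_freq]
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')
stoi = {tok: i for i, tok in enumerate(itos)}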

Thanks - I tried that at first with a very abbreviated training run at 20,000, but then thought I might be able to sort out a version that would work with the full vocab. I think I need to figure out a way to iterate faster because right now it’s slow going from idea to results. I’ll go back to the smaller vocab size for now.

(ie, I suspect I might need to adjust alpha inside the focal loss function, but it’s a slow loop to go from changing that to seeing if the final results are better)
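For anyone following along, the usual formulation looks roughly like this (a generic PyTorch sketch for multi-label targets, not my actual rewrite; alpha is the balancing factor I mean):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # per-element binary cross-entropy, kept unreduced so we can reweight it
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = torch.exp(-bce)                               # probability of the true class
    alpha_t = alpha*targets + (1 - alpha)*(1 - targets)
    # (1 - p_t)**gamma down-weights easy examples; alpha_t balances the classes
    return (alpha_t * (1 - p_t)**gamma * bce).mean()

logits = torch.randn(4, 6)                              # e.g. 6 toxic-comment labels
targets = torch.randint(0, 2, (4, 6)).float()
print(focal_loss(logits, targets))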

Yes, the root of the word is what I was thinking of. I feel like what Jeremy was talking about was lemmatization, which isn't quite what I was referring to, because the resulting sub-word(s) aren't always mapped to some meaningful root word; instead they might just be arbitrarily generated. However, if we borrowed from the idea behind SentencePiece and pre-trained a model on word root origins (https://www.learnthat.org/pages/view/roots.html), using that as the basis for sub-dividing words, I feel like it might do better, because then it has a lot more contextual information.

I was also thinking it might be interesting to look at "snapshots" or timestamps of a language over a particular duration of time. The model kind of gets to time-travel and see how a particular language evolved over time, thus perhaps picking up on some interesting semantics and the reasoning behind why words and sentences have become structured the way they are. This could also end up indirectly embedding cultural influences.

2 Likes

This paper has some great visualizations of changing word embeddings over time using New York Times articles; it's a very cool method and I bet it could be applied in other cases: https://arxiv.org/abs/1703.00607

4 Likes

Here’s another one tracking gender and ethnic stereotypes in the US: https://arxiv.org/abs/1711.08412

4 Likes

wow, these are both amazing thanks! Glad to know I am not so crazy after all :slight_smile:

1 Like

Citation needed

10 Likes

Thanks for the instructions. After downloading the data, one may need to add encoding='utf8' when opening the files in get_texts.
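i.e. something along these lines (a sketch, assuming this refers to the get_texts that reads the raw IMDB files; CLASSES and the paths follow the notebook, so treat them as assumptions):

import numpy as np
from pathlib import Path

CLASSES = ['neg', 'pos', 'unsup']

def get_texts(path):
    texts, labels = [], []
    for idx, label in enumerate(CLASSES):
        for fname in (path/label).glob('*.*'):
            # passing the encoding explicitly avoids locale-dependent decode errors
            texts.append(fname.open('r', encoding='utf8').read())
            labels.append(idx)
    return np.array(texts), np.array(labels)

# example usage (path as in the notebook):
# trn_texts, trn_labels = get_texts(Path('data/aclImdb')/'train')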


3 Likes

Another option may be to set these environment variables:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

It worked for me

1 Like

What is the expected time per epoch? I am 20% of the way through one epoch and 16 minutes have already passed. Is this normal?

I am on Google Cloud, K80, 26 GB RAM, 4 vCPU.

1 Like