This post is for topics related to lesson 4 of the course. This lesson is based partly on Chapter 4 of the book, and partly on Chapter 10 (but using totally new material).
This is a wiki post - feel free to edit to add links from the lesson or other useful info.
Transformers uses a DatasetDict for holding your training and validation sets. To create one that contains 20% of our data for the validation set, and 80% for the training set, use train_test_split:
But the code that follows this is: dds = tok_ds.train_test_split(0.25, seed=42)
Shouldn't the 0.25 be changed to 0.2 in the train_test_split call, so that 20% of the data goes to the validation set and the code matches the write-up directly above?
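For reference, the first argument to train_test_split is the test-size fraction, so 0.25 gives a 75/25 split while 0.2 gives the 80/20 split described in the write-up. Here is a minimal sketch, using a toy dataset in place of tok_ds (the real one comes from tokenising the competition data in the lesson notebook):

```python
from datasets import Dataset

# Toy stand-in for tok_ds, just to show the split behaviour.
ds = Dataset.from_dict({"input": list(range(100)), "label": [0, 1] * 50})

# A float test_size is a fraction of the rows: 0.2 keeps 80% for training
# and 20% for validation, whereas 0.25 would give a 75/25 split.
dds = ds.train_test_split(test_size=0.2, seed=42)
print(len(dds["train"]), len(dds["test"]))   # 80 20
```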
spaCy is a tokeniser (part of a general NLP pipeline library) rather than a transformer, so the two can't be directly compared. Transformer tokenisers split text into tokens that are often subword fragments (fractions of words), which presents a more efficient representation for the model to learn from.
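As a quick illustration, you can inspect the subword pieces a Hugging Face tokeniser produces. The bert-base-uncased checkpoint here is just an assumed example; any pretrained model's tokeniser behaves similarly, though the exact pieces differ:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Common words tend to stay whole, while rarer or longer words get split into
# "##"-prefixed subword pieces. The exact split depends on the checkpoint's vocabulary.
print(tokenizer.tokenize("Tokenisers often split uncommon words into pieces"))
```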
Are there any specific situations where ULMFiT works better? Are Transformers now the new default, and you wouldn't try ULMFiT unless the Transformer didn't work well?
A language model that predicts what comes next from what came before (autoregressive) can be 'pre-trained' on a large generic language corpus to learn the many little rules of the language. It can then potentially be a 'zero-shot' learner: you could ask it, in text, to classify something, and it may answer correctly even though it was never explicitly trained to do so.
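As a concrete illustration, the transformers library has a zero-shot classification pipeline. This is only a hedged sketch: the example text, the candidate labels, and the facebook/bart-large-mnli checkpoint are assumptions for illustration, not something from the lesson:

```python
from transformers import pipeline

# The model has never been trained on these particular labels, yet it can
# often pick the right one from the text alone.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The battery on this phone barely lasts half a day.",
    candidate_labels=["positive review", "negative review"],
)
print(result["labels"][0])   # most likely: "negative review"
```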
However, you would often follow the pre-training stage with fine-tuning on many examples of the specific task it should be good at (like answering questions).
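A minimal sketch of that second stage using Hugging Face's Trainer is below. The bert-base-uncased checkpoint and the IMDb sentiment dataset are purely stand-ins here; any sequence-classification task follows the same pattern:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Assumed checkpoint and dataset for illustration only.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

ds = load_dataset("imdb")                      # movie-review sentiment, labels 0/1

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

tok_ds = ds.map(tokenize, batched=True)

# Use a small shuffled subset so the sketch runs quickly.
train_ds = tok_ds["train"].shuffle(seed=42).select(range(2000))
eval_ds = tok_ds["test"].shuffle(seed=42).select(range(500))

args = TrainingArguments("outputs", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=1)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds,
                  eval_dataset=eval_ds,
                  data_collator=DataCollatorWithPadding(tokenizer))
trainer.train()
```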
If you need to create a model that classifies text sentiment for a language other than English, how would you proceed? Fine-tune an English-trained model on that language, or start from scratch? Or use a multilingual model and fine-tune? It probably depends on the language's grammar: moving from English to German should be a good idea, I believe, but not so good for languages that don't use an alphabet, for example.
It depends on the model. For example, the BERT model, which is one of the common transformer models, was pretrained on the following data:
The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers).
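If you go down the multilingual route, loading a multilingual checkpoint works exactly like loading an English one. A hedged sketch, assuming xlm-roberta-base (pretrained on text in roughly 100 languages) and a two-class sentiment task:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# xlm-roberta-base has seen many languages during pretraining, so it can be
# fine-tuned directly on non-English sentiment data. The checkpoint name and
# label count here are illustrative assumptions.
checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# From here, fine-tuning proceeds exactly as with an English model, just with
# texts and labels in the target language.
batch = tokenizer(["Das Essen war ausgezeichnet!"], return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)   # torch.Size([1, 2])
```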