Lesson 4 official topic

This post is for topics related to lesson 4 of the course. This lesson is based partly on chapter 4 of the book, and partly on chapter 10 (but using totally new material).

This is a wiki post - feel free to edit to add links from the lesson or other useful info.

<<< Lesson 3 | Lesson 5 >>>

Lesson resources

17 Likes

Hi Jeremy,

In the notebook Getting started with NLP for absolute beginners, you mention:

Transformers uses a DatasetDict for holding your training and validation sets. To create one that contains 20% of our data for the validation set, and 80% for the training set, use train_test_split :

But the code that follows reads:
dds = tok_ds.train_test_split(0.25, seed=42)

Shouldn’t the 0.25 be changed to 0.2, so that 20% of the data is used for the validation set, matching the write-up directly above?

6 Likes
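For reference, `train_test_split` on a Hugging Face `DatasetDict` takes a fraction, not a percentage, so a 20% validation set is `test_size=0.2`. Here is a minimal pure-Python sketch of what such a seeded 80/20 split does (an illustration only, not the `datasets` library's implementation):

```python
import random

def train_test_split(items, test_size=0.2, seed=42):
    """Split items into train/test subsets, mimicking the behaviour of
    an 80/20 DatasetDict-style split: shuffle deterministically, then
    hold out a fraction of the rows for validation."""
    items = list(items)
    random.Random(seed).shuffle(items)       # deterministic shuffle
    n_test = round(len(items) * test_size)   # e.g. 20% for validation
    return {"train": items[n_test:], "test": items[:n_test]}

dds = train_test_split(range(100), test_size=0.2, seed=42)
print(len(dds["train"]), len(dds["test"]))  # 80 20
```

Passing `0.25` instead would hold out 25% of the rows, which is why the notebook text and code disagreed.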

Eagle eye! :slight_smile: Thanks for letting me know – will fix it now.

4 Likes

Lol! It’s 1:19AM in South Africa right now! :grinning:
You’re welcome.

4 Likes

Fast.ai Hackathon sounds like a great idea. Even as a regular event, maybe :thinking:

14 Likes

Could do various categories for the hackathon: NLP, tabular, images? Community votes on the forum.

How does spaCy compare with HF’s Transformers?

1 Like

If I’m not mistaken, spaCy is built into fastai; it’s the default tokeniser, etc.

2 Likes

spaCy is a language tokeniser rather than a transformer, so they can’t be directly compared. Text is split into tokens, often word fragments, which give the model a more efficient representation to learn from.

1 Like
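To illustrate the "word fragments" point: a subword tokeniser greedily matches the longest known piece of a word against its vocabulary, so rare words decompose into common fragments. A toy sketch follows; the vocabulary and the `##` continuation convention are made up here, loosely in the style of BERT's WordPiece, and real tokenisers (spaCy's, Hugging Face's) are far more sophisticated:

```python
def wordpiece(word, vocab):
    """Greedy longest-match subword split, WordPiece-style.
    Pieces after the first are marked with a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece matched this word
    return pieces

# Toy vocabulary: a few common fragments cover rarer surface forms.
vocab = {"token", "##ization", "##s", "play", "##ing"}
print(wordpiece("tokenization", vocab))  # ['token', '##ization']
print(wordpiece("playing", vocab))       # ['play', '##ing']
```

A word the vocabulary can't cover at all falls back to a single unknown token, which is exactly what subword splitting is designed to make rare.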

Q: how do you go from a model that is predicting the next word to one that can do classification?

6 Likes

Are there any specific situations where ULMFiT works better? Are Transformers now the new default, such that you wouldn’t try ULMFiT unless the Transformer didn’t work well?

1 Like

Q: What is the Transformer architecture, and why was it better than ULMFiT?

5 Likes

Question:
What corpora are Transformers’ pre-trained models based on? ULMFiT is based on Wikipedia…

1 Like

A language model that predicts what comes next from the past (autoregressive) can be ‘pre-trained’ on a large generic language corpus to learn the many little rules of the language. It can potentially be a ‘zero-shot’ learner, where you could now ask it, in text, to classify something, and it may answer correctly.
However, often you would follow the pre-training stage with training on many examples of the specific task it should be good at (like answering questions).

1 Like
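A minimal illustration of "predicting what comes next from the past": a bigram counter trained on a tiny corpus. This toy sketch is nothing like a real pretrained transformer, but it shows the autoregressive idea of predicting the next word from what came before:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count word -> next-word transitions (a crude language model)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Predict the most frequent continuation seen during training."""
    return counts[word].most_common(1)[0][0] if counts[word] else None

corpus = ["the cat sat on the mat", "the cat ate the fish"]
model = train_bigram(corpus)
print(predict_next(model, "the"))  # 'cat'
```

Real pretraining does the same thing in spirit, learning from billions of words with a neural network instead of raw counts, and fine-tuning then specialises those learned statistics for a downstream task.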

What are the possible approaches for handling “non-natural” languages? E.g. a machine that spits out a series of (diagnostic) codes.

2 Likes

If you need to create a model that classifies text sentiment for a language other than English, how would you proceed? Fine-tune an English-trained model on that language, or start from scratch? Use a multi-language model and fine-tune? It probably depends on the language’s grammar? Moving from English to German should be a good idea, I believe, but not so good for non-alphabetic languages, for example.

3 Likes

It depends on the model. For example BERT, one of the common transformer models, was pretrained on the following data:

The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers).

(From here)

5 Likes

There was a huge chart of pretrained image models (PyTorch Image Models). Is there an equivalent for NLP models?

3 Likes

Do you mean like a plot of performance or like a hub with all models?