This post is for topics related to lesson 4 of the course. This lesson is based partly on Chapter 4 of the book, and partly on Chapter 10 (but using totally new material).
This is a wiki post - feel free to edit to add links from the lesson or other useful info.
Transformers uses a DatasetDict for holding your training and validation sets. To create one that contains 20% of our data for the validation set, and 80% for the training set, use train_test_split:
But the code that follows this is: dds = tok_ds.train_test_split(0.25, seed=42)
Shouldn't the 0.25 be changed to 0.2 in the train_test_split call, so that 20% of the data goes to the validation set and the code matches the write-up directly above?
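For reference, the first argument to train_test_split is the test-size fraction, so 0.25 gives a 75/25 split while 0.2 gives the 80/20 split described in the write-up. Here is a minimal sketch, using a toy dataset in place of tok_ds (the real one comes from tokenising the competition data in the lesson notebook):

```python
from datasets import Dataset

# Toy stand-in for tok_ds, just to show the split behaviour.
ds = Dataset.from_dict({"input": list(range(100)), "label": [0, 1] * 50})

# A float test_size is a fraction of the rows: 0.2 keeps 80% for training
# and 20% for validation, whereas 0.25 would give a 75/25 split.
dds = ds.train_test_split(test_size=0.2, seed=42)
print(len(dds["train"]), len(dds["test"]))   # 80 20
```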
spaCy is a tokeniser (part of a general NLP pipeline library) rather than a transformer, so the two can't be directly compared. Transformer tokenisers split text into tokens that are often subword fragments (fractions of words), which presents a more efficient representation for the model to learn from.
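As a quick illustration, you can inspect the subword pieces a Hugging Face tokeniser produces. The bert-base-uncased checkpoint here is just an assumed example; any pretrained model's tokeniser behaves similarly, though the exact pieces differ:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Common words tend to stay whole, while rarer or longer words get split into
# "##"-prefixed subword pieces. The exact split depends on the checkpoint's vocabulary.
print(tokenizer.tokenize("Tokenisers often split uncommon words into pieces"))
```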
Are there any specific situations where ULMFiT works better? Are Transformers now the new default, and you wouldn't try ULMFiT unless the Transformer didn't work well?
A language model that predicts what comes next from what came before (autoregressive) can be 'pre-trained' on a large generic language corpus to learn the many little rules of the language. It can then potentially be a 'zero-shot' learner: you could ask it, in text, to classify something, and it may answer correctly even though it was never explicitly trained to do so.
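As a concrete illustration, the transformers library has a zero-shot classification pipeline. This is only a hedged sketch: the example text, the candidate labels, and the facebook/bart-large-mnli checkpoint are assumptions for illustration, not something from the lesson:

```python
from transformers import pipeline

# The model has never been trained on these particular labels, yet it can
# often pick the right one from the text alone.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The battery on this phone barely lasts half a day.",
    candidate_labels=["positive review", "negative review"],
)
print(result["labels"][0])   # most likely: "negative review"
```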
However, you would often follow the pre-training stage with fine-tuning on many examples of the specific task it should be good at (like answering questions).
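A minimal sketch of that second stage using Hugging Face's Trainer is below. The bert-base-uncased checkpoint and the IMDb sentiment dataset are purely stand-ins here; any sequence-classification task follows the same pattern:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Assumed checkpoint and dataset for illustration only.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

ds = load_dataset("imdb")                      # movie-review sentiment, labels 0/1

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

tok_ds = ds.map(tokenize, batched=True)

# Use a small shuffled subset so the sketch runs quickly.
train_ds = tok_ds["train"].shuffle(seed=42).select(range(2000))
eval_ds = tok_ds["test"].shuffle(seed=42).select(range(500))

args = TrainingArguments("outputs", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=1)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds,
                  eval_dataset=eval_ds,
                  data_collator=DataCollatorWithPadding(tokenizer))
trainer.train()
```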
If you need to create a model that classifies text sentiment for a language other than English, how would you proceed? Fine-tune an English-trained model on that language, or start from scratch? Or use a multilingual model and fine-tune? It probably depends on the language's grammar: moving from English to German should be a good idea, I believe, but not so good for languages that don't use an alphabet, for example.
It depends on the model. For example, the BERT model, which is one of the common transformer models, was pretrained on the following data:
The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers).
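If you go down the multilingual route, loading a multilingual checkpoint works exactly like loading an English one. A hedged sketch, assuming xlm-roberta-base (pretrained on text in roughly 100 languages) and a two-class sentiment task:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# xlm-roberta-base has seen many languages during pretraining, so it can be
# fine-tuned directly on non-English sentiment data. The checkpoint name and
# label count here are illustrative assumptions.
checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# From here, fine-tuning proceeds exactly as with an English model, just with
# texts and labels in the target language.
batch = tokenizer(["Das Essen war ausgezeichnet!"], return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)   # torch.Size([1, 2])
```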