Lesson 4 In-Class Discussion ✅

The language model we use is publicly available though, no?

Yes, correct, but I was thinking of using the library instead of writing a custom training loop and models =) Also, they provide some embeddings already.

Why not also train the language model on the unsupervised entries in the IMDB dataset?

If we don’t hold out a validation set, does that mean there is no possibility of overfitting when building the language model?

We do!

How do we expand the vocab from wikitext to medical records when using transfer learning? I'm assuming the vocab only contains high-frequency English words from Wikipedia.

A validation set is held out, just a smaller portion (10k reviews instead of 25k).

Is there a backwards pre-trained wiki103 model?

The new model we fine-tune will have new words in its vocab. That's fine; it will learn their meanings during fine-tuning.

You create the vocab from your own dataset.
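
(For what it's worth, fastai v1 handles this remapping for you when you load the pretrained model. A sketch, assuming a `data_lm` databunch built from your own corpus:)

```python
from fastai.text import *

# a sketch, assuming fastai v1: the pretrained wikitext-103 weights are
# remapped onto data_lm's vocab -- embedding rows for words present in both
# vocabs are copied over, and words new to your corpus start from the mean
# embedding, to be learned during fine-tuning
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)
```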

Is there an analogue of language models for images? Unsupervised learning on an image corpus? e.g. obstructing part of the image and trying to predict it.
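
(That is the idea behind self-supervised inpainting / context encoders. A rough sketch of the masking step in plain PyTorch, where `model` stands in for any hypothetical image-to-image network, not something from the lesson:)

```python
import torch
import torch.nn.functional as F

# a sketch of the idea: zero out a random patch and train the model to
# predict the missing pixels; `model` is a hypothetical image-to-image net
def masked_reconstruction_loss(model, imgs, patch=32):
    _, _, h, w = imgs.shape
    top  = torch.randint(0, h - patch, (1,)).item()
    left = torch.randint(0, w - patch, (1,)).item()
    x = imgs.clone()
    x[:, :, top:top+patch, left:left+patch] = 0.  # obstruct part of the image
    pred = model(x)                               # try to reconstruct it
    return F.mse_loss(pred[:, :, top:top+patch, left:left+patch],
                      imgs[:, :, top:top+patch, left:left+patch])
```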

Does that also work for title case and mixed case?

TextLMDataBunch no longer lets us set bs or max_vocab. How do we set those?
I guess we should use the new data block API, but how do we set them there?
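
(A sketch of one way to do it with fastai v1's data block API; the exact processor arguments may vary between point releases, and `path` is assumed to point at your text corpus:)

```python
from fastai.text import *

path = Path('data/imdb')  # assumed location of your texts

# batch size is passed to databunch(); vocab size is controlled by the
# NumericalizeProcessor
processors = [TokenizeProcessor(tokenizer=Tokenizer()),
              NumericalizeProcessor(max_vocab=30000)]

data_lm = (TextList.from_folder(path, processor=processors)
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=64))
```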

Can anyone find a source/citation for what Jeremy was saying about SwiftKey(?) and the generated LaTeX proofs?

When fitting on Wikipedia there is no risk of overfitting, because that is not the task we are going to test the model on. With IMDB, as Sylvain said, there is a validation set to guard against overfitting.

What is `moms`?

If we use another language, where do I set lang='pt', for example? And do I need to specify that it's going to use spaCy?
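
(In fastai v1 the tokenizer wraps spaCy by default, and the language is an argument to `Tokenizer`. A sketch, assuming spaCy's Portuguese model is installed:)

```python
from fastai.text import *

# a sketch, assuming fastai v1: Tokenizer uses SpacyTokenizer by default,
# and lang selects the spaCy language model (the 'pt' model must be installed)
tokenizer = Tokenizer(lang='pt')
processor = [TokenizeProcessor(tokenizer=tokenizer), NumericalizeProcessor()]
# pass processor=processor when building your TextList / databunch
```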

Momentums
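
(Specifically, the `moms` argument to `fit_one_cycle`: the pair of momentum values the one-cycle schedule sweeps between. A minimal sketch:)

```python
# momentum goes from the first value down to the second while the learning
# rate rises, then back up -- fastai v1's default is moms=(0.95, 0.85)
learn.fit_one_cycle(1, 1e-2, moms=(0.8, 0.7))
```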

Where are the English language punctuation rules defined?
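
(If this refers to fastai's tokenization rules: in fastai v1 the default pre/post rules are plain functions defined in `fastai.text.transform`, while punctuation splitting itself comes from the underlying spaCy tokenizer. A sketch, assuming the v1 `defaults` object:)

```python
from fastai.text import *

# a sketch, assuming fastai v1: inspect the default text-processing rules
print(defaults.text_pre_rules)   # e.g. fix_html, spec_add_spaces, ...
print(defaults.text_post_rules)  # e.g. replace_all_caps, deal_caps
```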

This competition is particularly strict: they're limiting external data to a pre-selected set of embeddings (see this discussion https://www.kaggle.com/c/quora-insincere-questions-classification/discussion/70978#418095 for example). But the untrained models and training techniques would still be of use, as @devforfu noted.