Part 2 Lesson 10 wiki

There’s probably some difference between the vocabulary for Wikipedia and for IMDB. Does this matter?

3 Likes

Is word2vec like pretraining?

3 Likes

Do these pretrained models dictate the architecture and embedding sizes?

3 Likes

I tried this earlier… hmm, it’s finicky.

What is the wt_103 model? Is it an AWD LSTM again?

4 Likes
http://files.fast.ai/robots.txt

does not exist, whatever it is.
… just saying.

That’s accounted for and incorporated by fine-tuning the model. The idea behind transfer learning is that we can reuse much of the knowledge of a previously trained network for a new task, but it doesn’t have to fit exactly :grinning:
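
To make that concrete, here is a minimal PyTorch sketch (not the lesson’s fastai code; the toy architecture, the `wt103_lm.pth` path and the learning rate are all assumptions): load the pretrained weights as a starting point, then fine-tune the whole model on the new corpus at a small learning rate.

    import torch
    import torch.nn as nn

    # Toy language model standing in for the real AWD-LSTM (sizes are assumptions).
    class SimpleLM(nn.Module):
        def __init__(self, vocab_size, emb_size=400, hidden=1150):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_size)
            self.rnn = nn.LSTM(emb_size, hidden, batch_first=True)
            self.dec = nn.Linear(hidden, vocab_size)

        def forward(self, x):
            out, _ = self.rnn(self.emb(x))
            return self.dec(out)

    model = SimpleLM(vocab_size=60_000)

    # Start from pretrained weights (hypothetical checkpoint path), then
    # fine-tune everything with a small learning rate instead of training
    # from scratch -- the pretrained network is a starting point, not an
    # exact fit for the new corpus.
    state = torch.load('wt103_lm.pth', map_location='cpu')
    model.load_state_dict(state)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)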

Sorry, I’m still getting this random error after updating my pip install when selecting the Python kernel (selected at root). Maybe something is wrong with my version of Python for this notebook?

    Notebook Validation failed: {u'model_id': u'e3839cad2e23478a84362ed0a931abf1', u'version_minor': 0, u'version_major': 2} is not valid under any of the given schemas:

    {
      "model_id": "e3839cad2e23478a84362ed0a931abf1",
      "version_minor": 0,
      "version_major": 2
    }

You have to match up the indices manually.
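
In case it helps, here is a toy numpy sketch of what “matching up the indices” looks like (all names and data here are illustrative, not the notebook’s exact code):

    import numpy as np

    # Copy each word's pretrained embedding row into that word's position in
    # the new corpus's vocabulary; words the pretrained vocab never saw get
    # the mean row as a reasonable default.
    itos_wiki = ['the', 'movie', 'paris']                   # pretrained (wikitext-103-style) vocab
    stoi_wiki = {w: i for i, w in enumerate(itos_wiki)}
    enc_wgts  = np.random.randn(len(itos_wiki), 4).astype(np.float32)  # pretend pretrained embeddings
    row_mean  = enc_wgts.mean(axis=0)

    itos_new = ['the', 'movie', 'unwatchable']              # target-corpus (IMDB-style) vocab
    new_wgts = np.zeros((len(itos_new), enc_wgts.shape[1]), dtype=np.float32)
    for i, word in enumerate(itos_new):
        r = stoi_wiki.get(word, -1)
        new_wgts[i] = enc_wgts[r] if r >= 0 else row_mean   # mean row for unseen words

`new_wgts` then becomes the embedding (and tied decoder) weight matrix of the model you go on to fine-tune.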

Hey Jeremy,

Hindi Language Models (might) work too!

I understand why people didn’t trust language models to be effective. I work on Hindi texts (about 300M native speakers). I used the Hindi Wikipedia to build a simple language model to get started.

There are no standard Hindi text classification datasets, so I am making two. All the code and language models are available on GitHub: https://github.com/NirantK/hindi2vec

Shout-out to your FitLaM work, which made this really easy to do!
If anyone is interested in contributing, please check what I am up to in the README :slight_smile:

25 Likes

Are the weird special tokens (like ‘t_up’) also the same between the two language models? If not, it seems those would have to be relearned and we’d lose all that benefit of pre-training.

3 Likes

Do you know if adding new vocabulary items to an existing language model and fine-tuning is very sensitive to initialization parameters? Why initialize with the mean rather than some mean+random distribution? Does the model have issues confusing multiple new vocab words if they are initialized to the same embedding value?

3 Likes

Not a related question, but can we use deep learning to resolve entities?

1 Like

Yes Paras, there is some work on entity resolution using seq2seq models and similar approaches.

Please check the coreference work from AllenAI here: http://allennlp.org/models

4 Likes

Okay, I don’t get it.

So, the word embeddings are pretrained and loaded from a pickle into itos2. Okay.

But what happens to a previously unseen word?

Is the code for the wikitext-103 model available somewhere?

1 Like

It gets mapped to the <UNK> (unknown) token.
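
For example, the vocabulary lookup can be built as a defaultdict so that anything out of vocabulary falls back to index 0, the unknown token (a toy sketch with made-up words, not the notebook’s exact code):

    import collections

    itos = ['_unk_', '_pad_', 'the', 'movie']                 # toy vocabulary, index 0 = unknown
    stoi = collections.defaultdict(lambda: 0,
                                   {w: i for i, w in enumerate(itos)})

    print(stoi['movie'])     # 3
    print(stoi['zyzzyva'])   # 0 -> falls back to the unknown token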

6 Likes

What does ‘t//64’ stand for after getting the length of all the concatenated texts? Is it related to the batch size, even though the batch size seems to be 52?

1 Like

Yes. The tokenizer should be common for all texts; vocab indices need to be matched manually across texts.

2 Likes

I think Jeremy used the same code as for IMDB.

1 Like