There’s probably some difference between the vocabularies for Wikipedia and for IMDB. Does this matter?
Is word2vec like pretraining?
Do these pretrained models dictate the architecture and embedding sizes?
I tried this earlier; hmm, it’s finicky.
What is the wt_103 model? Is it AWD-LSTM again?
http://files.fast.ai/robots.txt
does not exist, whatever it is.
… just saying.
That’s considered, and it’s handled by fine-tuning the model. The idea behind transfer learning is that we can reuse much of the knowledge in a previously trained network for a new task; it doesn’t have to fit exactly.
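To make that concrete, here is a minimal sketch of the freeze-then-unfreeze fine-tuning pattern in PyTorch. The encoder below merely stands in for a pretrained network, and every name (Classifier, head, etc.) is made up for illustration; this is not the fastai API.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """A new task-specific head on top of a (stand-in) pretrained encoder."""
    def __init__(self, pretrained_encoder, n_classes):
        super().__init__()
        self.encoder = pretrained_encoder        # the transferred knowledge
        self.head = nn.Linear(64, n_classes)     # new, randomly initialized layer

    def forward(self, x):
        return self.head(self.encoder(x))

# Stand-in for a pretrained network (in the lesson this would be the LM encoder).
encoder = nn.Sequential(nn.Linear(100, 64), nn.ReLU())
model = Classifier(encoder, n_classes=2)

# Phase 1: freeze the transferred weights and train only the new head.
for p in model.encoder.parameters():
    p.requires_grad = False

# Phase 2: unfreeze and fine-tune everything at a low learning rate,
# letting the pretrained weights adapt to the new task.
for p in model.encoder.parameters():
    p.requires_grad = True
```

The point is just that the pretrained weights are a starting point, not a fixed fit: they get adjusted to the new domain during fine-tuning.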
Sorry, I’m still getting this random error after updating my pip install, upon selecting Python (selected at root). Maybe something is wrong with my version of Python for this notebook?
Notebook Validation failed: {u'model_id': u'e3839cad2e23478a84362ed0a931abf1', u'version_minor': 0, u'version_major': 2} is not valid under any of the given schemas:
{
  "model_id": "e3839cad2e23478a84362ed0a931abf1",
  "version_minor": 0,
  "version_major": 2
}
You have to match up the indices manually.
Hey Jeremy,
Hindi Language Models (might) work too!
I understand why people didn’t trust language models to be effective. I work on Hindi texts (about 300M native speakers). I used the Hindi Wikipedia to build a simple language model to get started.
There are no standard Hindi text classification datasets, so I am making two. All the code and language models are available here on GitHub: https://github.com/NirantK/hindi2vec
Shout out to your FitLam work, which made this really easy to do!
If anyone is interested in contributing, please check what I am up to in the README.
Are the weird special tokens (like ‘t_up’) also the same between the two language models? If not, it seems those would have to be relearned and we’d lose all that benefit of pre-training.
Do you know if adding new vocabulary items to an existing language model and fine-tuning is very sensitive to initialization parameters? Why initialize with the mean rather than some mean+random distribution? Does the model have issues confusing multiple new vocab words if they are initialized to the same embedding value?
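For reference, the mean-initialization scheme being discussed looks roughly like this: pretrained rows are copied over for words the old vocab knows, and every new word starts at the row-mean of the pretrained matrix. Function and variable names here are illustrative, not fastai’s API.

```python
import numpy as np

def expand_embeddings(old_emb, old_itos, new_itos):
    """Build an embedding matrix for new_itos from a pretrained one.

    Words present in the old vocab keep their pretrained row; words the
    pretrained model never saw are initialized to the mean of all
    pretrained rows (so yes, several new words start out identical and
    must be pulled apart by fine-tuning).
    """
    row_mean = old_emb.mean(axis=0)
    old_stoi = {w: i for i, w in enumerate(old_itos)}
    new_emb = np.empty((len(new_itos), old_emb.shape[1]), dtype=old_emb.dtype)
    for i, w in enumerate(new_itos):
        j = old_stoi.get(w)
        new_emb[i] = old_emb[j] if j is not None else row_mean
    return new_emb

old_emb = np.array([[1.0, 1.0], [3.0, 3.0]])   # tiny made-up pretrained matrix
new_emb = expand_embeddings(old_emb, ['a', 'b'], ['a', 'c'])
# 'a' keeps its pretrained row; 'c' gets the mean row [2.0, 2.0]
```

Adding noise on top of the mean (mean + small random offset) is a reasonable variant to break the symmetry between new words from the start.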
Not a related question, but can we use deep learning to resolve entities?
Yes Paras, there is some work on entity resolution using seq2seq models and similar approaches.
Please check the coref work from Allen AI here: http://allennlp.org/models
Okay, I don’t get it.
So the word embeddings are pretrained, and the vocab is loaded from a pickle into itos2. Okay.
But what happens to a previously unseen word?
Is the code for the wikitext-103 model available somewhere?
It gets mapped to the <UNK> (unknown) token.
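A minimal sketch of how that fallback is typically implemented: the string-to-index map is a defaultdict whose default is the <UNK> index, so any out-of-vocabulary word silently maps to it. The vocab here is made up for illustration.

```python
import collections

itos = ['<UNK>', '<PAD>', 'the', 'movie']   # index → string
# defaultdict(int) returns 0 for missing keys, and 0 is the <UNK> index
stoi = collections.defaultdict(int, {w: i for i, w in enumerate(itos)})

stoi['movie']     # 3
stoi['zyzzyva']   # 0, i.e. <UNK>
```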
What does ‘t//64’ stand for, after getting the length of all the concatenated texts? Is it associated with a batch size, even though the batch size seems to be 52?
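For what it’s worth, in language-model batching a long token stream of length t is usually split into bs parallel columns of t // bs tokens each, so t // 64 would be the per-column token count if 64 were the split factor (I’m not certain that’s its role in this notebook). A sketch of that standard reshape, with illustrative names:

```python
import numpy as np

def batchify(stream, bs):
    """Split one long token stream into bs parallel columns.

    Each column gets len(stream) // bs tokens; the leftover tail is
    dropped. This is the usual language-model batching trick, shown
    only to illustrate what an expression like t // 64 computes.
    """
    n = len(stream) // bs                # tokens per column
    trimmed = np.asarray(stream[: n * bs])
    return trimmed.reshape(bs, n)

t = 1000
batches = batchify(np.arange(t), 64)
batches.shape                            # (64, 15), since 1000 // 64 == 15
```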
Yes. The tokenizer should be common for all texts; vocab indices need to be matched manually across texts.
I think Jeremy used the same code as for IMDB.