There’s probably some difference between the vocabularies for Wikipedia and for IMDB. Does this matter?
Is word2vec like pretraining?
Do these pretrained models dictate the architecture and embedding sizes?
I tried this earlier; hmm, it’s finicky.
What is the wt_103 model? Is it AWD-LSTM again?
http://files.fast.ai/robots.txt
does not exist, whatever it is.
… just saying.
That’s considered, and it’s handled by fine-tuning the model. The idea behind transfer learning is that we can reuse much of the knowledge in a previously trained network for a new task; it doesn’t have to fit exactly.
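To make that concrete, here is a minimal sketch of the freeze-then-unfreeze fine-tuning pattern in PyTorch. The encoder below merely stands in for a pretrained network, and every name (Classifier, head, etc.) is made up for illustration; this is not the fastai API.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """A new task-specific head on top of a (stand-in) pretrained encoder."""
    def __init__(self, pretrained_encoder, n_classes):
        super().__init__()
        self.encoder = pretrained_encoder        # the transferred knowledge
        self.head = nn.Linear(64, n_classes)     # new, randomly initialized layer

    def forward(self, x):
        return self.head(self.encoder(x))

# Stand-in for a pretrained network (in the lesson this would be the LM encoder).
encoder = nn.Sequential(nn.Linear(100, 64), nn.ReLU())
model = Classifier(encoder, n_classes=2)

# Phase 1: freeze the transferred weights and train only the new head.
for p in model.encoder.parameters():
    p.requires_grad = False

# Phase 2: unfreeze and fine-tune everything at a low learning rate,
# letting the pretrained weights adapt to the new task.
for p in model.encoder.parameters():
    p.requires_grad = True
```

The point is just that the pretrained weights are a starting point, not a fixed fit: they get adjusted to the new domain during fine-tuning.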
Sorry, I’m still getting this random error after updating my pip install, upon selecting Python (selected at root). Maybe something is wrong with my version of Python for this notebook?
Notebook Validation failed: {u'model_id': u'e3839cad2e23478a84362ed0a931abf1', u'version_minor': 0, u'version_major': 2} is not valid under any of the given schemas:
{
  "model_id": "e3839cad2e23478a84362ed0a931abf1",
  "version_minor": 0,
  "version_major": 2
}
You have to match up the indices manually.
Hey Jeremy,
Hindi Language Models (might) work too!
I understand why people didn’t trust language models to be effective. I work on Hindi texts (about 300M native speakers). I used the Hindi Wikipedia to build a simple language model to get started.
There are no standard Hindi text classification datasets, so I am making two. All the code and language models are available here on GitHub: https://github.com/NirantK/hindi2vec
Shout out to your FitLam work, which made this really easy to do!
If anyone is interested in contributing, please check what I am up to in the README.
Are the weird special tokens (like ‘t_up’) also the same between the two language models? If not, it seems those would have to be relearned and we’d lose all that benefit of pre-training.
Do you know if adding new vocabulary items to an existing language model and fine-tuning is very sensitive to initialization parameters? Why initialize with the mean rather than some mean+random distribution? Does the model have issues confusing multiple new vocab words if they are initialized to the same embedding value?
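For reference, the mean-initialization scheme being discussed looks roughly like this: pretrained rows are copied over for words the old vocab knows, and every new word starts at the row-mean of the pretrained matrix. Function and variable names here are illustrative, not fastai’s API.

```python
import numpy as np

def expand_embeddings(old_emb, old_itos, new_itos):
    """Build an embedding matrix for new_itos from a pretrained one.

    Words present in the old vocab keep their pretrained row; words the
    pretrained model never saw are initialized to the mean of all
    pretrained rows (so yes, several new words start out identical and
    must be pulled apart by fine-tuning).
    """
    row_mean = old_emb.mean(axis=0)
    old_stoi = {w: i for i, w in enumerate(old_itos)}
    new_emb = np.empty((len(new_itos), old_emb.shape[1]), dtype=old_emb.dtype)
    for i, w in enumerate(new_itos):
        j = old_stoi.get(w)
        new_emb[i] = old_emb[j] if j is not None else row_mean
    return new_emb

old_emb = np.array([[1.0, 1.0], [3.0, 3.0]])   # tiny made-up pretrained matrix
new_emb = expand_embeddings(old_emb, ['a', 'b'], ['a', 'c'])
# 'a' keeps its pretrained row; 'c' gets the mean row [2.0, 2.0]
```

Adding noise on top of the mean (mean + small random offset) is a reasonable variant to break the symmetry between new words from the start.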
Not a related question, but can we use deep learning to resolve entities?
Yes Paras, there is some work on entity resolution using seq2seq models and similar approaches.
Please check the coref work from Allen AI here: http://allennlp.org/models
Okay, I don’t get it.
So the word embeddings are pretrained, and the vocab is loaded from a pickle into itos2. Okay.
But what happens to a previously unseen word?
Is the code for the wikitext-103 model available somewhere?
It gets mapped to the <UNK> (unknown) token.
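A minimal sketch of how that fallback is typically implemented: the string-to-index map is a defaultdict whose default is the <UNK> index, so any out-of-vocabulary word silently maps to it. The vocab here is made up for illustration.

```python
import collections

itos = ['<UNK>', '<PAD>', 'the', 'movie']   # index → string
# defaultdict(int) returns 0 for missing keys, and 0 is the <UNK> index
stoi = collections.defaultdict(int, {w: i for i, w in enumerate(itos)})

stoi['movie']     # 3
stoi['zyzzyva']   # 0, i.e. <UNK>
```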
What does ‘t//64’ stand for, after getting the length of all the concatenated texts? Is it associated with a batch size, even though the batch size seems to be 52?
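For what it’s worth, in language-model batching a long token stream of length t is usually split into bs parallel columns of t // bs tokens each, so t // 64 would be the per-column token count if 64 were the split factor (I’m not certain that’s its role in this notebook). A sketch of that standard reshape, with illustrative names:

```python
import numpy as np

def batchify(stream, bs):
    """Split one long token stream into bs parallel columns.

    Each column gets len(stream) // bs tokens; the leftover tail is
    dropped. This is the usual language-model batching trick, shown
    only to illustrate what an expression like t // 64 computes.
    """
    n = len(stream) // bs                # tokens per column
    trimmed = np.asarray(stream[: n * bs])
    return trimmed.reshape(bs, n)

t = 1000
batches = batchify(np.arange(t), 64)
batches.shape                            # (64, 15), since 1000 // 64 == 15
```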
Yes. The tokenizer should be common for all texts; vocab indices need to be matched manually across texts.
I think Jeremy used the same code as for IMDB.