You’re learning an embedding matrix, so once you have that representation of the words in a specific corpus, you can get “topics” by clustering the words. So yes, you could learn topics for your domain in that sense.
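A minimal sketch of that idea (pure Python, made-up 2-D vectors; in practice the matrix would come from the trained language model and you'd likely use scikit-learn's KMeans):

```python
# Toy "embedding matrix": one made-up 2-D vector per word.
vocab = ["cat", "dog", "puppy", "paris", "london", "tokyo"]
emb = [[0.9, 0.1], [1.0, 0.0], [0.95, 0.05],   # animal-ish region
       [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]]   # city-ish region

def dist2(a, b):
    # squared Euclidean distance between two vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(X, centers, iters=10):
    # tiny k-means: assign each row to its nearest center, then
    # recompute each center as the mean of its members
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda j: dist2(x, centers[j]))
                  for x in X]
        for j in range(len(centers)):
            members = [x for x, l in zip(X, labels) if l == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

# deterministic init for the sketch: one seed point from each region
labels = kmeans(emb, [list(emb[0]), list(emb[3])])
topics = {}
for word, lab in zip(vocab, labels):
    topics.setdefault(lab, []).append(word)
# topics -> {0: ['cat', 'dog', 'puppy'], 1: ['paris', 'london', 'tokyo']}
```

Each cluster of nearby embedding rows acts as a rough “topic” for the corpus.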
New question, not specific to NLP: Jeremy just said his best epoch was around 10, on a run that went for 16 epochs. How do you get the model from that 10th epoch? Do you run it all over again and stop it at 10 epochs? And what if you wanted to use cycle_mult, which will take you past 10?
Use the cycle_save_name parameter: it saves the weights at the end of each cycle, so you can load whichever checkpoint was best afterwards.
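A sketch of the idea behind cycle_save_name (the exact fastai 0.7 filename pattern below is an assumption, and the loss numbers are made up):

```python
# With SGDR, weights are checkpointed at the end of each cycle, so
# after training you just pick the cycle with the best validation loss.
cycle_val_losses = {0: 4.71, 1: 4.52, 2: 4.39, 3: 4.45}  # made-up numbers

best_cycle = min(cycle_val_losses, key=cycle_val_losses.get)
print(best_cycle)  # -> 2

# In fastai 0.7 this would look roughly like (filename pattern is an
# assumption):
#   learn.fit(lr, 4, cycle_len=1, cycle_mult=2, cycle_save_name='lm')
#   learn.load(f'lm_cyc_{best_cycle}')
```

This also answers the cycle_mult part: even if later cycles overshoot the best point, the earlier checkpoints are still on disk.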
Who is Jeremy talking about right now? I didn’t catch the name
Sebastian Ruder?
Brilliant! So much to digest!
True… Mind is blown.
Yes, it will be a fun week. And it was amazing to see SoTA results right here, right now!
Why are embedding sizes typically between 50 and 600? With such high cardinality (e.g. a 34,945-word vocabulary), wouldn’t you use a dimension of the same order of magnitude?
To save computation? Maybe it’s a kind of dimensionality reduction?
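Some back-of-the-envelope arithmetic on why the embedding width stays small (a sketch; 400 is just a typical language-model embedding width, not a number from the lecture):

```python
# An embedding matrix has vocab_size * emb_dim weights, so matching
# the width to the vocabulary size would blow up the model.
vocab_size = 34945

params_small = vocab_size * 400         # typical LM embedding width
params_huge = vocab_size * vocab_size   # width on the same order as vocab

print(params_small)  # 13,978,000 weights
print(params_huge)   # ~1.2 billion weights
```

So the low dimension is indeed a compressed (dimensionality-reduced) representation: each word gets a dense vector far smaller than a one-hot encoding of the vocabulary.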
Why does the ratio between the different dropout values matter in NLP? Is there an intuition behind it in the context of this paper?
learn.clip=0.3 is clipping the gradient? I don’t understand how it works. Does it just limit the derivative, i.e. how fast the “ball” rolls down the hill?
How is word2vec different from embeddings?
I guess that’s just how Jeremy represented it in this lecture. It will become clearer in future lessons.
When we have a large lr, clip will help ensure we don’t overshoot the minima?
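A sketch of gradient-norm clipping, which is what learn.clip does in fastai 0.7 as far as I know (the exact mechanism is an assumption; PyTorch provides this as torch.nn.utils.clip_grad_norm):

```python
# If the gradient's norm exceeds the threshold, rescale the whole
# vector: the *direction* of the step is kept, but its size is capped,
# so a large lr can't launch the "ball" off the loss surface.
def clip_grad(grad, max_norm):
    norm = sum(g * g for g in grad) ** 0.5
    if norm > max_norm:
        scale = max_norm / norm
        grad = [g * scale for g in grad]
    return grad

g = clip_grad([3.0, 4.0], max_norm=0.3)  # original norm was 5.0
print(g)  # -> [0.18, 0.24], norm is now exactly 0.3
```

So it doesn’t change where the ball rolls, only caps how far it can move in one step.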
I guess both are embeddings. Instead of using pretrained word2vec embeddings, here we learn our own from scratch.
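To make that concrete, a sketch (toy vectors, all made up): either way an embedding is just a lookup table from word index to vector, and the only difference is where the numbers come from.

```python
import random

vocab = ["the", "cat", "sat"]

# Option 1: pretrained vectors (word2vec/GloVe style), loaded from disk.
pretrained = {"the": [0.1, 0.2], "cat": [0.7, 0.3], "sat": [0.4, 0.9]}
emb_from_w2v = [pretrained[w] for w in vocab]

# Option 2: random init, then learned from scratch along with the
# language model (here we only show the init).
rng = random.Random(0)
emb_scratch = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in vocab]

# Lookup works identically in both cases:
idx = vocab.index("cat")
vec = emb_from_w2v[idx]
print(vec)  # -> [0.7, 0.3]
```

Pretrained vectors just give the table a better starting point; training then adapts it to the corpus.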
Per @yinterian’s question, could we use both, e.g. by merging or concatenating them?
When you use multiple languages, does this mean you have to use multiple models and select between them? Is there a way to use one big model?