Does anyone know how to change the size of vocabulary? The default one as shown is 60,000. Can we change that number?
You can set those in both approaches; the factory methods accept them as kwargs.
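To make the vocab-size setting concrete, here is a minimal pure-Python sketch of what capping the vocabulary at `max_vocab` conceptually does (keep only the most frequent tokens, map everything else to an unknown token). This is an illustrative stand-in, not the fastai implementation; the function name and special tokens are assumptions.

```python
from collections import Counter

def build_vocab(texts, max_vocab=60000, min_freq=2):
    """Keep only the max_vocab most frequent tokens; everything else
    maps to a special 'xxunk' (unknown) token, roughly what
    numericalization with a max_vocab setting does."""
    counts = Counter(tok for t in texts for tok in t.split())
    itos = ['xxunk', 'xxpad']  # special tokens come first
    itos += [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

texts = ["the movie was great", "the movie was terrible", "great acting"]
itos, stoi = build_vocab(texts, max_vocab=4, min_freq=1)
# only the 4 most frequent tokens survive, plus the 2 special tokens
```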
What is the loss measuring in the language model when it is trying to predict the next word?
With training error 3.84 and validation error 3.97, Jeremy declares that it is still under-fitting. We mentioned that when training error is larger than validation error, the model is under-fitting. So how can we tell it is still under-fitting here, when the training error is lower than the validation error?
There is no normalization step here because all the input data is in plain text format?
Backwards? You mean bidirectional?
Thanks. Could you please explain a bit more how that is possible? I assume we fix the vocab and generate the embeddings using their indexes.
Yes sir. A pre-trained model that is trained against wiki103 in the reverse direction (the one we're looking at is trained with a forward pass through the documents).
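To illustrate what "backwards" means here: a backwards language model is an ordinary next-token predictor trained on the token stream reversed, so it effectively learns to predict the preceding word. A minimal sketch (helper names are illustrative):

```python
def lm_pairs(tokens):
    """(input, target) pairs for next-token prediction."""
    return list(zip(tokens[:-1], tokens[1:]))

forward = ['the', 'cat', 'sat', 'on', 'the', 'mat']
backward = list(reversed(forward))

# the forward model learns e.g. ('the', 'cat'): given 'the', predict 'cat'
fwd_pairs = lm_pairs(forward)
# the backward model sees the same text reversed, so it learns
# e.g. ('mat', 'the'): given 'mat', predict the word that came before it
bwd_pairs = lm_pairs(backward)
```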
This is an advanced topic, but the short answer is that new words will have random embeddings at first, and the model will learn them as it trains.
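Here is a conceptual sketch of that idea: when adapting a pretrained embedding matrix to a new corpus vocab, rows for words already in the pretrained vocab are copied over, while new words get randomly initialized rows that fine-tuning will update. This is an illustrative mimic, not fastai's actual code; the function name, dimensions, and init scale are assumptions.

```python
import random

def extend_embeddings(emb, old_itos, new_itos, dim=3, seed=0):
    """Copy pretrained vectors for words already in the old vocab;
    give words that are new in the target corpus (e.g. IMDb-only
    words) freshly initialized random vectors."""
    rng = random.Random(seed)
    old_idx = {w: i for i, w in enumerate(old_itos)}
    new_emb = []
    for w in new_itos:
        if w in old_idx:
            new_emb.append(emb[old_idx[w]])  # reuse pretrained row
        else:
            # random init; training will move this toward a useful vector
            new_emb.append([rng.gauss(0, 0.1) for _ in range(dim)])
    return new_emb

wiki_itos = ['the', 'movie', 'good']
wiki_emb  = [[.1, .2, .3], [.4, .5, .6], [.7, .8, .9]]
imdb_itos = ['the', 'movie', 'rewatchable']  # 'rewatchable' is new
imdb_emb  = extend_embeddings(wiki_emb, wiki_itos, imdb_itos)
```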
Sweet just answered my question
What happens to new words in IMDb? Will they get added to the original language model?
No because they will share the same vocab
Got an error: __init__() got an unexpected keyword argument 'max_vocab'
But I guess max_vocab was from TextDataset and TextLMDataBunch inherits from TextDataBunch.
Could you point out how to set that in the data block API? Also the tokenizer.
How to add a custom tokenizer with datablock API?
What else can we override when we load a saved DataBunch … anything besides batch size?
data_clas = TextClasDataBunch.load(path, 'tmp_clas', bs=50)
You have to look at what is now the preprocessor.
Is it possible to do regression using pre-trained model instead of just classification? How would the databunch look like for that?
What do you recommend for tokenizing/segmenting CJK languages? Isn't that complicated?
Same, look at the processor argument.
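To show what an ordered processor pipeline conceptually looks like (tokenize, then numericalize, each step transforming the items), here is a hedged pure-Python mimic. The class names echo fastai's processors but these are illustrative stand-ins, not the fastai classes; check the fastai docs for the real `processor` argument.

```python
from collections import Counter

class TokenizeProcessor:
    """Conceptual stand-in: split raw text into tokens."""
    def process(self, items):
        return [t.lower().split() for t in items]

class NumericalizeProcessor:
    """Conceptual stand-in: map tokens to ints with a vocab-size cap."""
    def __init__(self, max_vocab=60000):
        self.max_vocab = max_vocab
    def process(self, items):
        counts = Counter(tok for toks in items for tok in toks)
        self.itos = ['xxunk'] + [t for t, _ in counts.most_common(self.max_vocab)]
        stoi = {t: i for i, t in enumerate(self.itos)}
        return [[stoi.get(t, 0) for t in toks] for toks in items]

# the data block idea: pass an ordered list of processors,
# and each one transforms the items in turn
processor = [TokenizeProcessor(), NumericalizeProcessor(max_vocab=100)]
items = ["The movie was great", "The plot was thin"]
for p in processor:
    items = p.process(items)
```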
It is believed that a batch size that is a power of 2 is more effective. However, it also makes sense that a larger batch size that utilizes the GPU better makes training faster. So, how much does having a batch size that is not a power of 2 actually hurt? For example, how do bs=32 and bs=33 compare in training speed?