Does anyone know how to change the size of vocabulary? The default one as shown is 60,000. Can we change that number?
You can set those in both approaches; the factory methods accept them as kwargs.
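To make the vocab-size setting concrete, here is a minimal pure-Python sketch of what capping the vocabulary at `max_vocab` conceptually does (keep only the most frequent tokens, map everything else to an unknown token). This is an illustrative stand-in, not the fastai implementation; the function name and special tokens are assumptions.

```python
from collections import Counter

def build_vocab(texts, max_vocab=60000, min_freq=2):
    """Keep only the max_vocab most frequent tokens; everything else
    maps to a special 'xxunk' (unknown) token, roughly what
    numericalization with a max_vocab setting does."""
    counts = Counter(tok for t in texts for tok in t.split())
    itos = ['xxunk', 'xxpad']  # special tokens come first
    itos += [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

texts = ["the movie was great", "the movie was terrible", "great acting"]
itos, stoi = build_vocab(texts, max_vocab=4, min_freq=1)
# only the 4 most frequent tokens survive, plus the 2 special tokens
```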
What is the loss measuring in the language model when it is trying to predict the next word?
With training error 3.84 and validation error 3.97, Jeremy declares that it is still under-fitting. We mentioned that when training error is larger than validation error, the model is under-fitting. So how can we tell it is still under-fitting here, when the training error is lower than the validation error?
There is no normalization step here because all the input data is in plain text format?
Backwards? You mean bidirectional?
Thanks. Could you please explain a bit more how that is possible? I assume we fix the vocab and generate the embeddings using their indexes.
Yes sir. A pre-trained model that is trained against wiki103 in the reverse direction (the one we're looking at is trained with a forward pass through the documents).
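To illustrate what "backwards" means here: a backwards language model is an ordinary next-token predictor trained on the token stream reversed, so it effectively learns to predict the preceding word. A minimal sketch (helper names are illustrative):

```python
def lm_pairs(tokens):
    """(input, target) pairs for next-token prediction."""
    return list(zip(tokens[:-1], tokens[1:]))

forward = ['the', 'cat', 'sat', 'on', 'the', 'mat']
backward = list(reversed(forward))

# the forward model learns e.g. ('the', 'cat'): given 'the', predict 'cat'
fwd_pairs = lm_pairs(forward)
# the backward model sees the same text reversed, so it learns
# e.g. ('mat', 'the'): given 'mat', predict the word that came before it
bwd_pairs = lm_pairs(backward)
```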
This is an advanced topic, but the short answer is that new words will have random embeddings at first, and the model will learn them as it trains.
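Here is a conceptual sketch of that idea: when adapting a pretrained embedding matrix to a new corpus vocab, rows for words already in the pretrained vocab are copied over, while new words get randomly initialized rows that fine-tuning will update. This is an illustrative mimic, not fastai's actual code; the function name, dimensions, and init scale are assumptions.

```python
import random

def extend_embeddings(emb, old_itos, new_itos, dim=3, seed=0):
    """Copy pretrained vectors for words already in the old vocab;
    give words that are new in the target corpus (e.g. IMDb-only
    words) freshly initialized random vectors."""
    rng = random.Random(seed)
    old_idx = {w: i for i, w in enumerate(old_itos)}
    new_emb = []
    for w in new_itos:
        if w in old_idx:
            new_emb.append(emb[old_idx[w]])  # reuse pretrained row
        else:
            # random init; training will move this toward a useful vector
            new_emb.append([rng.gauss(0, 0.1) for _ in range(dim)])
    return new_emb

wiki_itos = ['the', 'movie', 'good']
wiki_emb  = [[.1, .2, .3], [.4, .5, .6], [.7, .8, .9]]
imdb_itos = ['the', 'movie', 'rewatchable']  # 'rewatchable' is new
imdb_emb  = extend_embeddings(wiki_emb, wiki_itos, imdb_itos)
```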
Sweet just answered my question
What happens to new words in IMDb? Will they get added to the original language model?
No because they will share the same vocab
Got an error: __init__() got an unexpected keyword argument 'max_vocab'
But I guess max_vocab was from TextDataset and TextLMDataBunch inherits from TextDataBunch.
Could you point out how to set that in the data block API? Also the tokenizer.
How to add a custom tokenizer with datablock API?
What else can we override when we load a saved DataBunch … anything besides batch size?
data_clas = TextClasDataBunch.load(path, 'tmp_clas', bs=50)
You have to look at what is now the preprocessor.
Is it possible to do regression using pre-trained model instead of just classification? How would the databunch look like for that?
What do you recommend for tokenizing/segmenting CJK languages? Isn't that complicated?
Same, look at the processor argument.
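To show what an ordered processor pipeline conceptually looks like (tokenize, then numericalize, each step transforming the items), here is a hedged pure-Python mimic. The class names echo fastai's processors but these are illustrative stand-ins, not the fastai classes; check the fastai docs for the real `processor` argument.

```python
from collections import Counter

class TokenizeProcessor:
    """Conceptual stand-in: split raw text into tokens."""
    def process(self, items):
        return [t.lower().split() for t in items]

class NumericalizeProcessor:
    """Conceptual stand-in: map tokens to ints with a vocab-size cap."""
    def __init__(self, max_vocab=60000):
        self.max_vocab = max_vocab
    def process(self, items):
        counts = Counter(tok for toks in items for tok in toks)
        self.itos = ['xxunk'] + [t for t, _ in counts.most_common(self.max_vocab)]
        stoi = {t: i for i, t in enumerate(self.itos)}
        return [[stoi.get(t, 0) for t in toks] for toks in items]

# the data block idea: pass an ordered list of processors,
# and each one transforms the items in turn
processor = [TokenizeProcessor(), NumericalizeProcessor(max_vocab=100)]
items = ["The movie was great", "The plot was thin"]
for p in processor:
    items = p.process(items)
```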
It is believed that a batch size that is a power of 2 is more effective. However, it also makes sense that a larger batch size that utilizes the GPU better makes training faster. So, how much does having a batch size that is not a power of 2 actually hurt? For example, how do bs=32 and bs=33 compare in training speed?