I wonder why, after engineering all these ‘Before’ and ‘After’ features, only ‘AfterStateHoliday’ and ‘BeforeStateHoliday’ are used in training; they are listed among the “contin_vars”. But nowhere can I see the ‘AfterSchoolHoliday’, ‘BeforeSchoolHoliday’, ‘AfterPromo’ and ‘BeforePromo’ features. Am I missing something? Also, why are ‘StateHoliday_fw’ and ‘StateHoliday_bw’ listed among “cat_vars”, while ‘AfterStateHoliday’ and ‘BeforeStateHoliday’ are listed among “contin_vars”? Does anyone have any thoughts about this?
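I think Jeremy took the log of the y’s so that RMSE could be used as the loss function. Because log(a/b) = log(a) - log(b), the error in log space, log(ŷ) - log(y) = log(ŷ/y), depends only on the ratio of prediction to target. So for small errors, the RMSE of log(y) will be approximately the same as the RMSPE of y.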
Please let me know if I am wrong.
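A quick numeric check of this idea, as a minimal sketch: the sales figures below are made up for illustration, and the helper names `rmse_of_logs` and `rmspe` are hypothetical, not taken from the notebook. It suggests the two metrics agree closely when predictions are within a few percent of the targets, and drift apart when the errors are large.

```python
import math

def rmse_of_logs(y_true, y_pred):
    """RMSE computed on log-transformed targets and predictions."""
    sq = [(math.log(p) - math.log(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

def rmspe(y_true, y_pred):
    """Root mean squared percentage error on the raw values."""
    sq = [((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

# Made-up values: every prediction is within 5% of its target.
y_true = [100.0, 250.0, 80.0, 400.0]
y_pred = [105.0, 240.0, 82.0, 390.0]
small_log = rmse_of_logs(y_true, y_pred)
small_pct = rmspe(y_true, y_pred)
print("small errors:", small_log, small_pct)

# A single prediction that is off by a factor of two.
big_log = rmse_of_logs([100.0], [200.0])  # log(200) - log(100), i.e. about log(2)
big_pct = rmspe([100.0], [200.0])         # |100 - 200| / 100
print("large error: ", big_log, big_pct)
```

So the log trick looks like a good approximation of RMSPE rather than an exact identity.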
I was playing around with the notebook from lesson 4 and got some test output with this code (I used n1 for the unsqueezing to get a 2D input):
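print(ss,"\n")
for i in range(50):
    # topk returns (values, indices); keep the indices of the top two tokens
    n1 = res[-1].topk(2)[1]
    # skip token 0 and fall back to the second-best token
    n2 = n1[1] if n1.data[0]==0 else n1[0]
    print(TEXT.vocab.itos[n2.data[0]], end=' ')
    res,*_ = m(n1.unsqueeze(0))
print('...')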
The output looks similar to this:
. So, it wasn't quite was I was expecting, but I really liked it anyway! The best

movie a worst movie i i me movie worth good . but i is not shame of <eos> do not that 's a a worst movie ever 've ever in but i 's n't good see see . anyone of people of <eos> 'm recommend recommend recommend to it movie ...
(The output includes this error message: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number.)
I’m not sure if this is correct, because the output is only barely sensible and the spaces are placed incorrectly after the punctuation symbols.
Maybe the line with the unsqueezing needs to be adapted so that it takes 1D data?
Maybe I’ll try that later…
Hi, I have a question regarding the dropout rate. For example, suppose we set the dropout rate to 0.5 and we have 100 nodes. Is the number of randomly dropped nodes exactly 50 for each batch/iteration, or is the dropout probability applied to each node separately, so that the number of dropped nodes is sometimes more and sometimes less than 50? If the latter is the case, are the activations multiplied by exactly 2.0 during training, regardless of the actual number of dropped nodes?
By nodes I believe you are referring to the activation units in each layer. With dropout = 0.5 and 100 activation units in that layer, the outputs of roughly 50 of the units are used during training. Since information is lost because of this dropout, the PyTorch library internally scales the outputs of the remaining activation units (if dropout = 0.5, the outputs are doubled) to preserve the average output of the layer. The dropout mask is different for every layer and every mini-batch…
Suppose by chance 60 out of 100 units are dropped out (the probability calculated from binomial distribution is about 1% when dropout = 0.5), and only 40 are used in a given iteration, why is the scaling factor set to 2.0 = 1/0.5 rather than 2.5 = 100/40?
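To make the numbers in this exchange concrete, here is a minimal pure-Python sketch of per-unit ("inverted") dropout. It is not PyTorch's implementation; the `dropout` helper, the seed and the number of trials are assumptions chosen for illustration. It checks the "about 1%" binomial figure, shows that the number of kept units varies from pass to pass, and shows that the fixed factor 1/(1-p) keeps the layer's mean output unchanged on average across many passes, even though it is not exact in any single pass.

```python
import math
import random

p = 0.5          # dropout probability
n_units = 100    # activation units in the layer

# Probability that exactly 60 of 100 units are dropped when p = 0.5.
prob_60 = math.comb(n_units, 60) * p**60 * (1 - p)**40

def dropout(activations, p, rng):
    """Drop each unit independently with probability p, scale survivors by 1/(1-p)."""
    scale = 1.0 / (1.0 - p)
    keep = [rng.random() >= p for _ in activations]
    out = [a * scale if k else 0.0 for a, k in zip(activations, keep)]
    return out, sum(keep)

rng = random.Random(0)           # fixed seed, only for reproducibility
activations = [1.0] * n_units    # toy layer whose true mean output is 1.0

kept_counts, means = [], []
for _ in range(2000):            # 2000 simulated forward passes
    out, kept = dropout(activations, p, rng)
    kept_counts.append(kept)
    means.append(sum(out) / n_units)

avg_mean = sum(means) / len(means)
print("P(exactly 60 dropped):", prob_60)
print("kept units per pass: min", min(kept_counts), "max", max(kept_counts))
print("mean output averaged over passes:", avg_mean)
```

With 40 units kept, a factor of 2.5 would restore the mean for that one pass, but it would require counting the survivors each time. The fixed factor only needs p and is correct in expectation.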
So when we first predict with our language model, we run the following:
# Set batch size to 1
m.bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m.bs=bs
We set the batch size to 1, set the model to eval, reset the hidden state, get a prediction, then set the batch size back to what it was.
Later on we run:
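print(ss,"\n")
for i in range(500):
    # topk returns (values, indices); keep the indices of the top two tokens
    n = res[-1].topk(2)[1]
    # skip token 0 and fall back to the second-best token
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    res,*_ = m(n[0].unsqueeze(0))
print('...')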
In this cell we don’t do anything with the batch size or reset the hidden state. When is it necessary to do these things? How does resetting the hidden state or changing the batch size affect prediction?
Also in the same section, we tokenize and predict on the sentence:
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
When tokenized, this sentence has 21 tokens. When we run that through our model, the output res is a 21 × vocab-size matrix. To look at the next predicted word, we take res[-1], the last row, corresponding to the final word in the input sentence. Are the values in res[-1] affected by the previous rows? Or to put it another way, how does predicting on the entire sentence and taking the last row compare to just predicting on the last word?
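A toy illustration of what is at stake in these questions, as a sketch only: `ToyRNN` below is a hypothetical single-unit recurrent cell with made-up weights, not the fastai language model. In a recurrent model the output at each step depends on the hidden state built up from earlier tokens, so the last row after feeding a whole sequence generally differs from the output for the last token alone after a reset. Feeding tokens one at a time without resetting continues from the stored state.

```python
import math

class ToyRNN:
    """Hypothetical one-unit RNN with a persistent hidden state (illustration only)."""

    def __init__(self, w_in=0.5, w_hid=0.9):
        self.w_in, self.w_hid = w_in, w_hid
        self.h = 0.0

    def reset(self):
        # Forget everything seen so far.
        self.h = 0.0

    def __call__(self, xs):
        outs = []
        for x in xs:
            # The new state mixes the current input with the previous state.
            self.h = math.tanh(self.w_in * x + self.w_hid * self.h)
            outs.append(self.h)
        return outs

m = ToyRNN()
seq = [1.0, -2.0, 0.5, 3.0]   # stand-in for a tokenized sentence

m.reset()
full_last = m(seq)[-1]        # last output after seeing the whole sequence

m.reset()
only_last = m(seq[-1:])[-1]   # last token alone, starting from a fresh state

m.reset()
m(seq[:-1])                   # feed the prefix, keep the state (no reset)
continued = m(seq[-1:])[-1]   # then feed only the last token

print(full_last, only_last, continued)
```

In this toy, `continued` matches `full_last` because the state carries the prefix forward, while `only_last` does not. That would be consistent with resetting once before a new, unrelated input and not resetting inside a generation loop that feeds back one token at a time.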
At the 1:50:00 mark in the video there’s a matrix drawn for our “split” … What I don’t understand is how we got 10 million for the Y axis. I thought we were just splitting it (from 64 million), so that would make it 1 million… Or are we multiplying / augmenting everything by 10 (640 million)?
How can I use ColumnarModelData for logistic regression?