Lesson 4 In-Class Discussion

(Walter Vanzella) #288

Thank you Lucas,
Yes, I applied the RELU activation function.
The activation is A(layer_i) = sum( RELU( A(layer_i-1) ) * w ). I hope this is correct.
But w is positive or negative with equal probability (maybe this assumption is wrong).
So on average the result of the sum does not change. We never do sum(activations), we do sum(activations*w).

(Lucas Goulart Vazquez) #289

For calculating the mean of the activations of layer L we do sum(activations[L]) / #Neurons, and in my mind (and of course I could be wrong) that's what changes when we apply dropout.

(Walter Vanzella) #290

Yes, true.
I was referring to the activation of a single neuron of a layer with respect to the previous afferent activations.
I think that this activation does not change (if my assumptions are correct).
But the average layer activation changes as you said.
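
To make the two viewpoints above concrete, here is a small simulation sketch. It assumes standard-normal inputs to the ReLU, weights drawn symmetrically around zero, and dropout applied without any rescaling; none of these details come from the lesson, and the function name `simulate` is made up for illustration.

```python
import random

def relu(x):
    return max(0.0, x)

def simulate(p_drop, n_prev=200, n_trials=1000, seed=0):
    """Return (mean activation of the previous layer,
               mean weighted sum arriving at one neuron of the next layer)."""
    rng = random.Random(seed)
    act_total = 0.0
    sum_total = 0.0
    for _ in range(n_trials):
        weighted_sum = 0.0
        for _ in range(n_prev):
            a = relu(rng.gauss(0.0, 1.0))   # activation after ReLU (assumed input distribution)
            if rng.random() < p_drop:
                a = 0.0                     # dropout, deliberately without rescaling
            w = rng.gauss(0.0, 0.1)         # weight symmetric around zero (assumption)
            act_total += a
            weighted_sum += a * w
        sum_total += weighted_sum
    return act_total / (n_trials * n_prev), sum_total / n_trials

layer_mean_keep, neuron_mean_keep = simulate(p_drop=0.0)
layer_mean_drop, neuron_mean_drop = simulate(p_drop=0.5)

print("mean layer activation:", layer_mean_keep, "->", layer_mean_drop)
print("mean weighted sum    :", neuron_mean_keep, "->", neuron_mean_drop)
```

Under these assumptions the mean layer activation roughly halves with 50% dropout, while the mean weighted sum into a single neuron stays close to zero in both cases. Dropout implementations commonly rescale the surviving activations by 1/(1-p) during training, which keeps the layer mean unchanged in expectation.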

(Hiromi Suenaga) #291

I have a question about filling in `NA` in the Rossmann notebook.

In the notebook, it states that

many models have problems when missing values are present, so it's always important to think about how to deal with them. In these cases, we are picking an arbitrary signal value that doesn't otherwise appear in the data.

And the code following that looks like:

```python
joined.CompetitionOpenSinceYear = joined.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
joined.CompetitionOpenSinceMonth = joined.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
joined.Promo2SinceYear = joined.Promo2SinceYear.fillna(1900).astype(np.int32)
joined.Promo2SinceWeek = joined.Promo2SinceWeek.fillna(1).astype(np.int32)
```

By looking at the initial data exploration, these values do appear in the data (i.e. minimum of CompetitionOpenSinceYear is 1900 etc).

How do they work as signal values when there are some rows with these values in the original data?

(James Requa) #292

I used the new trick we learned in lesson 4 to check the model layers simply by calling `learn`, and was surprised to see that even for binary classification the fastai models create a 2-class output with a Softmax activation function instead of a single binary output with Sigmoid.

```
(16): Linear (512 -> 2)
(17): LogSoftmax ()
```

My labels are all 0 or 1, so I'm not sure how else to tell the model that it should be using Sigmoid. Or am I just not understanding this correctly, and it's OK to use Softmax for binary classification problems as long as you treat it as 2 classes?

(Lucas Goulart Vazquez) #293

I would say it's okay to use softmax. And just a nice observation: if you do the math, you will see that the cross-entropy loss used with the softmax function simplifies to the same thing as the logistic loss ^^
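
A quick numerical check of that equivalence, as a sketch: a 2-class softmax over logits (z0, z1) gives P(class 1) = sigmoid(z1 - z0), so the two losses should agree. The function names and the example logits are made up for illustration.

```python
import math

def softmax_cross_entropy(z0, z1, label):
    """Cross-entropy of a 2-class softmax over the logits (z0, z1)."""
    m = max(z0, z1)  # subtract the max for numerical stability
    log_norm = m + math.log(math.exp(z0 - m) + math.exp(z1 - m))
    return log_norm - (z1 if label == 1 else z0)

def logistic_loss(z, label):
    """Binary cross-entropy of sigmoid(z) against a 0/1 label."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -math.log(p) if label == 1 else -math.log(1.0 - p)

for z0, z1 in [(0.3, 1.2), (-2.0, 0.5), (1.5, -0.7)]:
    for label in (0, 1):
        ce = softmax_cross_entropy(z0, z1, label)
        ll = logistic_loss(z1 - z0, label)  # a single logit: the difference of the two
        print(f"logits=({z0}, {z1}) label={label}  softmax CE={ce:.6f}  logistic={ll:.6f}")
```

The only thing the sigmoid version needs is the difference of the two logits, which is why a 2-output softmax head and a 1-output sigmoid head can represent the same predictions.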

(James Requa) #294

Thanks! That's an interesting observation… Also, because of the 2 classes, I've just been taking the logs that correspond to the "1" label, since in my case the binary classification is the probability of whether or not some "thing" exists, not the probability that it doesn't exist.

(Jordan) #295

Could one pre-train zip code weights like you did the language model? A naive version could simply ensure that zip codes physically close together by distance or transit time have embeddings near each other.

Though it seems ideally one could train embeddings from travel patterns like what you'd get from credit card transactions, no?

(Brendan Herger) #296

I think it is the plural of `p`, not an acronym. If I remember correctly, `ps` provides the probability for the dropout layers, either as a constant (e.g. `p=.5`), or a list with one p for each dropout layer (e.g. `p=[.2, .5, .7]`) for 3 dropout layers.

I believe the loss function also tells you what direction to adjust the weights of each parameter (gradient descent), so they can be adjusted appropriately, whereas the accuracy is just a score (ok, the model is not so great, but then what?).
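
A toy illustration of that point, assuming a one-weight logistic model and a five-point dataset invented for this example: nudging the weight changes the loss smoothly, so its slope says which way to move, while the accuracy stays flat.

```python
import math

# Hypothetical data: the last point sits on the "wrong" side for any positive weight.
xs = [-2.0, -1.0, 1.0, 2.0, -0.5]
ys = [0, 0, 1, 1, 1]

def predict(w, x):
    return 1.0 / (1.0 + math.exp(-w * x))

def loss(w):
    """Mean binary cross-entropy of the one-weight model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(w, x)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(xs)

def accuracy(w):
    hits = sum((predict(w, x) > 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

def slope(f, w, eps=1e-4):
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.5
loss_slope = slope(loss, w)
acc_slope = slope(accuracy, w)
print("accuracy:", accuracy(w), "slope of accuracy:", acc_slope)
print("loss    :", loss(w), "slope of loss    :", loss_slope)
```

Here the loss slope is negative, so increasing the weight lowers the loss, whereas the accuracy slope is zero and gives no hint about which way to adjust the weight.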

(Suvash) #298

Should be OK, as I think we used it for Cats vs Dogs (assuming here, since we get multiclass predictions as outputs prob_dog and prob_cat, but used prob_dog for Kaggle submissions; and because of softmax it's likely that prob_cat = 1 - prob_dog; or maybe the log of that, actually).
The nice thing about softmax seems to be that we might not have to think about or tune the "threshold" as we'd have to do with sigmoidal outputs. But, once again, I'm just assuming here. Maybe @yinterian can clarify this better.
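
As a sketch of how the two outputs relate: a `LogSoftmax` layer emits log-probabilities, so exponentiating them recovers prob_cat and prob_dog, which sum to 1. The logits below are invented, and `log_softmax` is a plain-Python stand-in rather than the fastai or PyTorch function.

```python
import math

def log_softmax(logits):
    """Plain-Python log-softmax over a list of logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]

log_probs = log_softmax([0.4, 1.9])           # hypothetical (cat, dog) logits
prob_cat, prob_dog = (math.exp(lp) for lp in log_probs)

print("prob_cat:", prob_cat, "prob_dog:", prob_dog)
print("sum     :", prob_cat + prob_dog)

# Picking the larger of the two outputs is the same decision as prob_dog > 0.5.
argmax_is_dog = log_probs[1] > log_probs[0]
threshold_is_dog = prob_dog > 0.5
```

So with two classes, taking the argmax amounts to a fixed 0.5 threshold on prob_dog; a different threshold can still be applied to prob_dog if the application calls for it.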

(Suvash) #299

Woah ! This was definitely one of the densest lectures so far. Embeddings are so neat. All that space wasted by one-hot encoded matrices finally getting reclaimed.

So much new stuff, will have to go back and watch the whole lecture again. Also, @jeremy for beating the state of the art like "no big deal" ?

I guess we are halfway in the course now, right ? I'm already experiencing denial, not wanting this to end.

(Jeremy Howard) #300

Yes! You should pre-train them by making them embeddings in some different model - e.g. learning some different task. Just like pre-training on a language model and using for classification.

(Jeremy Howard) #301

We know in this case that the stores weren't actually open at that time, so we assume that it was already being used as a signal value. IIRC this is an assumption the original developers of this solution made.
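
One way to check this kind of assumption on your own data, sketched with pandas on a made-up toy frame standing in for `joined`: count how often the signal value already appears before filling, and optionally keep a flag column recording which rows were missing. The `_was_missing` column is an extra added here for illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical toy data; the real `joined` frame comes from the Rossmann notebook.
joined = pd.DataFrame(
    {"CompetitionOpenSinceYear": [2008.0, None, 1900.0, 2013.0, None]}
)

col = "CompetitionOpenSinceYear"
sentinel = 1900

was_missing = joined[col].isna()
already_present = int((joined[col] == sentinel).sum())
print(f"{already_present} row(s) already contain {sentinel} before filling")

# Optional: remember which rows were missing so the information is not lost
# when the sentinel collides with a value that is already in the data.
joined[col + "_was_missing"] = was_missing
joined[col] = joined[col].fillna(sentinel).astype("int32")
print(joined)
```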

(Hiromi Suenaga) #302

Got it! Thank you

We are doing a fastai study group in Omaha, NE this evening. @KevinB says hi!

(Jeremy Howard) #303

That's so cool! Learn lots!

#304

What do you guys think is the best way to deal with a high class imbalance? I am working on the diabetic retinopathy classification problem from https://www.kaggle.com/c/diabetic-retinopathy-detection. The class distribution is approximately (class1: 73%, class2: 3%, class3: 14%, class4: 2%, class5: 2%). Should I follow the template provided for dog breed classification, or should I drop some examples from the dominant class so that all the classes have similar representation? Thanks

(Jeremy Howard) #305

You can replicate the rare classes to make them more balanced. Never throw away data!

Note that if you double the number of class5 rows (for instance), your probabilities for that class will end up too high, so you'll need to reduce them at the end.
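
A minimal sketch of both steps, assuming each rare class is replicated by an integer factor and that the predicted probabilities are corrected afterwards by dividing each class by its factor and renormalising. That correction is one common choice rather than necessarily the one meant above, and the helper names and toy rows are hypothetical.

```python
import random

def oversample(rows, label_of, factors, seed=0):
    """Replicate each row `factors[label]` times (default: keep one copy)."""
    out = []
    for row in rows:
        out.extend([row] * factors.get(label_of(row), 1))
    random.Random(seed).shuffle(out)
    return out

def correct_probs(probs, factors):
    """Undo the oversampling: divide each class probability by its
    replication factor, then renormalise so the result sums to 1."""
    scaled = {c: p / factors.get(c, 1) for c, p in probs.items()}
    total = sum(scaled.values())
    return {c: s / total for c, s in scaled.items()}

# Toy data: 8 rows of the dominant class, 2 rows of a rare class.
rows = [(f"img{i}", "class1") for i in range(8)] + [("img8", "class5"), ("img9", "class5")]
factors = {"class5": 2}  # double the rare class

balanced = oversample(rows, label_of=lambda r: r[1], factors=factors)
n_class5 = sum(1 for r in balanced if r[1] == "class5")
print(len(balanced), "rows after oversampling,", n_class5, "of them class5")

# Hypothetical prediction from a model trained on the oversampled data.
corrected = correct_probs({"class1": 0.6, "class5": 0.4}, factors)
print(corrected)
```

No original rows are discarded; the rare class is only repeated, and the final probabilities are pulled back toward the original class frequencies.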

(Rikiya Yamashita) #306

@jeremy I'd be glad if you could reply to this, though it may be a silly question. Thanks!

(Jeremy Howard) #307

Generally you'll want categorical, as long as there aren't too many levels. Categorical variables give the model more flexibility.