Lesson 2 In-Class Discussion

I’m thinking it’s because you want to be on the safe side: a point a little to the left gives you a slightly slower but safer learning rate, and you’ll still eventually get there?

Are there augmentations that work well for non-image data, e.g. time series or text?

9 Likes

Learning rates affect the rate at which optimizers (like SGD, Adam, RMSProp, etc.) converge to the bottom of the error surface, and optimizers are agnostic to the architecture of your neural network. That means they won’t differentiate between a classic NN, a CNN, or an RNN; they only know how to reduce the error using backpropagation.

So yes, the learning rate finder can be used for all NN architectures.

6 Likes
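To make that concrete, here’s a minimal, framework-agnostic sketch of the LR range test idea (the `lr_range_test` name and the toy quadratic loss are mine, just for illustration — this is not fastai’s actual `lr_find` code): grow the learning rate exponentially over a few steps and record the loss at each one.

```python
import numpy as np

def lr_range_test(grad_fn, w0, lr_min=1e-5, lr_max=10.0, steps=100):
    """Take SGD steps while growing the learning rate exponentially,
    recording the loss at each step (the idea behind the LR finder)."""
    lrs = np.geomspace(lr_min, lr_max, steps)
    w = w0
    losses = []
    for lr in lrs:
        loss, grad = grad_fn(w)   # evaluate loss and gradient at current w
        losses.append(loss)
        w = w - lr * grad         # one SGD step at this learning rate
    return lrs, np.array(losses)

# Toy convex loss L(w) = (w - 3)^2, so grad = 2(w - 3)
grad_fn = lambda w: ((w - 3.0) ** 2, 2.0 * (w - 3.0))
lrs, losses = lr_range_test(grad_fn, w0=0.0)
# The loss falls while the LR is reasonable, then blows up once it is too large
```

Plotting `losses` against `lrs` on a log scale gives you the same kind of curve the finder shows, whatever the architecture.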

In NLP there’s a paper where they replace named entities with other named entities to augment the data.

1 Like

@jeremy does the code in ConvLearner include those conv layers (for which we need to code the filters), max pooling, normalization and so on?

Do you have a link or the title of the paper?

1 Like

@jeremy (or anyone who knows) On the topic of learning rates… In this example we’re training in just a few epochs, but I can imagine that we’d sometimes encounter training over hundreds or even thousands of epochs. If we have that situation how often in your experience does this method work? Do we need to do this at a number of points over time in order to compute the different learning rates necessary across the different epochs?

What is the name of the reference paper?

It does. You can refer to this post: precompute=True to see what is deleted and what is added by ConvLearner.

1 Like

I’d love to know how people use data augmentation for time series? Is it just adding noise or something else?

8 Likes
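Adding noise (jittering) is indeed one common approach; random scaling and window slicing are others. A minimal numpy sketch (the function names are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(series, sigma=0.03):
    """Add small Gaussian noise to each time step."""
    return series + rng.normal(0.0, sigma, size=series.shape)

def scale(series, sigma=0.1):
    """Multiply the whole series by a random factor near 1."""
    return series * rng.normal(1.0, sigma)

def window_slice(series, ratio=0.9):
    """Crop a random contiguous window, keeping `ratio` of the length."""
    n = len(series)
    keep = int(n * ratio)
    start = rng.integers(0, n - keep + 1)
    return series[start:start + keep]

x = np.sin(np.linspace(0, 10, 200))
augmented = [jitter(x), scale(x), window_slice(x)]
```

Which of these preserve the label depends on the task, so they need the same care as choosing image transforms.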

The cyclical learning rate paper discusses setting a range (min and max) and oscillating between those values. Is there a reason why in this case we only pick one value?
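For reference, the triangular schedule from that paper (Smith, “Cyclical Learning Rates for Training Neural Networks”) can be sketched in a few lines; this is my own minimal version, not library code:

```python
import math

def triangular_lr(step, stepsize, lr_min, lr_max):
    """Triangular cyclical LR: ramp linearly from lr_min up to lr_max
    and back down, over a full cycle of 2 * stepsize steps."""
    cycle = math.floor(1 + step / (2 * stepsize))
    x = abs(step / stepsize - 2 * cycle + 1)   # 1 at cycle edges, 0 at the peak
    return lr_min + (lr_max - lr_min) * max(0.0, 1 - x)

# One full cycle with stepsize=4: 0.001 -> 0.01 -> 0.001
sched = [triangular_lr(s, 4, 0.001, 0.01) for s in range(9)]
```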

Wouldn’t adding noise defeat the purpose of increasing our accuracy?

Jeremy will explain this part later.

ok, thanks.

I think it’s: Adversarial Examples for Evaluating Reading Comprehension Systems
I can’t remember if this is exactly the paper.

But the topic was finding adversarial examples in NLP; hope it helps.

2 Likes

@anandsaha alright. Thanks!

How do you do data augmentation with NLP?

4 Likes
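One very simple trick people try is randomly dropping words (alongside synonym replacement and back-translation). A toy sketch, purely illustrative:

```python
import random

def word_dropout(sentence, p=0.1, rng=None):
    """Randomly drop each word with probability p; if everything would
    be dropped, fall back to the original sentence."""
    rng = rng or random.Random(0)
    words = sentence.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept or words)

word_dropout("the cat sat on the mat", p=0.2)
```

As with time series, the risk is that the augmented sentence no longer means the same thing, which is why NLP augmentation is harder than flipping images.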

It’s because you want the learning rate with the fastest loss improvement. So if you were to take the derivative of that plot, the peak might be a good place, and it would be roughly in the middle. However, you want as high a learning rate as possible, because that means you “move faster” towards the minima. So it’s a trade-off between a point where the loss is changing fast and a point where the steps are big. I think that’s why Jeremy chooses a point on that curve where the gradient is still good but the learning rate is still fairly high.

4 Likes
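That trade-off can be automated crudely: pick the learning rate where the loss is dropping fastest against log(lr), ignoring the divergent tail. A hypothetical sketch on a synthetic finder curve (none of this is fastai’s actual heuristic):

```python
import numpy as np

def suggest_lr(lrs, losses, skip_end=5):
    """Pick the LR where the loss falls steepest vs. log(lr),
    skipping the last few points where training diverges."""
    lrs = np.asarray(lrs)[:-skip_end]
    losses = np.asarray(losses)[:-skip_end]
    slopes = np.gradient(losses, np.log(lrs))  # d(loss)/d(log lr)
    return lrs[np.argmin(slopes)]

# Synthetic finder curve: flat, then a steep drop near lr=1e-3, then a blow-up
lrs = np.geomspace(1e-5, 1.0, 50)
losses = 1.0 + 1.0 / (1.0 + np.exp((np.log10(lrs) + 3.0) * 4.0)) + 3.0 * lrs
```

On this fake curve the steepest drop sits near 1e-3; in practice you might still back off a bit from that point, for the safety-margin reason discussed above.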

Is data augmentation possible for a single-channel image?

Yes, you can do it on a black-and-white image.

1 Like
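For example, flips and rotations work the same on a single-channel `(H, W)` array as on an RGB one; a minimal numpy sketch (my own helper, not library code):

```python
import numpy as np

def augment_grayscale(img, rng):
    """Random flip / 90-degree rotation for a single-channel (H, W)
    image; the same transforms work for any number of channels."""
    if rng.random() < 0.5:
        img = np.fliplr(img)                   # horizontal flip
    img = np.rot90(img, k=rng.integers(0, 4))  # random multiple-of-90 rotation
    return img

rng = np.random.default_rng(42)
img = rng.random((28, 28))       # e.g. an MNIST-sized grayscale image
aug = augment_grayscale(img, rng)
```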