I don’t think I really understand why it’s a good idea to use y_range / sigmoid_range on a regression model.

Why not just let the model’s weights and biases learn to predict things in that same range by itself?

And why use a sigmoid curve in cases where what we care about is not inherently “curved”? I.e. there’s nothing so special about the edges of the y distribution that we should devote so much x space to them.
It’s one thing if the regression is a probability of something: we might care quite a bit about the difference between a 99% and a 99.99% chance of something happening, so it makes sense to give the model a lot of room to output those differences. But if the regression is just a spatial coordinate (e.g. finding the center of a face in an image), then the difference between 99 pixels and 99.99 pixels is no more important than the difference between 44 pixels and 44.99 pixels, so why devote so much of the model’s output range to it?

@ysaxon
Suppose your data is supposed to lie in a particular range, say the y values are supposed to be between -5 and 5. You would want your model to set its weights such that the output always lies in that range. This means a lot of weights need to be devoted just to scaling the data, rather than learning from the data (this might sound weird, but this is really what happens). If this intuitive explanation doesn’t satisfy you, let me dig into some mathematics, if you will.

This has something to do with the variance of the data, i.e., how much the data is spread in space. Ideally, the larger the variance, the easier it is for a Neural Network to predict a particular value. As you would know, Neural Networks are essentially multiple matrix multiplications and additions, with a few non-linear functions in between. It so happens that even a slight change in matrix values, especially in the earlier layers, can lead to a large change in the output. So just to compensate for this, the model would have to dedicate weights in the subsequent layers to normalize the values. Now you can appreciate that if the model were left free to learn on data that was well spread, it needn’t dedicate those weights to normalizing.
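A toy illustration of that amplification, using a made-up two-layer network in numpy (the sizes and perturbation scale are arbitrary, just for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer "network": y = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(64, 10))   # first-layer weights
W2 = rng.normal(size=(1, 64))    # second-layer weights
x = rng.normal(size=(10,))       # one input example

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

base = forward(W1, W2, x)

# Perturb only the *first* layer, and only slightly; the change
# propagates through the later matrix multiply and shows up in the output.
W1_perturbed = W1 + 0.01 * rng.normal(size=W1.shape)
shifted = forward(W1_perturbed, W2, x)

print(base, shifted, np.abs(shifted - base))
```

Without a final squashing layer, the network itself would have to tune later weights to keep the output in the target range despite this kind of sensitivity.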

The sigmoid function acts as our normalizing layer, so everything falls into place. And it has shown great results in practice.

You may note that the sigmoid function doesn’t inherently make the output ‘curved’. Depending on how the model feeds input to the sigmoid function, it can give an output of any distribution. The key point we want to focus on is that it acts as a continuous, differentiable scaling function.
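One way to see this: if the layers before the sigmoid learn to emit the logits of some target distribution, the sigmoid reproduces that distribution exactly, since the logit is the sigmoid’s inverse. A quick numpy check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    # Inverse of the sigmoid, defined on (0, 1).
    return np.log(p / (1.0 - p))

# Pretend the earlier layers learned to output logit(u) for
# uniformly spaced targets u: the sigmoid's output is uniform again,
# so the S-shaped curve imposes no curve on the output distribution.
u = np.linspace(0.01, 0.99, 99)
recovered = sigmoid(logit(u))

print(np.allclose(recovered, u))  # → True
```

So the “shape” of the outputs is entirely up to what the preceding layers learn to feed in; the sigmoid just guarantees the result stays inside the chosen range.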
Cheers