So, the second question is easy to answer.
Until ReLU became popular, most neural networks used either tanh or a sigmoid activation. Neither of these is linear, and both involve exponential functions, which means that evaluating them and their gradients is more expensive than evaluating a (piecewise) linear function. So yes, other functions have been in use. Because these functions saturate (their gradients shrink toward zero for large inputs), neural networks could not be very deep. ReLU does not have that problem and works well with SGD, so it was a good fit (it came around 2009, I read? not sure about this).
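To make the cost and saturation points concrete, here is a small sketch (my own illustration, not from any particular framework) comparing the gradients of sigmoid and ReLU. The sigmoid gradient collapses toward zero for large inputs, while the ReLU gradient stays at 1 for any positive input:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of sigmoid: s(x) * (1 - s(x)); needs an exp() evaluation
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # derivative of ReLU: just a comparison, no exponentials
    return (x > 0).astype(float)

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(x))  # tiny at both ends: the saturation problem
print(relu_grad(x))     # exactly 1 for every positive input
```

Running this shows `sigmoid_grad` is on the order of 1e-5 at x = 10, which is why gradients vanish as they are multiplied through many saturated layers.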
There are also a lot of other functions based on ReLU (Leaky ReLU, Parametric ReLU), which are likewise piecewise linear; but there are plenty of other activation functions as well (see Rectifier (neural networks) - Wikipedia).
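The two ReLU variants mentioned above differ only in how they treat negative inputs; a minimal sketch (the 0.01 default slope is a common convention, not a requirement):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # fixed small slope alpha for negative inputs instead of a hard zero
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # Parametric ReLU: same shape, but alpha is a learned parameter
    return np.where(x > 0, x, alpha * x)
```

Both stay piecewise linear, but negative inputs still get a small gradient, which avoids "dead" units that a plain ReLU can produce.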
The answer is that an activation function doesn't need to 'predict' a negative value. The point of the activation function is not to produce your final prediction, but to introduce non-linearity into the middle layers of your neural network. You then use an appropriate function at the last layer to get the output values you want, e.g. softmax for classification, or just a linear output for regression.
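The split between hidden-layer non-linearity and output-layer function can be sketched like this (a toy illustration, assuming a 3-class classification head):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability before exponentiating
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# hidden layers use ReLU purely for non-linearity...
hidden = np.maximum(0.0, np.array([-0.5, 1.2, 3.0]))

# ...while the last layer picks the function that matches the task:
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)   # classification: a probability distribution
regression_out = logits   # regression: just the raw linear output
```

The hidden ReLU never needs to emit negative values; the output layer is where the range of the prediction is decided.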