Finiteness and the use of float16 in Deep Learning

Hi everyone, I've been having some questions about finiteness and deep learning, and I thought I'd share them with you.

The main concept behind training a neural network is gradient descent. You treat your network as a real-valued function that depends on real-valued weights and makes predictions on real-valued inputs. However, we all know that we work with computers, and computers don't deal with real numbers. As a result, we're not performing a "real" gradient descent, but the idea is that since we have good precision, it works just the same. Our weights can only take a finite number of values, but that's "good enough".
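
To make that finiteness concrete, here's a quick sketch (just an illustration, assuming NumPy is available) of the spacing between representable values around 1.0, i.e. the smallest step a weight near 1.0 can actually take:

```python
import numpy as np

# Spacing between consecutive representable values around 1.0:
# the smallest step a weight near 1.0 can actually take.
print(np.spacing(np.float64(1.0)))   # ~2.22e-16
print(np.spacing(np.float32(1.0)))   # ~1.19e-07

# An update smaller than half that spacing is silently lost.
w = np.float32(1.0)
print(w + np.float32(1e-8) == w)     # True: the update rounds away
```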

However, when you start working with float16, that "we have enough precision" argument seems much less convincing to me. Are we still performing an approximation of gradient descent, or some kind of gradient descent on a discrete function?
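
For instance (a minimal sketch, again assuming NumPy, with a made-up weight, learning rate, and gradient), a perfectly ordinary update can fall below the float16 grid and simply disappear:

```python
import numpy as np

w = np.float16(2.0)        # a weight stored in float16
lr = 1e-3                  # a typical learning rate
grad = 0.1                 # a modest gradient

step = np.float16(lr * grad)    # ~1e-4, representable on its own
print(np.spacing(w))            # ~0.00195: the grid step around 2.0
print(w - step == w)            # True: the update rounds away entirely
```

As far as I understand, this is exactly the effect that pushes mixed-precision setups to keep a float32 "master" copy of the weights and use float16 only for the forward/backward pass.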

If that's not convincing, imagine using float8: would our usual thinking about gradient descent still apply there?
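
To make the discreteness really blatant, here's a back-of-the-envelope count (ignoring the bit patterns spent on NaN/inf) of how many distinct values each weight can even take:

```python
# An n-bit format has at most 2**n distinct bit patterns, so each weight
# lives on a fixed, finite grid no matter how we train.
for bits, name in [(8, "float8"), (16, "float16"), (32, "float32")]:
    print(f"{name}: at most {2**bits:,} values per weight")
```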

Where am I going with this? Well, I'm curious, and I would like to investigate these phenomena.

-What do you think of this? How does the finiteness of the weight space interact with the way we want to train the network?

-Do you know of any good papers that deal with this topic?