MIPS uses a hash function to find the maximum inner products. (MIPS is approximate: it finds, with high probability, the top X% of inner products.)
Example:
Given an input x:
The activation a of a unit is the inner product of its weight vector w with the input x: w·x.
If the next layer has 100 hidden units, you have 100 weight vectors. With MIPS you efficiently find the 5 weight vectors out of the 100 that produce the highest activations (see the sketch below).
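To make the example concrete, here is a minimal sketch of what MIPS approximates: picking the top 5 of 100 weight vectors by inner product with the input. The exact top-k is computed here for clarity; MIPS would return a similar set with high probability without evaluating all 100 products. The dimensions and random data are just placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                # input dimension (assumed)
W = rng.standard_normal((100, d))     # 100 weight vectors, one per hidden unit
x = rng.standard_normal(d)            # input vector

activations = W @ x                   # all 100 inner products (the part MIPS avoids)
top5 = np.argsort(activations)[-5:]   # indices of the 5 highest activations
print(top5, activations[top5])
```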
About the original paper:
They introduce the idea that, for each layer, you use MIPS to find the weight vectors that will have the highest activations.
For the backward pass, you update only the weight vectors with the highest activations.
In each layer the computational cost reduces to:
– Finding the top 5% of weight vectors (close to linear time)
– Computing the forward and backward pass only for that 5% (see the sketch after this list)
– Overall, you need only around 5% of the computation compared to evaluating the whole network
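As a rough sketch (not the paper's actual code), this is what "forward and backward only for the 5%" could look like for a single layer: the hash lookup is stubbed out with a random active set, and a toy squared-error loss stands in for the real objective.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_units, lr = 64, 1000, 0.01
W = rng.standard_normal((n_units, d))
x = rng.standard_normal(d)

# Stand-in for the indices MIPS would return (~5% of the units).
active = rng.choice(n_units, size=n_units // 20, replace=False)

# Forward: only the selected rows of W are multiplied with x.
a = W[active] @ x                      # shape (50,), instead of (1000,)
y = np.maximum(a, 0.0)                 # ReLU on the active units only

# Backward (toy loss 0.5 * ||y||^2): gradient flows only to the active rows.
grad_a = y * (a > 0)                   # dLoss/da for the active units
W[active] -= lr * np.outer(grad_a, x)  # update only ~5% of the weight matrix
```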
One question I couldn't figure out:
How do you use a batch_size with their method?
Processing multiple inputs at once makes the matrix multiplication more efficient and exploits parallelism.
In my understanding the hash function works on only one vector at a time, so can I only use batch_size=1? (See the sketch below.)
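To illustrate my concern (this is just my reading, not the paper's answer): if each input in a batch gets its own hash lookup, each row ends up with a different active set, so a single dense matmul over the batch no longer applies directly. `lookup_active_units` below is a hypothetical stand-in for the hash-table query.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_units, batch_size = 64, 1000, 32
W = rng.standard_normal((n_units, d))
X = rng.standard_normal((batch_size, d))

def lookup_active_units(x, n_units, k=50):
    # Hypothetical placeholder: a real implementation would query LSH tables.
    return rng.choice(n_units, size=k, replace=False)

# Per-sample loop: active sets differ per row, which is why one
# batched matrix multiplication over the whole batch is not straightforward.
outputs = []
for x in X:
    active = lookup_active_units(x, n_units)
    outputs.append((active, W[active] @ x))
```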
I still find it interesting that you can reduce the computational costs with this method.
They implemented it for CPU, but I am still wondering whether it is possible to implement it in tensorflow (help is welcome).
How do you know the highest X% of activations in each layer without actually performing the forward/backward prop?
I've spent a great deal of time working without any GPU, and also on things like freezing and optimizing tensorflow graphs for inference on mobile apps. I'd be keen to learn more about practical methods for reducing computational complexity.
MIPS (maximum inner product search) can find the weights with the highest activations without computing the whole matrix multiplication. The linked video explains it.
Short summary:
You hash each weight vector
You hash the input vector
You get back a subset of weight vectors that, with high probability, produces the highest X% of activations
With this subset of weight vectors, you compute the actual activations (see the sketch after this list)
The hash function is different from the usual hash functions
To do maximum inner product search it needs some specific properties -> it is explained really well in the 20-minute talk
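A minimal sketch of the hash-then-rerank flow from the summary above, using plain random-hyperplane (SimHash) signatures and a single hash table. This only shows the bucket-lookup flow; the actual MIPS construction from the talk adds an asymmetric transform so that inner products, not just angles, are preserved, and in practice several hash tables are combined to avoid empty candidate sets.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
d, n_units, n_bits = 64, 1000, 8
W = rng.standard_normal((n_units, d))
x = rng.standard_normal(d)
planes = rng.standard_normal((n_bits, d))    # random hyperplanes -> 8-bit codes

def signature(v):
    bits = (planes @ v) > 0
    return bits.tobytes()                    # bucket key from the sign pattern

# 1) Hash each weight vector into a bucket.
buckets = defaultdict(list)
for i, w in enumerate(W):
    buckets[signature(w)].append(i)

# 2) Hash the input and take the colliding weight vectors as candidates.
candidates = buckets.get(signature(x), [])

# 3) Compute the actual activations only for that subset.
acts = {i: W[i] @ x for i in candidates}
print(len(candidates), "candidates out of", n_units)
```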
Is that what their paper is doing? If that’s the case, it doesn’t seem like hashing is the secret: it seems like skipping a bunch of the calculations is the secret
It’s way too late and I am way too tired to dig through this right now, but I am going to check it out tomorrow.