MIPS uses a hash function to find the maximum inner products. (MIPS is approximate: it finds, with high probability, the top X% of inner products.)
Example:
Given an input x:
The activation a of a unit is the inner product of its weight vector w with the input x: w·x.
If the next layer has 100 hidden units, you have 100 weight vectors. With MIPS you efficiently find the 5 weight vectors out of the 100 that produce the highest activations (see the sketch below).
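To make the example concrete, here is a minimal sketch of what MIPS approximates: picking the top 5 of 100 weight vectors by inner product with the input. The exact top-k is computed here for clarity; MIPS would return a similar set with high probability without evaluating all 100 products. The dimensions and random data are just placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                # input dimension (assumed)
W = rng.standard_normal((100, d))     # 100 weight vectors, one per hidden unit
x = rng.standard_normal(d)            # input vector

activations = W @ x                   # all 100 inner products (the part MIPS avoids)
top5 = np.argsort(activations)[-5:]   # indices of the 5 highest activations
print(top5, activations[top5])
```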
About the original paper:
They introduce the idea that, for each layer, you use MIPS to find the weight vectors that will have the highest activations.
For the backward pass, you update only the weight vectors with the highest activations.
In each layer the computational cost reduces to:
– Finding the top 5% of weight vectors (close to linear time)
– Computing the forward and backward pass only for that 5% (see the sketch after this list)
– Overall, you need only around 5% of the computation compared to evaluating the whole network
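As a rough sketch (not the paper's actual code), this is what "forward and backward only for the 5%" could look like for a single layer: the hash lookup is stubbed out with a random active set, and a toy squared-error loss stands in for the real objective.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_units, lr = 64, 1000, 0.01
W = rng.standard_normal((n_units, d))
x = rng.standard_normal(d)

# Stand-in for the indices MIPS would return (~5% of the units).
active = rng.choice(n_units, size=n_units // 20, replace=False)

# Forward: only the selected rows of W are multiplied with x.
a = W[active] @ x                      # shape (50,), instead of (1000,)
y = np.maximum(a, 0.0)                 # ReLU on the active units only

# Backward (toy loss 0.5 * ||y||^2): gradient flows only to the active rows.
grad_a = y * (a > 0)                   # dLoss/da for the active units
W[active] -= lr * np.outer(grad_a, x)  # update only ~5% of the weight matrix
```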
One question I couldn't figure out:
How do you use a batch_size with their method?
Processing multiple inputs at once makes the matrix multiplication more efficient and exploits parallelism.
In my understanding the hash function works on only one vector at a time, so can I only use batch_size=1? (See the sketch below.)
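To illustrate my concern (this is just my reading, not the paper's answer): if each input in a batch gets its own hash lookup, each row ends up with a different active set, so a single dense matmul over the batch no longer applies directly. `lookup_active_units` below is a hypothetical stand-in for the hash-table query.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_units, batch_size = 64, 1000, 32
W = rng.standard_normal((n_units, d))
X = rng.standard_normal((batch_size, d))

def lookup_active_units(x, n_units, k=50):
    # Hypothetical placeholder: a real implementation would query LSH tables.
    return rng.choice(n_units, size=k, replace=False)

# Per-sample loop: active sets differ per row, which is why one
# batched matrix multiplication over the whole batch is not straightforward.
outputs = []
for x in X:
    active = lookup_active_units(x, n_units)
    outputs.append((active, W[active] @ x))
```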
I still find it interesting that you can reduce the computational costs with this method.
They implemented it for CPU, but I am still wondering whether it is possible to implement it in tensorflow (help is welcome).
How do you know the highest X% of activations in each layer without actually performing the forward/backward prop?
I've spent a great deal of time working without any GPU, and also on things like freezing and optimizing tensorflow graphs for inference on mobile apps. I'd be keen to learn more about practical methods for reducing computational complexity.
MIPS (maximum inner product search) can find the weights with the highest activations without computing the whole matrix multiplication. The linked video explains it.
Short summary:
You hash each weight vector
You hash the input vector
You get back a subset of weight vectors that, with high probability, produces the highest X% of activations
With this subset of weight vectors, you compute the actual activations (see the sketch after this list)
The hash function is different from the usual hash functions
To do maximum inner product search it needs some specific properties -> it is explained really well in the 20-minute talk
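A minimal sketch of the hash-then-rerank flow from the summary above, using plain random-hyperplane (SimHash) signatures and a single hash table. This only shows the bucket-lookup flow; the actual MIPS construction from the talk adds an asymmetric transform so that inner products, not just angles, are preserved, and in practice several hash tables are combined to avoid empty candidate sets.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
d, n_units, n_bits = 64, 1000, 8
W = rng.standard_normal((n_units, d))
x = rng.standard_normal(d)
planes = rng.standard_normal((n_bits, d))    # random hyperplanes -> 8-bit codes

def signature(v):
    bits = (planes @ v) > 0
    return bits.tobytes()                    # bucket key from the sign pattern

# 1) Hash each weight vector into a bucket.
buckets = defaultdict(list)
for i, w in enumerate(W):
    buckets[signature(w)].append(i)

# 2) Hash the input and take the colliding weight vectors as candidates.
candidates = buckets.get(signature(x), [])

# 3) Compute the actual activations only for that subset.
acts = {i: W[i] @ x for i in candidates}
print(len(candidates), "candidates out of", n_units)
```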
Is that what their paper is doing? If that’s the case, it doesn’t seem like hashing is the secret: it seems like skipping a bunch of the calculations is the secret
It’s way too late and I am way too tired to dig through this right now, but I am going to check it out tomorrow.