Quantisation and Pruning Research with Fast.ai

I wanted to introduce quantisation and pruning research to fast.ai, as model compression is fast becoming a necessity now that more and more companies look to productionise their models. I was also wondering if we could implement quantisation and pruning from scratch in fastai so that people can learn how it is done.

It is quite easy! I have made two notebooks, one each for quantisation and pruning: they train an MNIST model, quantise it to 8 bits (both weights and activations), and prune the network to 90% sparsity, both with next to no loss in accuracy!

I would be really grateful if @jeremy or @sgugger could take a look and check whether this is something of interest to add to the course syllabus.

Before that, I would like to give a brief introduction to quantisation and pruning.

Nowadays, there is a need to take trained floating-point models and deploy them to edge devices. One popular approach is to quantise the weights and activations of a neural network to a lower bit width (e.g. 8 bits or even 4 bits). The benefits of this are two-fold:

  1. Some accelerators perform computation at lower bit widths much faster than fp16 or fp32 computation.
  2. The model takes up less space, and the savings grow with every bit removed from the tensor data type; going from fp32 to int8 is already a 4x reduction (a quick size check follows the list).
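As a quick illustration of the storage point (a sketch of my own, not from the notebooks): the same number of elements stored as fp32 versus int8 differs by a factor of four.

import torch

# Quick size check: the same 1M-element tensor stored as fp32 vs int8.
x_fp32 = torch.randn(1_000_000)
x_int8 = torch.randint(0, 256, (1_000_000,), dtype=torch.uint8)
print(x_fp32.element_size() * x_fp32.numel())  # 4,000,000 bytes
print(x_int8.element_size() * x_int8.numel())  # 1,000,000 bytes, 4x smaller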

People have tried other means of compressing a model; one of them is pruning.
Pruning means setting some of the weights of a neural network to zero, i.e. deliberately introducing sparsity into the network.

The benefit is that you can potentially skip the useless multiplications by zero, giving a potential saving in computation. Research has shown that even after pruning ~80% of the weights (fine-grained pruning), the network preserves its accuracy, which is a very surprising result. Coarse-grained pruning (setting all weights of a channel to zero) also works to an extent, but results in significantly more accuracy loss. This is an active research area.
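To make the distinction concrete, here is a small sketch (my own illustration, not code from the notebooks) contrasting a fine-grained mask with a coarse, channel-level mask on a conv weight:

import torch

# Conv weight of shape (out_channels, in_channels, kH, kW).
w = torch.randn(8, 4, 3, 3)

# Fine-grained (unstructured): zero the ~80% smallest-magnitude weights anywhere.
threshold = w.abs().flatten().kthvalue(int(0.8 * w.numel())).values
fine_mask = (w.abs() > threshold).float()

# Coarse-grained (structured): zero entire output channels with the smallest L1 norm.
channel_norms = w.abs().sum(dim=(1, 2, 3))
weakest = channel_norms.argsort()[:4]          # drop the 4 weakest of 8 channels
coarse_mask = torch.ones_like(w)
coarse_mask[weakest] = 0.0

print(1 - fine_mask.mean().item(), 1 - coarse_mask.mean().item())  # ~0.8 and 0.5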

Quantisation generally works through a scale value and a zero-point value, so each quantised tensor needs to carry the integer data together with its scale and zero point. The scale and zero point are needed to convert between the quantised and dequantised (floating-point) representations.
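As a minimal sketch of the idea (generic affine quantisation, not the exact code from my notebook):

import torch

# Affine (asymmetric) quantisation: map floats to integers in [0, 2^b - 1]
# using a scale and a zero point; dequantisation inverts the mapping.
def quantise(x: torch.Tensor, num_bits: int = 8):
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min().item() / scale.item()))
    zero_point = max(qmin, min(qmax, zero_point))
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return q.to(torch.uint8), scale.item(), zero_point

def dequantise(q: torch.Tensor, scale: float, zero_point: int):
    return scale * (q.float() - zero_point)

x = torch.randn(4, 4)
q, scale, zp = quantise(x)
print((x - dequantise(q, scale, zp)).abs().max())  # small quantisation error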

There are two ways to quantise a model:

  1. Post-training quantisation: quantises an already trained model, with no retraining required (works well down to 8 bits).
  2. Quantisation-aware training: trains the model to be robust to quantisation (works well for aggressive quantisation schemes, down to 4 bits).

I have successfully implemented the post-training quantisation algorithms and was able to quantise an MNIST model down to 8 bits with next to no accuracy loss. Going down to 4 bits resulted in the model diverging. I am currently working on quantisation-aware training. If you want to see how post-training quantisation works, please check out this Google Colab notebook.
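For the quantisation-aware-training side, the usual trick is "fake quantisation": quantise in the forward pass, but let gradients pass straight through in the backward pass. The sketch below is a generic illustration of that idea (a straight-through estimator), not the algorithm from my notebook:

import torch

class FakeQuantise(torch.autograd.Function):
    # Rounds values to an 8-bit grid in the forward pass, and passes
    # gradients through unchanged (straight-through estimator).
    @staticmethod
    def forward(ctx, x, num_bits=8):
        qmin, qmax = 0, 2 ** num_bits - 1
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = torch.round(qmin - x.min() / scale)
        q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
        return (q - zero_point) * scale        # dequantise straight away

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None               # identity gradient

x = torch.randn(10, requires_grad=True)
y = FakeQuantise.apply(x)
y.sum().backward()                             # gradients flow as if identity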

Now, let’s come to pruning:

Pruning is a very general technique, and there are many ways to perform it. As far as I know, there is generally a “pruning schedule”: the researcher decides when to prune what percentage of the weights (i.e. the degree of sparsity of each layer). They may prune some layers and leave others as is, and slowly increase the sparsity of the pruned layers over the course of training. There are also different types of pruning: structured pruning (e.g. removing full channels of a conv kernel, or reducing a dimension of a fully connected layer by one) and unstructured pruning (zeroing out individual weights with no structural constraint).
fastai could potentially offer both structured and unstructured pruning to help out researchers. If you would like to see pruning in action, I have tried pruning an MNIST model using the algorithm from the Google paper “To Prune or Not to Prune”. It is unstructured pruning with 90% sparsity, and I got roughly the same accuracy as the un-pruned model. This is the Google Colab link for it.
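For reference, this is a minimal sketch of gradual magnitude pruning in the spirit of that paper; the schedule constants and helper names are my own illustration, not the notebook's code:

import torch

def sparsity_at(step, start_step, end_step, final_sparsity, initial_sparsity=0.0):
    # Polynomial (cubic) ramp from the initial to the final sparsity target.
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - progress) ** 3

def prune_by_magnitude(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Zero out the smallest-magnitude weights in place; return the binary mask.
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()
    weight.data.mul_(mask)
    return mask

w = torch.randn(256, 256)
mask = prune_by_magnitude(w, sparsity_at(1000, 0, 1000, final_sparsity=0.9))
print(1 - mask.mean().item())                  # achieved sparsity, ~0.9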

Describe alternatives you’ve considered
Right now, PyTorch doesn’t have quantisation and pruning support; however, that is in the works. We could either wait for them to complete their work, or we could implement a small library ourselves.

The use case I am trying to target is for fastai to become a playground where researchers can test out quantisation and pruning on their models and potentially implement novel algorithms on top of its base support.

If any of you want to learn more about quantisation, I have listed the resources I learnt from below; they were invaluable.

Benoit Jacob et al.’s Quantisation Paper (Google)
Raghuraman Krishnamoorthi’s Paper on Quantisation (Google; he’s now at Facebook)
Distiller Docs on Quantisation
Gemmlowp’s Quantisation Tutorial


I think this is very cool - good job implementing things already! I have been wanting to implement pruning into fastai and would be interested in helping out.


Thank you!

If you want, you could perhaps extend the pruning demo to support structured pruning? That way we could get some speed-ups on today’s PyTorch kernels instead of waiting for sparse kernels.


Hi all, is there any way to quantize and prune a fastai (with the timm library) trained model, so that it can be deployed to mobile?
Currently, I am doing the following:

import torch

effb3_model = learner_effb3.model.eval()

backend = "qnnpack"

effb3_model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend
model_static_quantized = torch.quantization.prepare(effb3_model, inplace=False)
model_static_quantized = torch.quantization.convert(model_static_quantized, inplace=False)
print_size_of_model(model_static_quantized)

But I am facing the following error while calling the model for inference:

RuntimeError: Could not run 'aten::thnn_conv2d_forward' with arguments from the 'QuantizedCPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::thnn_conv2d_forward' is only available for these backends: [CPU, CUDA, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
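I was wondering whether the model needs explicit quant/dequant boundaries (QuantStub/DeQuantStub, or the QuantWrapper convenience class) so that quantised tensors never reach modules that only have float kernels; this is just my guess, and the sketch below is untested on this model:

import torch

# Untested guess: wrap the float model so inputs are quantised on entry and
# dequantised on exit, then calibrate before converting.
wrapped = torch.quantization.QuantWrapper(effb3_model)
wrapped.qconfig = torch.quantization.get_default_qconfig("qnnpack")
torch.backends.quantized.engine = "qnnpack"

prepared = torch.quantization.prepare(wrapped, inplace=False)
# ... run a few representative batches through `prepared` here to calibrate ...
model_static_quantized = torch.quantization.convert(prepared, inplace=False)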

Thanks for any help.