I’m really interested in learning how to optimize large models for faster inference on a given hardware target, with a focus on improving throughput and reducing latency. I’d love to explore key techniques like model distillation, pruning, quantization, and specialized CUDA kernels.
Can you fine folks recommend courses, books, articles, or comprehensive blog posts that provide practical examples and in-depth insights on these topics?
Any suggestions would be greatly appreciated. Thanks!