Mac M1 GPUs

Hi,

Lesson 1 mentions that Apple machines do not support Nvidia GPUs, and hence it makes no sense to run the course notebooks on a Mac.
However, the newer Apple Macs with M1-family processors come with up to 32 GPU cores.
What would it involve to make use of these GPUs?

thanks
Norbert


As I understand it, for fastai to make use of these GPUs, the underlying PyTorch framework would need to support them. The PyTorch team seems to be working on it, but I haven't heard of any PyTorch builds that can leverage the M1 architecture (yet).

EDIT: This issue in the PyTorch GitHub repo has some discussion of what's been going on in this regard:
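Once a build with Apple-GPU support lands, a quick sanity check could look like the sketch below. This assumes the backend surfaces under `torch.backends.mps` with `is_built()`/`is_available()` probes, as in PyTorch 1.12+; on older builds or non-Apple hardware the checks simply come back False.

```python
import torch

# Probe whether this PyTorch build ships the Apple-GPU (MPS) backend and
# whether this machine can actually use it. On builds that predate the
# backend, or on non-Apple hardware, both answers are simply False.
mps_module = getattr(torch.backends, "mps", None)
is_built = mps_module is not None and torch.backends.mps.is_built()
is_usable = is_built and torch.backends.mps.is_available()

print(f"MPS compiled in: {is_built}; usable on this machine: {is_usable}")
```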


Thanks, Mike.
I've now subscribed to the GitHub issue.
Browsing through the issue comments, I get the impression that it would be good to have a basic understanding of how PyTorch utilises GPUs and how CUDA works.


The impression I got from some of the posts in that thread was that Apple's politics with FB have something to do with it. Apple facilitated TensorFlow Metal development, but not so much on the PyTorch/Torch side, and to get similar performance the PyTorch team would have to do things from scratch, which is a daunting task.

It would be nice to have PyTorch work on it, but it looks like it's going to be a while before we see it on Apple's M-series architecture.

This article has some numbers on the performance of NumPy etc., and it seems the M1 Max is able to reach about 8 teraflops at 1/8th the power draw of a 3090.

Thanks for sharing this article.


Hi, I wrote an article benchmarking the different Nvidia GPUs and the M1/Pro/Max/Ultra GPUs. They are very, very slow compared to CUDA, so don't expect too much:

Don't get me wrong, I love the new 14", and I think it's probably the best PC on the market right now, but it is no deep learning machine. Then again, I don't think any laptop makes a good deep learning machine.


This just got announced.


The PyTorch MPS backend doesn't run on Intel Macs (yet), unlike TensorFlow's Metal backend. That means you'll have to make sure you have an Apple silicon Mac before running the notebooks with GPU acceleration. You'll also need to watch out for operators like SVD and Cholesky decomposition, which MPS does not support. This is different from CUDA, where almost every operator can run on the GPU.
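A sketch of how one might handle both constraints in a script (hypothetical layout, not from any official example): `PYTORCH_ENABLE_MPS_FALLBACK` is PyTorch's documented opt-in for routing ops MPS doesn't implement back to the CPU, and it has to be set before `torch` is imported. On machines without MPS the code just runs on CUDA or the CPU.

```python
import os

# Opt into automatic CPU fallback for ops MPS doesn't implement
# (e.g. torch.linalg.svd, torch.linalg.cholesky). This must be set
# before torch is imported to take effect.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import torch

# Prefer MPS on Apple silicon, then CUDA, then CPU.
if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# SVD is one of the ops MPS lacks; with the fallback enabled it runs
# transparently on the CPU (at a performance cost) instead of raising.
x = torch.randn(4, 4, device=device)
u, s, vh = torch.linalg.svd(x)
print(device, s.shape)
```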

Wow, nice! Can't wait for @tcapelle's results with v1.12, since I'm looking into getting a 14" MBP myself soon! :smiley:


I think more work is needed from both the Apple and PyTorch teams on this, because the average speedup shown in the blog post is only 15-20x over the CPU. They haven't shown the actual times taken, though, so overall it may be comparable to an NVIDIA laptop GPU. Maybe the stable version will perform better. :crossed_fingers:


Yes, I agree. And this is on their M1 Ultra chip (an average 7x speedup over the CPU for training):

*Testing conducted by Apple in April 2022 using production Mac Studio systems with Apple M1 Ultra, 20-core CPU, 64-core GPU, 128GB of RAM, and 2TB SSD. Tested with macOS Monterey 12.3, prerelease PyTorch 1.12, ResNet50 (batch size=128), HuggingFace BERT (batch size=64), and VGG16 (batch size=64). Performance tests are conducted using specific computer systems and reflect the approximate performance of Mac Studio.


Here you go: Weights & Biases


Thanks Thomas!

The M1 GPU-over-CPU gain is 3x on average (both for total time and for image throughput).

The A6000/M1 GPU image-throughput ratio, though, is 57x, while the A6000/M1 GPU overall-time ratio is 6.8x.

I guess I was expecting better image throughput on the M1, since we've heard so much about the unified memory in that architecture.

BTW, just for the heck of it I wanted to compare these numbers with my 1070 Ti, but copying the train.py file is tricky. There's no way to copy just the text, and if I select and paste it, it introduces some Unicode characters into the source file.
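For what it's worth, a throwaway snippet like this (a hypothetical helper, not part of the benchmark repo) can strip the smart punctuation that rich-text copy/paste tends to inject into source code:

```python
# Map the "smart" punctuation that rich-text copy/paste injects back
# to plain ASCII so the pasted source parses again.
SMART = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en/em dashes
    "\u00a0": " ",                   # non-breaking space
}

def ascii_clean(text: str) -> str:
    return text.translate(str.maketrans(SMART))

# Example: a pasted flag with curly quotes becomes valid again.
print(ascii_clean("--device=\u2018cpu\u2019"))  # --device='cpu'
```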

After cleaning it up I was able to run it, but the download of the PETS dataset failed on my machine after downloading about 8,000 files (out of 25,000).

I'll try to re-run it and see if it restarts from where it left off (at.download() is where it died).

EDIT: So, I re-ran it and it just went with the 7,300 images it had already downloaded. I'm not sure how to make heads or tails of this experiment, but it's at Weights & Biases if anyone wants to look at it.

On a 1070 Ti it took 72 seconds to run an epoch.
On a Xeon 8c from 2012 it took ~945 seconds (run with the --device='cpu' flag).

So, the M1 GPU is about half as fast as a 1070 Ti,
and the M1 GPU is about 6.5x faster than a Xeon 8c.


I posted everything in this repo:


Pretty underwhelming so far, but I guess they're still working on it.


Training Pets for 1 epoch with batch size=64 on a MacBook Pro 14" with an M1 Pro and 16GB of RAM:
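For anyone who wants a rough, self-contained version of this kind of per-device timing on their own machine, here is a minimal sketch. It uses synthetic data and a toy model rather than the actual Pets benchmark script, so the absolute numbers are not comparable; it only illustrates picking the device and timing a fixed number of training steps.

```python
import time
import torch
import torch.nn as nn

# Pick the fastest available backend (MPS on Apple silicon, else CUDA, else CPU).
if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Tiny stand-in for a real vision model, trained on one random batch purely
# to time the device; this is NOT the Pets benchmark itself.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 3, 32, 32, device=device)
y = torch.randint(0, 10, (64,), device=device)

start = time.perf_counter()
for _ in range(10):                 # 10 synthetic "batches"
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
elapsed = time.perf_counter() - start
print(f"{device}: {elapsed:.3f}s for 10 steps, final loss {loss.item():.3f}")
```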


Early days! :sunglasses: