Mac M1 GPUs

Hi,

Lesson 1 mentions that Apple machines do not support Nvidia GPUs, and hence it makes no sense to run the course notebooks on a Mac.
However, the newer Apple Macs with M1-family processors come with up to 32 GPU cores.
What would it involve to make use of these GPUs?

thanks
Norbert


As I understand it, for fastai to make use of these GPUs, the underlying PyTorch framework would need to support them. The PyTorch team seems to be working on it, but I haven't heard of any PyTorch builds that can leverage the M1 architecture (yet).

EDIT: This issue in the PyTorch GitHub repo has some discussion of what's been going on in this regard:
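Once a build with Apple-GPU support lands, a quick sanity check could look like the sketch below. This assumes the backend surfaces under `torch.backends.mps` with `is_built()`/`is_available()` probes, as in PyTorch 1.12+; on older builds or non-Apple hardware the checks simply come back False.

```python
import torch

# Probe whether this PyTorch build ships the Apple-GPU (MPS) backend and
# whether this machine can actually use it. On builds that predate the
# backend, or on non-Apple hardware, both answers are simply False.
mps_module = getattr(torch.backends, "mps", None)
is_built = mps_module is not None and torch.backends.mps.is_built()
is_usable = is_built and torch.backends.mps.is_available()

print(f"MPS compiled in: {is_built}; usable on this machine: {is_usable}")
```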


Thanks, Mike.
I've now subscribed to the GitHub issue.
Browsing through the issue comments, I get the impression that it would be good to have a basic understanding of how PyTorch utilises GPUs and how CUDA works.


The impression I got from some of the posts in that thread was that Apple's politics with FB have something to do with it. Apple facilitated TensorFlow Metal development, but not so much on the PyTorch/Torch side, and to get similar performance the PyTorch team would have to do things from scratch, which is a daunting task.

It would be nice to have PyTorch work on it, but it looks like it's going to be a while before we see it on Apple's M-series architecture.

This article has some numbers on the performance of NumPy etc., and it seems the M1 Max is able to reach about 8 teraflops at 1/8th the power draw of a 3090.

Thanks for sharing this article.


Hi, I wrote an article benchmarking the different Nvidia GPUs and the M1/Pro/Max/Ultra GPUs. They are very, very slow compared to CUDA, so don't expect too much:

Don't get me wrong, I love the new 14", and I think it's probably the best PC on the market right now, but it is no deep learning machine. Then again, I don't think any laptop makes a good deep learning machine.


This just got announced.


The PyTorch MPS backend doesn't run on Intel Macs (yet), unlike TensorFlow's Metal backend. That means you'll have to make sure you have an Apple silicon Mac before running the notebooks with GPU acceleration. You'll also need to watch out for operators like SVD and Cholesky decomposition, which MPS does not support. This is different from CUDA, where almost every operator can run on the GPU.
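A sketch of how one might handle both constraints in a script (hypothetical layout, not from any official example): `PYTORCH_ENABLE_MPS_FALLBACK` is PyTorch's documented opt-in for routing ops MPS doesn't implement back to the CPU, and it has to be set before `torch` is imported. On machines without MPS the code just runs on CUDA or the CPU.

```python
import os

# Opt into automatic CPU fallback for ops MPS doesn't implement
# (e.g. torch.linalg.svd, torch.linalg.cholesky). This must be set
# before torch is imported to take effect.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import torch

# Prefer MPS on Apple silicon, then CUDA, then CPU.
if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# SVD is one of the ops MPS lacks; with the fallback enabled it runs
# transparently on the CPU (at a performance cost) instead of raising.
x = torch.randn(4, 4, device=device)
u, s, vh = torch.linalg.svd(x)
print(device, s.shape)
```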

Wow, nice! Can't wait for @tcapelle's results with v1.12, since I'm looking into getting a 14" MBP myself soon! :smiley:


I think more work is needed from both the Apple and PyTorch teams on this, because the average speedup shown in the blog post is only 15-20x over the CPU. They haven't shown the actual times taken, though, so overall it may be comparable to an NVIDIA laptop GPU. Maybe the stable version will perform better. :crossed_fingers:


Yes, I agree. And this is on their M1 Ultra chip (an average 7x speedup over the CPU for training):

*Testing conducted by Apple in April 2022 using production Mac Studio systems with Apple M1 Ultra, 20-core CPU, 64-core GPU, 128GB of RAM, and 2TB SSD. Tested with macOS Monterey 12.3, prerelease PyTorch 1.12, ResNet50 (batch size=128), HuggingFace BERT (batch size=64), and VGG16 (batch size=64). Performance tests are conducted using specific computer systems and reflect the approximate performance of Mac Studio.


Here you go: Weights & Biases


Thanks Thomas!

The M1 GPU-over-CPU gain is 3x on average (both for total time and for image throughput).

The A6000/M1 GPU image-throughput ratio, though, is 57x, while the A6000/M1 GPU overall-time ratio is 6.8x.

I guess I was expecting better image throughput on the M1, since we've heard so much about the unified memory in that architecture.

BTW, just for the heck of it I wanted to compare these numbers with my 1070 Ti, but copying the train.py file is tricky. There's no way to copy just the text, and if I select and paste it, it introduces some Unicode characters into the source file.
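For what it's worth, a throwaway snippet like this (a hypothetical helper, not part of the benchmark repo) can strip the smart punctuation that rich-text copy/paste tends to inject into source code:

```python
# Map the "smart" punctuation that rich-text copy/paste injects back
# to plain ASCII so the pasted source parses again.
SMART = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en/em dashes
    "\u00a0": " ",                   # non-breaking space
}

def ascii_clean(text: str) -> str:
    return text.translate(str.maketrans(SMART))

# Example: a pasted flag with curly quotes becomes valid again.
print(ascii_clean("--device=\u2018cpu\u2019"))  # --device='cpu'
```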

After cleaning it up I was able to run it, but the download of the PETS dataset failed on my machine after downloading about 8,000 files (out of 25,000).

I'll try to re-run it and see if it restarts from where it left off (at.download() is where it died).

EDIT: So, I re-ran it and it just went with the 7,300 images it had already downloaded. I'm not sure how to make heads or tails of this experiment, but it's at Weights & Biases if anyone wants to look at it.

On a 1070 Ti it took 72 seconds to run an epoch.
On a Xeon 8c from 2012 it took ~945 seconds (run with the --device='cpu' flag).

So, the M1 GPU is about half as fast as a 1070 Ti,
and the M1 GPU is about 6.5x faster than a Xeon 8c.


I posted everything in this repo:


Pretty underwhelming so far, but I guess they're still working on it.


Training Pets for 1 epoch with batch size=64 on a MacBook Pro 14" with an M1 Pro and 16GB of RAM:
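For anyone who wants a rough, self-contained version of this kind of per-device timing on their own machine, here is a minimal sketch. It uses synthetic data and a toy model rather than the actual Pets benchmark script, so the absolute numbers are not comparable; it only illustrates picking the device and timing a fixed number of training steps.

```python
import time
import torch
import torch.nn as nn

# Pick the fastest available backend (MPS on Apple silicon, else CUDA, else CPU).
if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Tiny stand-in for a real vision model, trained on one random batch purely
# to time the device; this is NOT the Pets benchmark itself.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 3, 32, 32, device=device)
y = torch.randint(0, 10, (64,), device=device)

start = time.perf_counter()
for _ in range(10):                 # 10 synthetic "batches"
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
elapsed = time.perf_counter() - start
print(f"{device}: {elapsed:.3f}s for 10 steps, final loss {loss.item():.3f}")
```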


Early days! :sunglasses: