Fastai on Apple M1

mike.moloch · May 23, 2022, 3:55pm

Some more numbers by reddit user u/seraschka (is it the Sebastian Raschka?)

rawmean · May 24, 2022, 10:36pm

Thank you for this patch. Can you please specify how I can apply this patch? Do I need to uninstall fastai (which I’ve installed using conda) first?

pellet · May 24, 2022, 11:26pm

Yeah uninstall fastai and pytorch.
Use the updated environment.yml to install the new conda environment and make sure it installs pytorch 1.13. After that you can build the patched fastai module by running pip install -e fastai

rawmean · May 25, 2022, 12:06am

Thanks. I still get the ‘fake _loader’ error. Do I need to specify any new options related to mps in the dataloader?

rawmean · May 25, 2022, 2:12am

I also had to add the highlighted line as line 29 to load.py.

But, now this is the problem. I guess we cannot use any of the existing model architectures until PyTorch implements them.
NotImplementedError: The operator 'aten::adaptive_max_pool2d.out' is not current implemented for the MPS device.

pellet · May 25, 2022, 4:28am

Ok, I thought I had merged the fix from fast.ai master which set pin_memory_device to an empty string. I didn’t have any trouble running the first part of the first fastbook notebook but haven’t tested it much… I didn’t see that error pop up for me, might need to fallback to cpu for that operation.

gerardo · May 25, 2022, 4:41pm

torch.backends.mps.is_available() is the equivalent to torch.cuda.is_available()

on this forum post @sgugger mentioned.

What is the equivalent when you have an Apple M1 processor?

learn = vision_learner(dls, resnet18, metrics=error_rate)
does not have device to be assigned

gerardo · May 25, 2022, 8:00pm

Honestly have anyone run successfully a test with fastai on Apple M1?
Can you share a notebook?

I found this notebook but does not include any fastai
You can see the GPU (Activity Monitor → Window → GPU history)

github.com

mrdbourke/pytorch-apple-silicon/blob/main/01_cifar10_tinyvgg.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f43f5b74",
   "metadata": {},
   "source": [
    "# PyTorch on Mac CIFAR10 TinyVGG Benchmark\n",
    "\n",
    "> **Note:** As of 21 May 2022, [PyTorch accelerated training for Mac](https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/) is in beta. And is only available through the nightly version of PyTorch.\n",
    "\n",
    "The following notebook tests the speed at which a given device can perform training iterations on the [CIFAR10 dataset](https://pytorch.org/vision/stable/generated/torchvision.datasets.CIFAR10.html) (10 classes, 50,000 training images, 10,000 testing images) training the [TinyVGG architecture from the CNN Explainer as a base](https://poloclub.github.io/cnn-explainer/).\n",
    "\n",
    "It's designed to be a simple test to compare PyTorch compute speeds on Apple Silicon CPU/GPUs.\n",
    "\n",
    "Ideally, PyTorch runs faster on Apple Silicon GPUs than on Apple Silicon CPUs.\n",
    "\n",
    "| Model | Dataset | Train Size | Test Size |\n",
    "| ----- | ----- | ----- | ----- |\n",
    "| [TinyVGG](https://poloclub.github.io/cnn-explainer/) (trained from scratch) | CIFAR10 (from `torchvision.datasets`) | ~50,000 (32x32x3) images | ~10,000 (32x32x3) images|\n",

This file has been truncated. show original

pellet · May 26, 2022, 1:27pm

Actually I fixed this error by chucking this in the top of the notebook i was using(before import torch)

#Crashes unless can fallback to cpu if an operation missing
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

pellet · May 26, 2022, 1:48pm

Yeah I got it running with some MPS support, you can see in this screenshot it took 4:30 to fine tune the resnet34 model in the first notebook 01_intro.ipynb :

Here is a link to my notebook: deep-learning-for-coders/01_intro.ipynb at master · pellet/deep-learning-for-coders · GitHub

mike.moloch · May 26, 2022, 2:11pm

Hi Ben, what are the specs of the M1 machine on which you got these results?

rawmean · May 26, 2022, 7:21pm

But with os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1", fast would run all the built-in architectures (eg, resnet34) on CPU, so it’s as if MPS does not exist and the nightly-build of PyTorch is and the changes that you have made to fastai are not needed.

I think we just need to wait for the PyTorch team to implement some of the layers that are used in all of the built-in architectures.

pellet · May 27, 2022, 3:22am

M1 air

pellet · May 27, 2022, 3:23am

Nah it’s just a fallback - it should only be running functions on the cpu which don’t exist on the gpu. I think the missing function might not even be called often though I could be wrong, there probably isn’t much performance difference.
One pathway to compatibility could be updating the function for detecting the default device to use mps if cuda is not available, I mentioned something similar above.

rawmean · May 27, 2022, 4:02pm

The missing layer is the last layer of the model so it is called all the time. Also, for a model to run on MPS, all layers must be supported on MPS. Models cannot run partially on one device and partially on another device.

Your own number that shows one epoch took more than 4 minutes is a good indication of this.

mike.moloch · May 27, 2022, 5:11pm

@tcapelle posted some new numbers recently for M1 Pro Max (not sure if it’s fast.ai specific though):

tcapelle · May 31, 2022, 5:52pm

I am waiting for fastai on M1, as it has so many dependencies (like spacy, tokenizers, etc…)

iTapAndroid · August 19, 2022, 1:18am

I’ve setup a condo environment on my MacBook Pro with 24 Core M1 Max. I’m able to run the first notebook from the book without any issues. The only change I’ve made is this:
import os os.environ["OMP_NUM_THREADS"] = "1"

When creating a the conda environment, use conda install -c fastchan fastai fastbook jupyterlab sentencepiece.

sentencepiece is an important dependency, but condo doesn’t install it by default for some reason. if you don’t install it, you’ll end up getting some weird dependency error.

Its training the model in reasonable time. Its not as fast as 3080/3090 but its decent, I feel like I can run this whole thing locally on my Mac.

It is in-fact faster than running it on a free paperspace GPU. It took paperspace 1:37 for the same training.

Here’s a video of GPU memory and cores being triggered when training starts:

I don’t actually understand how/why this works with the OMP_NUM_THREADS = 1. It in fact is confusing. The last time I tried this setup about a year ago, this exact setting only made it possible to train the model with CPU, that too on a single thread, which took forever. I know since then PyTorch has officially started supporting M1 Macs, so maybe that’s what has fixed it. If I don’t use this setting, I end up getting an error like this, and it seems to be getting stuck into some infinite loop(sorry they’ll only let me upload one image). I might share more later.

Not sure how it works, but its working, so I’ll take it

mike.moloch · August 20, 2022, 11:19pm

This is quite interesting. It seems fastai started working on M1 processors … What are your MBP specs? And what version of pytorch was it that you were running? These numbers are equivalent to my Dell t3600 Xeon box with 1070ti … so this is definitely “useable” now. Thanks for posting!

iTapAndroid · August 21, 2022, 12:59am

fastai 2.7.9 py_0 fastchan
fastbook 0.0.24 py_0 fastchan
fastcore 1.5.18 py_0 fastchan
pytorch 1.12.1 py3.10_0 fastchan
python 3.10.4 hbdb9e5c_0

that dell comparison… ouch