Just received my Intel Arc A770 GPU

Well, this arrived this morning.

Hopefully, I can get Stable Diffusion working on it with OpenVINO. I was hoping the pytorch-directml package would be further along by the time the production cards came out, but I’ll see what I can train with that as it is. Getting fastai to work with pytorch-directml might take more work than I have time for during the course, but I can at least see what breaks.


Would be interesting to see some benchmarks once/if/when fastai starts working on it.

I’ll start with plain PyTorch using the directml package and see if it trains a baseline model like ResNet. I’ll probably test that tomorrow and will share the results here. Fingers crossed that the in-development pytorch-directml package and the in-development GPU drivers play nice with each other.

Intel needs to prioritize training in addition to inference with these cards if they want to market them for their deep learning capabilities. It kind of sucks if users still need to pay for an Nvidia card (either in the cloud or personal hardware) to train models.


I haven’t seen any technical reviews so far of Intel’s new GPU family and its compatibility with torch/tf. It would be very interesting to see training benchmarks and the extent of any compatibility issues. At a price point 10-12x lower than Nvidia’s, this would be a very good, cost-efficient GPU to buy, even if it works with some limitations.

Update #1

I installed the Arc card on my desktop with the latest driver. Just moving around Windows, it feels way more stable than the preproduction card I got last year.

I tested the card’s performance on the OpenVINO-Unity project from one of my tutorials. I noticed performance heavily depends on PCIe bandwidth.

I first tried it in the second slot with my Nvidia GPU in the first slot. On my motherboard, this configuration results in PCIe Gen3x4 for the second slot. Framerates in the Unity Editor topped out at around 120fps (fp16). Still better than CPU performance, but not what I would expect from this card.

I then put the Arc in the first slot with PCIe Gen4x16. In this configuration, the card easily maintained 160fps (fp16). In the CLI demo, which uses the same model but has fewer bottlenecks, performance doubled.
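For a rough sense of how different the two slots are, here is a back-of-the-envelope sketch of the theoretical link bandwidth. It assumes the standard per-lane transfer rates (8 GT/s for Gen3, 16 GT/s for Gen4) and 128b/130b encoding; the function name is made up for illustration, and real-world throughput will be lower than these numbers.

```python
# Rough theoretical PCIe bandwidth per direction.
# Assumes 128b/130b encoding (used by PCIe Gen3 and Gen4);
# measured throughput on real hardware is lower.
GIGATRANSFERS_PER_LANE = {3: 8.0, 4: 16.0}  # GT/s per lane


def pcie_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Return the approximate one-way bandwidth in GB/s for a PCIe link."""
    raw_gbit = GIGATRANSFERS_PER_LANE[gen] * lanes
    usable_gbit = raw_gbit * 128 / 130  # subtract encoding overhead
    return usable_gbit / 8              # bits -> bytes


second_slot = pcie_bandwidth_gb_s(gen=3, lanes=4)   # Gen3x4
first_slot = pcie_bandwidth_gb_s(gen=4, lanes=16)   # Gen4x16
ratio = first_slot / second_slot

print(f"Gen3x4:  {second_slot:.2f} GB/s")
print(f"Gen4x16: {first_slot:.2f} GB/s")
print(f"ratio:   {ratio:.0f}x")
```

On paper that is roughly 3.9 GB/s versus 31.5 GB/s, so the second slot has about an eighth of the bandwidth. How much of the framerate gap that explains is a separate question, since the Unity Editor adds its own overhead.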

For reference, using the DirectML execution provider with ONNX Runtime on my Titan RTX tops out at around 140fps (fp32) with the same model. The Arc card hovers around 120fps (fp32) in the DirectML project.

I did notice an odd quirk that might be related to the Xe Matrix Extensions or XMX (think of Tensor cores for Nvidia). The Arc card with the OpenVINO demo seems sensitive to the input resolution.

I use a default resolution of 398x224 (for a 16:9 aspect ratio), which translates to a 384x224 (divisible by 32) input resolution for the YOLOX model. At this resolution, the model detects the same hand gestures with the Arc card as the CPU. However, the confidence scores are much lower, and the bounding box dimensions are slightly different (but still usable).
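The "divisible by 32" step can be sketched like this. It is only an illustration: the helper name is hypothetical, and rounding down (rather than to the nearest multiple) is an assumption that happens to match the 398x224 to 384x224 example.

```python
def snap_to_multiple(width: int, height: int, multiple: int = 32) -> tuple:
    """Round each dimension down to a multiple of `multiple`.

    Never returns less than one full multiple per dimension.
    """
    def snap(value: int) -> int:
        return max(multiple, (value // multiple) * multiple)

    return snap(width), snap(height)


# Default resolution from the post
print(snap_to_multiple(398, 224))
```

Dimensions that are already multiples of 32 pass through unchanged.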

Moving to an input resolution of 448x256 gets closer to the CPU confidence scores and bounding boxes. I then moved to an input resolution of 896x512, and the OpenVINO demo crashed with the Arc card (but not for the CPU or iGPU). It did not crash using an even higher resolution of 1120x640 (approximately 65fps for those curious).

None of these issues occurred with the Arc card in the DirectML project, which does not use XMX.

Next, I’ll set up a conda environment with pytorch-directml in WSL2 and see if I can train any models. As far as I know, this would currently be the only way to train models with an Arc card until the main libraries add support for Intel GPUs.


Update #2

I set up a conda environment in WSL with the pytorch-directml package and downloaded the sample repo provided by Microsoft. The pytorch-directml package requires Python 3.8, and the sample repo uses torchvision 0.9.0.

I was able to train a ResNet50 model using the sample training script. GPU memory usage was volatile when using a batch size higher than 4. The ResNet50 training script used about 3.6 GB of GPU memory at a batch size of 4 but spiked to nearly all 16 GB at a batch size of 8.

I then attempted to train the style transfer model included with the PyTorch examples repo and hit the wall of unimplemented operators.

For those curious, here is the PyTorch DirectML Operator Roadmap. There is some basic stuff missing still.


Update #3

A quick update. I updated OpenVINO from 2022.1 to the new 2022.2, which resolved the accuracy/confidence score issues I mentioned earlier.


My contact at Intel said work is ongoing for PyTorch and TensorFlow versions that should work with Arc GPUs. I probably won’t be able to talk about anything pre-release (not that I have access to the repositories yet), but I’ll see what I can share once it’s ready to try.


@cjmills really interesting results. Would you say the price/performance ratio is competitive for someone looking to locally train and tinker with models? Also, out of curiosity, are you using WSL2? Was it possible to set up GPU acceleration with the Intel drivers? I know it’s trivial for NVIDIA cards.

@Carson Unfortunately, I don’t consider training on the Arc cards viable until either the PyTorch-DirectML package gets updated or the main PyTorch and TensorFlow libraries add support. I did not test the TensorFlow-DirectML package, so maybe that’s further along than the PyTorch version.

The tutorial projects I mentioned in Update #1 are projects for the Unity game engine with plugins for OpenVINO and ONNX Runtime with the DirectML execution provider.

OpenVINO is Intel’s optimized inference library. DirectML lets you run deep learning models on any GPU with DirectX12 support in Windows. Both projects run directly in Windows.

The OpenVINO project uses FP16 precision, which can leverage XMX (Intel’s equivalent of Nvidia’s tensor cores). The DirectML project uses FP32 precision.

The PyTorch-DirectML package allows you to use DirectML (i.e., any GPU with DirectX12 support) to train models rather than only for inference. It’s available as a pip package, but the current version is old and still missing many operators.

As mentioned in Update #2, I set up a conda environment in WSL to test training with the PyTorch-DirectML package on the Arc card.

I believe the Arc cards only support Linux with kernel version 6, and my dual-boot Ubuntu-Windows desktop is still on 5.15.

I’m cataloging my experience on my blog, so let me know if there are some missing details you would like added to the initial post.

Bummer, but understandable. Hopefully one day soon we’ll be able to use a few of these cheaper cards for local ML!

As I mentioned earlier in the thread, they are actively working on adding support. So, hopefully, it won’t be too long before the main PyTorch and TensorFlow libraries support the new cards.

They work well for inference (especially with OpenVINO 2022.2), but you still need a separate GPU (either in the cloud or locally) to train for now.

@cjmills This is a really interesting thread. Do you think DirectML will be able to train using PyTorch in 1 or 2 years?

It depends on how much work Microsoft puts into it. I don’t know anyone at Microsoft, so I can’t speak to that. Hopefully, the main PyTorch library will get support for Intel GPUs before long, so DirectML will not be required.


Support for PyTorch on Intel GPUs is now available via the Intel Extension for PyTorch.

Yeah, I have not had a chance to experiment with the new extensions for PyTorch and TensorFlow on the Arc card. I’m hoping to test them out later this week.


I finally set time aside to get the Arc A770 (16GB) working with PyTorch in Ubuntu 22.04. I’ll do more testing tomorrow, but here is a quick ResNet50 benchmark. There were some irritations, but the setup process is not too bad now that I’ve done it. I plan to make a blog post for it, starting from a clean Ubuntu 22.04 install. I’ll test it in WSL2 as well. I still need to see how to view GPU usage in Ubuntu.

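import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex  # needed for the 'xpu' device

model = models.resnet50(pretrained=True)
model.eval()  # switch to inference mode before optimizing
data = torch.rand(1, 3, 224, 224)  # one random 224x224 RGB image

# Move the model and the input to the Intel GPU
model = model.to('xpu')
data = data.to('xpu')
model = ipex.optimize(model)

with torch.no_grad():
    preds = model(data)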
3.48 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
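The timing line above has the shape of notebook %timeit output (7 runs, 100 loops each). Outside a notebook, a similar measurement could be sketched with the standard library as below. The helper name and its `sync` hook are hypothetical, and the workload is a CPU stand-in so the sketch runs anywhere. One caveat that is an assumption rather than something confirmed in this thread: GPU work is often queued asynchronously, so a timer that stops before the device finishes can under-report.

```python
import statistics
import time


def time_call(fn, sync=None, runs=7, loops=100):
    """Time fn() as `runs` repeats of `loops` calls each.

    sync is an optional callable that blocks until queued device work
    has finished (hypothetical hook; leave as None for CPU-only code).
    Returns (mean, stdev) of the per-call time in milliseconds.
    """
    per_call_ms = []
    for _ in range(runs):
        if sync:
            sync()  # start each run from an idle device
        start = time.perf_counter()
        for _ in range(loops):
            fn()
        if sync:
            sync()  # wait for outstanding work before stopping the clock
        elapsed = time.perf_counter() - start
        per_call_ms.append(elapsed / loops * 1000)
    return statistics.mean(per_call_ms), statistics.stdev(per_call_ms)


# Stand-in workload; swap in the model call when running on a GPU.
mean_ms, std_ms = time_call(lambda: sum(range(1000)))
print(f"{mean_ms:.3f} ms ± {std_ms:.3f} ms per loop")
```

For the ResNet50 snippet, `fn` would be something like `lambda: model(data)` inside `torch.no_grad()`, with `sync` set to whatever synchronization call the Intel extension provides for the xpu device (check its documentation for the exact name).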

Inference speed seems basically the same for FP32, FP16, and BF16 right now.

intel_gpu_top provides some usage information.

I did some more testing, including attempting to train a model. Uh, yeah, it’s not ready yet. :face_with_diagonal_mouth:
Even using the sample training code from the documentation for Intel’s PyTorch extension, the loss does not decrease. I verified there was not some issue with the code by running it (minus the Intel GPU bits) on Kaggle, and it worked perfectly. Training is also absurdly slow compared to inference speed.
