Just received my Intel Arc A770 GPU

After waiting several months, I decided to give the Arc card another try on Ubuntu with Intel’s PyTorch extension. While the setup and documentation need refinement, I am happy to report it works now! :grin:

I used the training code from my recent beginner PyTorch tutorial for testing.

Initially, the backward pass during training was incredibly slow. The first epoch took nearly twice as long as on the free GPU tier in Google Colab. Fortunately, the fix involved setting a single environment variable.

After setting IPEX_XPU_ONEDNN_LAYOUT=1, the total training time is within 10% of my Titan RTX on the same system. The gap would be slightly wider if I compiled the model on the Titan with PyTorch 2.0. Intel’s PyTorch extension is still on PyTorch 1.13.
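
For anyone who wants to replicate the fix, here is a minimal sketch of one way to set the variable. Exporting it in the shell before launching the notebook should be equivalent; the assumption here is that it needs to be set before the extension initializes.

```python
import os

# Enable the oneDNN layout for XPU kernels (the fix described above).
# Equivalent to running `export IPEX_XPU_ONEDNN_LAYOUT=1` in the shell
# before launching Jupyter; set it before importing the extension to be safe.
os.environ["IPEX_XPU_ONEDNN_LAYOUT"] = "1"

import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device

print(torch.xpu.is_available())  # should print True on a working Arc setup
```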

The final loss and accuracy values fluctuate slightly, even when using fixed seed values for PyTorch, NumPy, and Python. However, they stay pretty close to the results on my Nvidia GPU.
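
For reference, the seeding pattern is the usual one (a generic sketch, not necessarily the exact code from the tutorial). Even with fixed seeds, some GPU/XPU kernels are non-deterministic, which is consistent with the small run-to-run fluctuations.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 1234) -> None:
    """Fix the Python, NumPy, and PyTorch RNG seeds for (mostly) repeatable runs."""
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy RNG
    torch.manual_seed(seed)  # PyTorch RNG (CPU and, where supported, device)


set_seed(1234)
```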

Here is a screenshot of the training session with the Arc A770:

Here is a link to the training session with the Titan RTX.

Epochs: 100%|█████████| 3/3 [11:15<00:00, 224.96s/it]
Train: 100%|██████████| 4324/4324 [03:29<00:00, 21.75it/s, accuracy=0.894, avg_loss=0.374, loss=0.0984, lr=0.000994]
Eval: 100%|██████████| 481/481 [00:17<00:00, 50.42it/s, accuracy=0.975, avg_loss=0.081, loss=0.214, lr=]
Train: 100%|██████████| 4324/4324 [03:28<00:00, 22.39it/s, accuracy=0.968, avg_loss=0.105, loss=0.0717, lr=0.000462]
Eval: 100%|██████████| 481/481 [00:16<00:00, 55.14it/s, accuracy=0.988, avg_loss=0.0354, loss=0.02, lr=]
Train: 100%|██████████| 4324/4324 [03:28<00:00, 21.94it/s, accuracy=0.99, avg_loss=0.0315, loss=0.00148, lr=4.03e-9]
Eval: 100%|██████████| 481/481 [00:16<00:00, 53.87it/s, accuracy=0.995, avg_loss=0.0173, loss=0.000331, lr=]

There is much more testing to do, but I think it’s at a point where I feel comfortable making a tutorial for Ubuntu.


I just confirmed it also works on WSL.

Total training time in WSL is ≈34% slower than in native Ubuntu, with the dataset stored on the same virtual hard disk.

Here is a screenshot of the GPU usage when running the training code on the Arc card in WSL.

There is an additional ≈20% increase in training time when using a dataset on a drive other than the virtual hard disk that stores the WSL-Ubuntu install.

That, plus the hassle of setting permissions for folders outside the virtual hard drive, means you should probably only do that if your C drive has limited space.

Thanks for quantifying this performance hit.

I’ve seen a performance drop of 30% or more (especially with many small files) when I moved data to another drive outside of the WSL file system due to space limits.

Ultimately, the best solution in that case is to move the whole WSL installation to another drive with more space, so that you can keep all the data within WSL.

Microsoft has a specific note about this in the WSL docs: “For the fastest performance speed, store your files in the WSL file system.”

I don’t think people realise just how large the performance difference is; the docs should quantify it. For ML use cases, a 20–30% hit on training time is significant, whereas loading a Word document is probably not such an issue.

Yeah, the cumulative performance hit of using WSL is one of the reasons why I stick to dual-booting on my desktop. That and the other headaches I’ve encountered.

@cjmills Awesome, thanks Christian.
Waiting for the tutorial.

Hi, thanks a loooooooooot for your work!!! I’ve been looking for this kind of comparison for several months.

I do not know if you updated your post because you found the tutorial, but here is a link, just in case.

It includes a link to the jupyter notebook with the modified training code.

Thanks a lot! Great testing for this series. Hope ipex 2.0 for xpu will work better!

I’ve been testing the latest release of Intel’s PyTorch extension on native Ubuntu and Windows, and I wanted to share my initial findings here before writing the blog post.

Native Ubuntu

First, I tested performance with the image classification notebook I used previously. Training time on Ubuntu was within six seconds of the time with the previous version of the extension (1.13.120+xpu). The final validation accuracy was identical.

Next, I tested the inference speed for Stable Diffusion 2.1 with the Hugging Face Diffusers notebook I used in this post. Inference speed when using bfloat16 is approximately 25% faster than with the previous version of Intel’s PyTorch extension.

Using float16 has the same inference speed, but the model produces NaN values. The torch.compile() method seems to expect CUDA to be enabled, and the compiled model throws an error when I try to use it.
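
For context, the bfloat16 test boils down to something like the sketch below (a simplified stand-in for the notebook code; the prompt and step count are arbitrary):

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion 2.1 in bfloat16 (the float16 variant produced NaNs for me).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("xpu")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("astronaut.png")
```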

Last, I tried to run the training notebook for my recent YOLOX object detection tutorial. This notebook was the only one that did not work as expected. First, I had to replace some view operations in the loss function with reshape operations to handle non-contiguous data.
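
The change itself is small; the snippet below is a generic illustration of the substitution (not the actual YOLOX loss code):

```python
import torch

x = torch.randn(2, 3, 4)
y = x.permute(0, 2, 1)  # permute returns a non-contiguous tensor

# y.view(2, -1) would raise:
# RuntimeError: view size is not compatible with input tensor's size and stride ...
z = y.reshape(2, -1)  # reshape copies when it cannot return a view, so it works here
```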

The training code ran with those changes, but the loss decreased much more slowly than on Nvidia GPUs and never reached usable performance. I tested inference performance with model checkpoints trained on my Nvidia GPU and got identical inference predictions, so the issue does not appear to be with the model itself. The training code also achieved usable accuracy when using the CPU, so it might just be a bug with the extension.

Training time was about 11 minutes for a single pass through the training set on the Arc GPU. For reference, the same pass takes about 2 minutes on an RTX 4090 (my Titan RTX died a while ago).

I have not attempted to compile the extension from the source code to see if that provides different results.

Native Windows

Getting the extension to work on native Windows was a bit of a hassle, but the process is not too bad now that I know the steps. Most of the frustration came from not knowing I needed to disable the iGPU in Windows for the extension to find the Arc GPU.
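
For anyone hitting the same issue, here is the kind of quick sanity check I would use to confirm the extension can actually see the Arc card (assuming the XPU build of the extension is installed):

```python
import torch
import intel_extension_for_pytorch as ipex  # adds the torch.xpu namespace

print(torch.xpu.is_available())      # should be True once the Arc GPU is visible
print(torch.xpu.device_count())      # number of XPU devices the extension detects
print(torch.xpu.get_device_name(0))  # e.g., the Arc A770 rather than the iGPU
```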

Fortunately, those initial frustrations were worth it, as the extension works quite well on native Windows.

The total training time for the image classification notebook was slower than native Ubuntu but faster than WSL. That’s about as well as I could expect, given that PyTorch on native Windows tends to be slower than on Ubuntu and Python multiprocessing takes longer to start on Windows.

I needed to replace the same view operations with reshape operations in the loss function for the YOLOX training code on Windows. However, this time, the notebook produced a model that was comparably accurate to one trained on Nvidia GPUs. I have no idea why the Windows version of the extension works when the Ubuntu version does not.

Total training time was a bit slower than native Ubuntu but still much faster than the free tier of Google Colab.

To my surprise, the Stable Diffusion inference notebook was also about 25% faster than on Ubuntu.

I’ll see how much I can streamline the setup process for Windows before making a tutorial. The oneAPI toolkit takes up quite a bit of space.

I might also try compiling the Ubuntu version to see if that resolves the issues with the YOLOX training code.


Well, this is frustrating. I was about to wrap up my tutorial for setting up the extension on Windows and decided to test the installation steps by uninstalling everything and starting from scratch.

The installation process worked as expected, but now I get the same behavior for the YOLOX training code as in Ubuntu. Also, the Stable Diffusion inference notebook is about 1 it/s slower than previously.

I’m now wondering if I had something installed before I originally installed the extension that caused the different behavior.

Is that on a system with an Intel or AMD CPU? @cjmills

An Intel i7-11700K, specifically.


@cjmills Thanks! I’m trying to figure out if I can use intel_extension_for_pytorch with the A770 and an AMD CPU or if the Arc GPU and ipex need an Intel CPU. So far, I couldn’t find any answers.

How did the WSL2 test go? I saw that the Arc A-series discrete graphics family does not support GPU virtualization technology.
https://www.intel.com/content/www/us/en/support/articles/000093216/graphics/processor-graphics.html

Doesn’t that prevent the A770 from being available under WSL2 Ubuntu? @cjmills


Wow, I didn’t expect such a huge performance gap. So virtualization actually works, contrary to what Intel says? @cjmills

@cjmills Was your monitor connected to the UHD graphics in both cases?

I have not explored it beyond WSL.

That performance gap is not unique to the Arc card, BTW; it’s an issue with WSL that has been around since I first tested WSL for deep learning projects in 2020.

For the best performance, I recommend native Linux, then native Windows, then WSL as the last option.


Misread that; it was connected to the Arc card in all testing.