Just received my Intel Arc A770 GPU

After waiting several months, I decided to give the Arc card another try on Ubuntu with Intel’s PyTorch extension. While the setup and documentation need refinement, I am happy to report it works now! :grin:

I used the training code from my recent beginner PyTorch tutorial for testing.

Initially, the backward pass during training was incredibly slow. The first epoch took nearly twice as long as on the free GPU tier in Google Colab. Fortunately, the fix involved setting a single environment variable.

After setting IPEX_XPU_ONEDNN_LAYOUT=1, the total training time is within 10% of my Titan RTX on the same system. The gap would be slightly wider if I compiled the model on the Titan with PyTorch 2.0. Intel’s PyTorch extension is still on PyTorch 1.13.
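
For anyone who wants to replicate the fix, here is a minimal sketch of one way to set the variable. Exporting it in the shell before launching the notebook should be equivalent; the assumption here is that it needs to be set before the extension initializes.

```python
import os

# Enable the oneDNN layout for XPU kernels (the fix described above).
# Equivalent to running `export IPEX_XPU_ONEDNN_LAYOUT=1` in the shell
# before launching Jupyter; set it before importing the extension to be safe.
os.environ["IPEX_XPU_ONEDNN_LAYOUT"] = "1"

import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device

print(torch.xpu.is_available())  # should print True on a working Arc setup
```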

The final loss and accuracy values fluctuate slightly, even when using fixed seed values for PyTorch, NumPy, and Python. However, they stay pretty close to the results on my Nvidia GPU.
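
For reference, the seeding pattern is the usual one (a generic sketch, not necessarily the exact code from the tutorial). Even with fixed seeds, some GPU/XPU kernels are non-deterministic, which is consistent with the small run-to-run fluctuations.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 1234) -> None:
    """Fix the Python, NumPy, and PyTorch RNG seeds for (mostly) repeatable runs."""
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy RNG
    torch.manual_seed(seed)  # PyTorch RNG (CPU and, where supported, device)


set_seed(1234)
```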

Here is a screenshot of the training session with the Arc A770:

Here is a link to the training session with the Titan RTX.

Epochs: 100%|█████████| 3/3 [11:15<00:00, 224.96s/it]
Train: 100%|██████████| 4324/4324 [03:29<00:00, 21.75it/s, accuracy=0.894, avg_loss=0.374, loss=0.0984, lr=0.000994]
Eval: 100%|██████████| 481/481 [00:17<00:00, 50.42it/s, accuracy=0.975, avg_loss=0.081, loss=0.214, lr=]
Train: 100%|██████████| 4324/4324 [03:28<00:00, 22.39it/s, accuracy=0.968, avg_loss=0.105, loss=0.0717, lr=0.000462]
Eval: 100%|██████████| 481/481 [00:16<00:00, 55.14it/s, accuracy=0.988, avg_loss=0.0354, loss=0.02, lr=]
Train: 100%|██████████| 4324/4324 [03:28<00:00, 21.94it/s, accuracy=0.99, avg_loss=0.0315, loss=0.00148, lr=4.03e-9]
Eval: 100%|██████████| 481/481 [00:16<00:00, 53.87it/s, accuracy=0.995, avg_loss=0.0173, loss=0.000331, lr=]

There is much more testing to do, but I think it’s at a point where I feel comfortable making a tutorial for Ubuntu.


I just confirmed it also works on WSL.

Total training time in WSL is ≈34% slower than in native Ubuntu, with the dataset stored on the same virtual hard disk.

Here is a screenshot of the GPU usage when running the training code on the Arc card in WSL.

There is an additional ≈20% increase in training time when using a dataset on a drive other than the virtual hard disk that stores the WSL-Ubuntu install.

That, plus the hassle of setting permissions for folders outside the virtual hard drive, means you should probably only do that if your C drive has limited space.

Thanks for quantifying this performance hit.

I’ve seen a performance drop of 30% or more (especially with many small files) when I moved data to another drive outside of the WSL file system due to space limits.

Ultimately, the best solution in that case is to move the whole WSL installation to another drive with more space, so that you can keep all the data within WSL.

Microsoft has a specific note about this in the WSL docs: “For the fastest performance speed, store your files in the WSL file system.”

I don’t think people realise just how large the performance difference is; the docs should quantify it. For ML use cases, a 20–30% hit on training time is significant, whereas loading a Word document is probably not such an issue.

Yeah, the cumulative performance hit of using WSL is one of the reasons why I stick to dual-booting on my desktop. That and the other headaches I’ve encountered.

@cjmills Awesome, thanks Christian.
Waiting for the tutorial.

Hi, thanks a loooooooooot for your work!!! I’ve been looking for this kind of comparison for several months.

I do not know if you updated your post because you found the tutorial, but here is a link, just in case.

It includes a link to the jupyter notebook with the modified training code.

Thanks a lot! Great testing for this series. Hope ipex 2.0 for xpu will work better!

I’ve been testing the latest release of Intel’s PyTorch extension on native Ubuntu and Windows, and I wanted to share my initial findings here before writing the blog post.

Native Ubuntu

First, I tested performance with the image classification notebook I used previously. Training time on Ubuntu was within six seconds of the time with the previous version of the extension (1.13.120+xpu). The final validation accuracy was identical.

Next, I tested the inference speed for Stable Diffusion 2.1 with the Hugging Face Diffusers notebook I used in this post. Inference speed when using bfloat16 is approximately 25% faster than with the previous version of Intel’s PyTorch extension.

Using float16 has the same inference speed, but the model produces NaN values. The torch.compile() method seems to expect CUDA to be enabled, and the compiled model throws an error when I try to use it.
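
For context, the bfloat16 test boils down to something like the sketch below (a simplified stand-in for the notebook code; the prompt and step count are arbitrary):

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion 2.1 in bfloat16 (the float16 variant produced NaNs for me).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("xpu")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("astronaut.png")
```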

Last, I tried to run the training notebook for my recent YOLOX object detection tutorial. This notebook was the only one that did not work as expected. First, I had to replace some view operations in the loss function with reshape operations to handle non-contiguous data.
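
The change itself is small; the snippet below is a generic illustration of the substitution (not the actual YOLOX loss code):

```python
import torch

x = torch.randn(2, 3, 4)
y = x.permute(0, 2, 1)  # permute returns a non-contiguous tensor

# y.view(2, -1) would raise:
# RuntimeError: view size is not compatible with input tensor's size and stride ...
z = y.reshape(2, -1)  # reshape copies when it cannot return a view, so it works here
```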

The training code ran with those changes, but the loss decreased much more slowly than on Nvidia GPUs and never reached usable performance. I tested inference performance with model checkpoints trained on my Nvidia GPU and got identical inference predictions, so the issue does not appear to be with the model itself. The training code also achieved usable accuracy when using the CPU, so it might just be a bug with the extension.

Training time was about 11 minutes for a single pass through the training set on the Arc GPU. For reference, the same pass takes about 2 minutes on an RTX 4090 (my Titan RTX died a while ago).

I have not attempted to compile the extension from the source code to see if that provides different results.

Native Windows

Getting the extension to work on native Windows was a bit of a hassle, but the process is not too bad now that I know the steps. Most of the frustration came from not knowing I needed to disable the iGPU in Windows for the extension to find the Arc GPU.
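
For anyone hitting the same issue, here is the kind of quick sanity check I would use to confirm the extension can actually see the Arc card (assuming the XPU build of the extension is installed):

```python
import torch
import intel_extension_for_pytorch as ipex  # adds the torch.xpu namespace

print(torch.xpu.is_available())      # should be True once the Arc GPU is visible
print(torch.xpu.device_count())      # number of XPU devices the extension detects
print(torch.xpu.get_device_name(0))  # e.g., the Arc A770 rather than the iGPU
```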

Fortunately, those initial frustrations were worth it, as the extension works quite well on native Windows.

The total training time for the image classification notebook was slower than native Ubuntu but faster than WSL. That’s about as well as I could expect, given that PyTorch on native Windows tends to be slower than on Ubuntu and Python multiprocessing takes longer to start on Windows.

I needed to replace the same view operations with reshape operations in the loss function for the YOLOX training code on Windows. However, this time, the notebook produced a model that was comparably accurate to one trained on Nvidia GPUs. I have no idea why the Windows version of the extension works when the Ubuntu version does not.

Total training time was a bit slower than native Ubuntu but still much faster than the free tier of Google Colab.

To my surprise, the Stable Diffusion inference notebook was also about 25% faster than on Ubuntu.

I’ll see how much I can streamline the setup process for Windows before making a tutorial. The oneAPI toolkit takes up quite a bit of space.

I might also try compiling the Ubuntu version to see if that resolves the issues with the YOLOX training code.


Well, this is frustrating. I was about to wrap up my tutorial for setting up the extension on Windows and decided to test the installation steps by uninstalling everything and starting from scratch.

The installation process worked as expected, but now I get the same behavior for the YOLOX training code as in Ubuntu. Also, the Stable Diffusion inference notebook is about 1 it/s slower than previously.

I’m now wondering if I had something installed before I originally installed the extension that caused the different behavior.

Is that on a system with an Intel or AMD CPU? @cjmills

An Intel i7-11700K, specifically.


@cjmills Thanks! I’m trying to figure out if I can use intel_extension_for_pytorch with the A770 and an AMD CPU or if the Arc GPU and ipex need an Intel CPU. So far, I couldn’t find any answers.

How did the WSL2 test go? I saw that the Arc A-series discrete graphics family does not support GPU virtualization technology.
https://www.intel.com/content/www/us/en/support/articles/000093216/graphics/processor-graphics.html

Doesn’t that prevent the A770 from being available under WSL2 Ubuntu? @cjmills


Wow, I didn’t expect such a huge performance gap. So virtualization actually works, contrary to what Intel says? @cjmills

@cjmills Was your monitor connected to the UHD graphics in both cases?

I have not explored it beyond WSL.

That performance gap is not unique to the Arc card, BTW; it’s an issue with WSL that has been around since I first tested WSL for deep learning projects in 2020.

For the best performance, I recommend native Linux, then native Windows, then WSL as the last option.


Misread that; it was connected to the Arc card in all testing.