Just received my Intel Arc A770 GPU

Yeah, the cumulative performance hit of using WSL is one of the reasons why I stick to dual-booting on my desktop. That and the other headaches I’ve encountered.

@cjmills Awesome, thanks Christian! Waiting for the tutorial.

Hi, thanks a lot for your work!!! I've been looking for this kind of comparison for several months.

I do not know if you updated your post because you found the tutorial, but here is a link, just in case.

It includes a link to the Jupyter notebook with the modified training code.

Thanks a lot! Great testing for this series. I hope IPEX 2.0 for XPU will work better!

I’ve been testing the latest release of Intel’s PyTorch extension on native Ubuntu and Windows, and I wanted to share my initial findings here before writing the blog post.

Native Ubuntu

First, I tested performance with the image classification notebook I used previously. Training time on Ubuntu was within six seconds of the time with version 1.13.120+xpu of the extension. The final validation accuracy was identical.
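
For context, here is roughly how the extension gets used for training. This is a minimal sketch following the usual ipex.optimize / torch.xpu workflow, with a dummy model and dummy data standing in for the notebook's actual model and dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch

# Dummy model and data as stand-ins for the notebook's actual model and dataset
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 10)).to("xpu")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = DataLoader(
    TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 10, (64,))),
    batch_size=16,
)

# ipex.optimize applies the extension's XPU optimizations to the model/optimizer pair
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

model.train()
for images, labels in loader:
    images, labels = images.to("xpu"), labels.to("xpu")
    # Mixed-precision forward pass in bfloat16 on the Arc GPU
    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
        loss = torch.nn.functional.cross_entropy(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```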

Next, I tested the inference speed for Stable Diffusion 2.1 with the Hugging Face Diffusers notebook I used in this post. Inference speed when using bfloat16 is approximately 25% faster than with the previous version of Intel's PyTorch extension.
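
The bfloat16 inference test boils down to something like the snippet below. It's a simplified sketch rather than the exact notebook code; the prompt and step count are arbitrary:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the "xpu" device)
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion 2.1 in bfloat16 and move the whole pipeline to the Arc GPU
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.bfloat16
)
pipe = pipe.to("xpu")

# Arbitrary prompt and step count, just to time the denoising loop
image = pipe("a photo of an astronaut riding a horse", num_inference_steps=50).images[0]
image.save("astronaut.png")
```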

Using float16 gives the same inference speed, but the model produces NaN values. The torch.compile() method seems to expect CUDA to be enabled, and the compiled model throws an error when I try to use it.

Last, I tried to run the training notebook for my recent YOLOX object detection tutorial. This notebook was the only one that did not work as expected. First, I had to replace some view operations in the loss function with reshape operations to handle non-contiguous data.
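
The change itself is small: view() requires contiguous memory, while reshape() falls back to a copy when the tensor is not contiguous. A toy illustration (not the actual YOLOX loss code):

```python
import torch

# Transposing produces a non-contiguous tensor, like some intermediates in the loss function
x = torch.randn(8, 4, 16).transpose(1, 2)

# Before: view() raises a RuntimeError because the data is not contiguous
# y = x.view(8, -1)

# After: reshape() copies the data when needed instead of erroring out
y = x.reshape(8, -1)
```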

The training code ran with those changes, but the loss decreased much more slowly than on Nvidia GPUs and never reached usable performance. I tested inference performance with model checkpoints trained on my Nvidia GPU and got identical inference predictions, so the issue does not appear to be with the model itself. The training code also achieved usable accuracy when using the CPU, so it might just be a bug with the extension.

Training time was about 11 minutes for a single pass through the training set on the Arc GPU. For reference, the same pass takes about 2 minutes on an RTX 4090 (my Titan RTX died a while ago).

I have not attempted to compile the extension from the source code to see if that provides different results.

Native Windows

Getting the extension to work on native Windows was a bit of a hassle, but the process is not too bad now that I know the steps. Most of the frustration came from not knowing I needed to disable the iGPU in Windows for the extension to find the Arc GPU.
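
In case anyone hits the same issue, a quick way to verify the extension can see the Arc card is to list the XPU devices it detects. A minimal check using the torch.xpu helpers the extension adds:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (adds the torch.xpu namespace)

# The A770 should show up here; with the iGPU enabled, it did not on my machine
print(torch.xpu.is_available())
for i in range(torch.xpu.device_count()):
    print(i, torch.xpu.get_device_name(i))
```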

Fortunately, those initial frustrations were worth it, as the extension works quite well on native Windows.

The total training time for the image classification notebook was slower than on native Ubuntu but faster than in WSL. That's about as well as I could expect, given that PyTorch on native Windows tends to be slower than on Ubuntu, and Python multiprocessing takes longer to start on Windows.

I needed to replace the same view operations with reshape operations in the loss function for the YOLOX training code on Windows. However, this time, the notebook produced a model that was comparably accurate to one trained on Nvidia GPUs. I have no idea why the Windows version of the extension works when the Ubuntu version does not.

Total training time was a bit slower than native Ubuntu but still much faster than the free tier of Google Colab.

The Stable Diffusion inference notebook, also to my surprise, was about 25% faster than on Ubuntu.

I’ll see how much I can streamline the setup process for Windows before making a tutorial. The oneAPI toolkit takes up quite a bit of space.

I might also try compiling the Ubuntu version to see if that resolves the issues with the YOLOX training code.

Well, this is frustrating. I was about to wrap up my tutorial for setting up the extension on Windows and decided to test the installation steps by uninstalling everything and starting from scratch.

The installation process worked as expected, but now I get the same behavior for the YOLOX training code as in Ubuntu. Also, the Stable Diffusion inference notebook is about 1 it/s slower than previously.

I’m now wondering if I had something installed before I originally installed the extension that caused the different behavior.

Is that on a system with an Intel or AMD CPU? @cjmills

An Intel i7-11700K, specifically.

@cjmills Thanks! I'm trying to figure out if I can use intel_extension_for_pytorch with the A770 and an AMD CPU, or if the Arc GPU and ipex need an Intel CPU. So far, I haven't found any answers.

How did the WSL2 test go? I saw that the Arc A-series discrete graphics family does not support GPU virtualization technology.
https://www.intel.com/content/www/us/en/support/articles/000093216/graphics/processor-graphics.html

Doesn’t that prevent the A770 from being available under WSL2 Ubuntu? @cjmills

Wow, I didn’t expect such a huge performance gap. So virtualization actually works, contrary to what Intel says? @cjmills

@cjmills Was your monitor connected to the UHD graphics in both cases?

I have not explored it beyond WSL.

That performance gap is not unique to the Arc card, BTW; it's an issue with WSL that has been around since I first tested WSL for deep learning projects in 2020.

For the best performance, I recommend native Linux, then native Windows, then WSL as the last option.

Misread that; the monitor was connected to the Arc card in all testing.

I’m late to the WSL game; the first time I tested it was last week. I think it’s super convenient. Too bad the performance suffers more than I was hoping. I only have one machine right now, and dual-booting isn’t really my preferred choice; I still rely on some Windows software.

In that case:

Do you think it would make a difference to connect the display output to the integrated UHD graphics instead and have the Arc GPU available purely as a compute device?

Probably not for deep learning tasks, as the card has dedicated hardware for tensor operations. Got to go now.