For those who run their own AI box, or want to

I tend to run fastai (plus other CUDA-related libs, etc.) and pretty much all my projects via Docker containers (with NVIDIA container support). I know the course recommends NOT spending much time “fighting” a local setup, and if you’re starting out I’d HIGHLY recommend following that advice.

However, if anybody wants any Docker + nvidia-container-toolkit help/discussion in terms of Dockerfiles/compose etc., then I can try to share what works for me (so far). Do note that I’m on a slightly non-standard Linux distro (NixOS), but the majority of it should work on any Linux host when the CUDA driver versions and libraries are matched.

Once again though, I would recommend NOT looking into this now unless you’re already fairly comfortable with Docker. The options listed at the top of this thread will make you much more productive instead.

I’ve documented the steps that work for me here (Docker + nvidia-container-toolkit + optionally NixOS).
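
If it helps anyone, here is the minimal sanity check I run from inside a freshly started container (a sketch, assuming the image ships PyTorch and the container was started with GPU access, e.g. `docker run --gpus all ...` or the compose equivalent):

```python
# check_gpu.py - run inside the container; assumes PyTorch is installed in the image
import torch

print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    # device name and compute capability as exposed through nvidia-container-toolkit
    print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
```

If this prints False or no devices, the usual suspects are a missing `--gpus` setting or a host driver / container CUDA mismatch.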

8 Likes

I’ve been using Pop OS 22.04 (beta) for a while as my production machine. It will be automatically upgraded to the full-fledged 22.04 release within a few days.
Everything works (and worked) without any hassle. Nvidia drivers, miniconda, docker, etc…

Why Pop and not Ubuntu? Because it’s a bit more stable (just my personal experience, of course) and polished. Ubuntu flavors are all ok, but there are always minor glitches.

One additional point regards Alder Lake graphics. If you have a 12th-gen Intel CPU like me and want to use the iGPU so as to leave the NVIDIA GPU(s) alone, you need kernel 5.16 or newer.
Ubuntu 22.04 ships 5.15, while Pop ships 5.16.
Of course you can always install an Ubuntu mainline kernel (up to the 5.18 RCs, from bare .debs), but then the burden of periodically updating the kernel will be on you.

6 Likes

I took one for the team :sweat_smile: and installed Ubuntu 22.04 LTS in dual boot with Windows 11. I ran into many problems installing fastai with fastchan, so I made a new environment where I installed the following, in this order:
1- PyTorch 1.10.2
2- pip install fastai
3- conda install Jupyter notebook

The order mattered a lot; now, everything is working correctly.

I am using the GPU driver and CUDA installation that ship with Ubuntu; I didn’t have to install or update anything. (When you first launch Ubuntu and type nvidia-smi in the terminal, you will see that driver version 510 is already installed and CUDA version 11.4 is reported.)
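
As a quick sanity check that the pieces line up in such an environment, this is roughly what I’d run (a sketch; the exact numbers will differ per machine). Note that the CUDA version shown by nvidia-smi is the maximum the driver supports, which need not match the CUDA runtime bundled with the PyTorch wheel:

```python
import torch
import fastai

print("torch:", torch.__version__)                      # e.g. 1.10.2
print("torch built against CUDA:", torch.version.cuda)  # runtime bundled with the wheel
print("fastai:", fastai.__version__)
print("GPU usable:", torch.cuda.is_available())
```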

With my 3080 GPU, all the models in the intro notebook from fastbook trained fine. However, the IMDB classification model was extremely slow relative to the others, and I had to reduce the batch size to bs=16 to get around the CUDA out-of-memory error.
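
For reference, the batch size can be passed straight into the DataLoaders for that example; a minimal sketch following the fastbook chapter 1 API (adjust bs to whatever your VRAM tolerates):

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)
# smaller bs to avoid CUDA OOM on this notebook's text classifier
dls = TextDataLoaders.from_folder(path, valid='test', bs=16)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)
```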

7 Likes

Thanks for sharing the details about your setup @falmerbid, I’m thinking of upgrading to 22.04 so this is really helpful.

I’d also be interested in your IMDB training times. I find that IMDB training is usually slower compared to the other examples in that notebook. I also had to set the batch size to 16. Later on I noticed I could get up to 24 on my 1070 Ti (8 GB) … With fp16, I could go up to bs=32 IIRC.

1 Like

Below are my per-epoch training times with standard-precision (fp32) floating point… I still need to learn more about the half-precision floating point (fp16) concept and its trade-offs.

| epoch | train_loss | valid_loss | accuracy | time  |
|-------|------------|------------|----------|-------|
| 0     | 0.317577   | 0.330032   | 0.851960 | 04:56 |
| 1     | 0.244434   | 0.206753   | 0.918880 | 05:01 |
| 2     | 0.191029   | 0.183698   | 0.930560 | 04:58 |
| 3     | 0.153620   | 0.188571   | 0.931840 | 04:54 |
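
In case it helps with the fp16 question: in fastai, mixed precision is a one-liner on the Learner. A sketch (not verified on this exact setup), with the larger batch size that fp16 typically allows:

```python
from fastai.text.all import *

dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test', bs=32)
# .to_fp16() switches training to mixed precision with dynamic loss scaling
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()
learn.fine_tune(4, 1e-2)
```

The main trade-off is numerical range; in practice the loss scaling handles that, and accuracy usually stays very close to fp32 while memory use and epoch times drop.
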
2 Likes

Thanks @falmerbid , I posted some of my fp16 numbers in this thread below:

2 Likes

Thanks so much for trying it out for all of us! :pray:

Maybe I’m wrong, but I think you can auto-install the drivers but not CUDA. Has that changed recently? Did you have to configure anything separately (while installing) to make this work?

2 Likes

Thank you Jeremy. The reason I still install CUDA separately: I had read (IIRC it was Ross) that installing CUDA separately and compiling from source gives slightly better performance on the latest GPU architectures (Ampere, the RTX 3000 series, etc.).

The other suggestion was to use NGC containers for the best speed; I run my longer experiments inside those containers.

3 Likes

I have observed the same. We compared different approaches before settling on NGC at jarvislabs.ai, as it was the fastest.

3 Likes

That’s very interesting, I didn’t know that NGC containers would be fastest. Thanks Vishnu and Sanyam!

3 Likes

An interesting thing…

In my vanilla fastai conda env all the stuff works flawlessly. But in the fastbook environment (pip install fastbook), a warning about CUDA capabilities is issued, and as a result the GPU won’t be used.

Like I said above, this doesn’t happen with a plain fastai env.
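
When that capability warning shows up, a quick way to see what is going on is to compare the compute capabilities the installed PyTorch build was compiled for against what the GPU reports; a diagnostic sketch (nothing fastbook-specific):

```python
import torch

print("torch:", torch.__version__, "| built against CUDA:", torch.version.cuda)
print("compiled arch list:", torch.cuda.get_arch_list())   # e.g. [... 'sm_86']
if torch.cuda.is_available():
    # an Ampere card such as a 3080/3090 reports (8, 6), i.e. sm_86
    print("device capability:", torch.cuda.get_device_capability(0))
```

If the device’s sm_XX is missing from the compiled list, that warning is what you get, and it usually means the environment pulled in a wheel built before that architecture was supported.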

Anyhow, as a reference (GPU capped at 250 W, because it’s way quieter):

[screenshot: training times]

FP16:

[screenshot: FP16 training times]

Keep in mind that NVIDIA uses ONLY Docker containers on its very, very expensive DGX systems. Of course you can work on bare metal, but such a practice is discouraged by NVIDIA (and if something doesn’t work as expected, their technical support will give you a reprimand :slight_smile:).

4 Likes

Oh, it’s not only DGX; they can be run on any of the NGC-certified servers available from makers like Dell, Asus, Supermicro, and more.

We have been using them for the last 2+ years without facing such challenges, and we can modify them easily. The creator of timm has also been using NGC containers, judging by his tweets.

Maybe we just got lucky :smile:.

4 Likes

Of course. But you can run them on any machine whatsoever, not only NGC-certified servers.
I cited DGX systems to highlight the fact that NVIDIA itself, on its own systems, prefers not to work directly on bare metal for AI (and other) workloads.

5 Likes

Oh, I get your point. Yes, you can run them anywhere, but only as containers.

2 Likes

@VishnuSubramanian I would love to team up with you (or @balnazzar) on this, since the three of us have access to a bit of hardware.

These are some things I’ve always wanted to benchmark with actual numbers, for practical knowledge:

  • How compiling PyTorch from source vs. a conda install affects performance
  • The same for NGC vs. other containers
  • The effect of a SATA SSD vs. a premium M.2 drive (Samsung) vs. a cheaper M.2 drive (WD Black) on performance (rough sketch further below)
  • The effect of running GPUs on x4 vs. x8 vs. x16 lanes

Please let me know if any of these are of interest to you; these are questions where I’ve wanted to check how much they actually change things.

E.g.: ideally you should always get an M.2 drive, but what if you find a great deal on a 4 TB SATA SSD? What are you losing compared to a top-end 512 GB Samsung M.2? :sweat_smile:
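
For the storage comparison, a rough micro-benchmark sketch of the access pattern that tends to matter for DL, i.e. many small random reads. Point it at the same folder of small files copied onto each drive (the mount points below are placeholders):

```python
import random
import time
from pathlib import Path

def random_read_bench(folder, n_reads=2000, seed=0):
    """Time n_reads randomly chosen file reads under `folder` (drop the page cache first for a fair test)."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    random.seed(seed)
    sample = random.choices(files, k=n_reads)
    t0, total_bytes = time.perf_counter(), 0
    for p in sample:
        total_bytes += len(p.read_bytes())
    dt = time.perf_counter() - t0
    print(f"{folder}: {n_reads/dt:.0f} files/s, {total_bytes/dt/1e6:.0f} MB/s")

# same unpacked dataset copied to each drive (hypothetical mount points)
random_read_bench("/mnt/sata_ssd/imagenette")
random_read_bench("/mnt/nvme/imagenette")
```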

1 Like

I think we should be able to do these two easily.

  • NGC vs. PyTorch vs. any other containers
  • NVMe vs. SATA SSD

Regarding the others, we can discuss how to do them.

1 Like

Sure! :slight_smile:

I don’t have any SATA drives, but I’ve got a bunch of NVMe ones. Apart from the usual ‘speedy’ consumer drives in M.2 format, I have two Samsung Z-drives.
These units are quite interesting because they are SLC drives and beat any consumer drive in random throughput and random latency, which is what really matters in DL workloads (see below). That is, they behave like something in between normal NVMe SSDs and DRAM modules.
It will be interesting to compare them with other consumer drives of mine, for example the 980 Pro or the FireCuda 530 (both in the 2 TB size), or with your big SATA units.
We should take a bit of care in selecting the right workloads, though. If you already have notebooks/datasets that you used to benchmark your SATA drives and that involve lots of little data points (typically NLP and tabular stuff) rather than a relatively small number of big data points (i.e. images), that would be ideal for seeing how such low-latency units behave.

1 Like

Other considerations:

I never compiled PyTorch from source, but there shouldn’t be any appreciable difference.

I did that back in 2018, and although I don’t remember the exact figures, the containerized environment was a bit faster than bare metal. Nothing dramatic, though.
The main advantage of Docker, however, is that you don’t need to meddle with drivers/libraries at the system level, and if you screw something up you won’t do any serious damage.
A conda env is something in between.

I actually did that, just out of curiosity, with two 3090s. I tried setting the slots to Gen3 x4 (as opposed to x16) and it showed little or no difference (but note that the GPUs were NVLinked).
Consider that, having phased out my dual-3090 system, I now use only one GPU. I can benchmark by reducing the number of lanes and switching the slot between Gen2/3/4, but it would be a lot less interesting compared to a multi-GPU setup. For the record, Tim Dettmers found that with a single GPU, even going from x16 to x4 (Gen3) amounted to a 1–2% difference in performance.
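
If you want to put a number on the lane question, a minimal sketch that measures host-to-device copy bandwidth with plain PyTorch (pinned memory so the transfer reflects the PCIe link rather than pageable-memory overhead; buffer size and iteration count are arbitrary):

```python
import time
import torch

def h2d_bandwidth_gib_s(size_mib=1024, iters=10):
    # pinned host buffer and a preallocated device buffer, so we time only the PCIe copy
    src = torch.empty(size_mib * 1024 * 1024, dtype=torch.uint8).pin_memory()
    dst = torch.empty_like(src, device="cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src, non_blocking=True)
    torch.cuda.synchronize()
    return size_mib * iters / 1024 / (time.perf_counter() - t0)

# Gen3 x16 usually lands around ~12 GiB/s in practice; x4 around a quarter of that
print(f"host->device: ~{h2d_bandwidth_gib_s():.1f} GiB/s")
```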

One interesting benchmark you could perform @init_27, since you have the GPUs, would be pitting a single A6000 against two 3090s, of course selecting a batch size that exceeds the VRAM of a single 3090.
Also, try to lure NVIDIA into giving you a new 3090 Ti specimen :smiley: :wink:

Other stuff doable:

  • RAPIDS vs. bare CPU (various kinds of preprocessing steps; quick sketch below)
  • MKL vs. DALI (augmentation)
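
For the RAPIDS item, a minimal sketch of the kind of comparison I’d start with (hypothetical CSV and column names; assumes a RAPIDS install so that cuDF is importable):

```python
import time
import pandas as pd
import cudf  # ships with RAPIDS

def timed(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

# hypothetical tabular file and columns - substitute a real preprocessing step/dataset
pdf = pd.read_csv("transactions.csv")
gdf = cudf.from_pandas(pdf)

t_cpu = timed(lambda: pdf.groupby("user_id")["amount"].mean())
t_gpu = timed(lambda: gdf.groupby("user_id")["amount"].mean())
print(f"pandas: {t_cpu:.3f}s  cuDF: {t_gpu:.3f}s")
```
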
2 Likes

Same here: I still have a rather old devbox equipped with a couple of NVIDIA oldies, and they’re still enough even for rather heavy competitions, e.g., DICOM data segmentation.

The biggest challenge for me was (and still is, actually) finding a cloud setup with good I/O performance. I tried GCloud with 2–4 V100/A100 GPUs, but the throughput was ridiculous. Sometimes it was slower (!) than running things on a dedicated devbox. I guess it was related to the mounted storage being too slow. It is not a problem for smaller datasets, but for bigger ones it becomes a bottleneck. (Maybe a local, instance-attached disk would help?)

I guess the question of the “optimal” setup has been raised multiple times on the forum. Still, I’m struggling a bit with getting a good cloud setup with a nice cost/performance balance: fast SSD/M.2 storage plus lots of VRAM that isn’t wasted waiting on slow reads from disk.
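
One way to quantify that bottleneck is to time how many items per second the DataLoader alone can deliver from the storage, before any training. A sketch with fastai’s pets dataset standing in for a real one (in practice, point it at whatever lives on the slow mount):

```python
import time
from itertools import islice
from fastai.vision.all import *

# stand-in dataset; replace with data stored on the volume you want to test
path = untar_data(URLs.PETS) / "images"

def is_cat(f): return f[0].isupper()

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224), bs=64)

def items_per_sec(dl, n_batches=50):
    # pull batches without training, to isolate the input pipeline / storage throughput
    t0, n = time.perf_counter(), 0
    for xb, _ in islice(dl, n_batches):
        n += len(xb)
    return n / (time.perf_counter() - t0)

print(f"~{items_per_sec(dls.train):.0f} items/s from this storage")
```

If that number is far below what the GPU can consume (watch GPU utilization during training), the storage or its mount is the bottleneck rather than the accelerator.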

4 Likes