Recommendations on new 2 x RTX 3090 setup

I like your thinking, but unfortunately in this case all of the GPU RAM is currently occupied by the model being trained. And I am currently almost maxing out my 128 GB motherboard RAM during the pipeline processing as is…

I would really try to optimize your pipeline before building another system. If more CPU truly is required, you’d be better off putting your GPUs in the new system rather than building a distributed system solely to offload the CPU dataloading workload. If you can share your code and some info about your dataset on the forums, there are probably people who can give you pointers on how to optimize it.

Are you currently reading your dataset solely off of your NVMe drive, or off an HDD because it’s too big? Are you proposing RAID for speed or just for additional storage capacity? I assume your files are fairly large, which should be optimal for achieving high throughput from your storage. I would be surprised if disk throughput is your problem; more likely it is the deserialization of the files. With JPEGs, for example, decoding the image can be a significant bottleneck and is much more time consuming than the initial load of the file itself from disk into memory.

What format? Do you mean efficient as in reduced storage space (compressed), or efficient as in fast to deserialize? Decompressing highly compressed files can definitely be a bottleneck.

If you’re loading 67M pixels (or voxels?) per input image and reducing that to 0.26M before feeding the data into the GPU, that’s a 256x reduction in size. I would definitely look at pre-processing the initial files into a more optimized dataset for your dataloader if at all possible. I’d need to see examples of your data loading pipeline to give specific suggestions, but if, for example, you’re always downsampling by 2x, then create a separate downsampled version of your dataset and use that, so you don’t have to downsample on the fly every time, which is very wasteful (and slow) from a compute perspective. That would also dramatically improve your system memory footprint.
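To make that concrete, here’s a minimal, hypothetical sketch (pure standard library; a toy nested-list “volume” stands in for a real array, and `downsample2x` is a crude nearest-neighbour stand-in for proper resampling — all names are made up): build the 2x-downsampled copy once, cache it to disk, and have the dataloader read the cached copy on every subsequent epoch.

```python
import pickle
from pathlib import Path

def downsample2x(vol):
    """Crude 2x nearest-neighbour downsample of a 3-D nested-list volume."""
    return [[row[::2] for row in plane[::2]] for plane in vol[::2]]

def load_cached(path, cache_dir="cache"):
    """Return the pre-downsampled copy if cached, else build and cache it once."""
    cache = Path(cache_dir) / (Path(path).stem + ".ds2.pkl")
    if cache.exists():
        return pickle.loads(cache.read_bytes())       # cheap path, every epoch
    vol = pickle.loads(Path(path).read_bytes())       # raw pickled volume
    small = downsample2x(vol)                         # expensive reduction, once
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_bytes(pickle.dumps(small))
    return small
```

In a real pipeline you’d do the same with NumPy/NIfTI files and a proper interpolation routine; the point is only that the expensive reduction happens once, not on the fly every epoch.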

Yup, that’s very fair given the information I have provided. For each raw dataset item I create four patches at increasing resolutions.

The first patch does not undergo any rescaling; it is just cropped from the full-resolution image at 64x64x64. The second patch crops to 128x128x64 and is rescaled to 64x64x64. The third crops to 256x256x128 and is rescaled to 64x64x64, and the fourth rescales from the full image. The network then utilises each of these scales to “upscale” to the next patch resolution.

So I do need a combination of the full res image and the lower res image.
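Roughly, the patch scheme can be sketched like this (a toy version: nested lists instead of real arrays, strided nearest-neighbour picking instead of real interpolation, made-up helper names, and dimensions assumed divisible by the patch size):

```python
def crop_center(vol, dx, dy, dz):
    """Crop a centred dx x dy x dz block from a nested-list volume (x, y, z)."""
    X, Y, Z = len(vol), len(vol[0]), len(vol[0][0])
    x0, y0, z0 = (X - dx) // 2, (Y - dy) // 2, (Z - dz) // 2
    return [[row[z0:z0 + dz] for row in plane[y0:y0 + dy]]
            for plane in vol[x0:x0 + dx]]

def rescale_to(vol, p):
    """Strided nearest-neighbour rescale to p x p x p (stand-in for interpolation)."""
    sx, sy, sz = len(vol) // p, len(vol[0]) // p, len(vol[0][0]) // p
    return [[row[::sz] for row in plane[::sy]] for plane in vol[::sx]]

def patch_pyramid(vol, p=64):
    """Four p^3 patches: full-res crop, then 2x, 4x and full-image rescales."""
    return [
        crop_center(vol, p, p, p),                       # patch 1: no rescale
        rescale_to(crop_center(vol, 2 * p, 2 * p, p), p),        # patch 2
        rescale_to(crop_center(vol, 4 * p, 4 * p, 2 * p), p),    # patch 3
        rescale_to(vol, p),                              # patch 4: whole image
    ]
```

Even here the caching point applies: the rescaled patches depend only on the source volume, so they can be computed once and stored rather than rebuilt every epoch.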

I’ll give more info later, and yes, there is very likely room to solve this through further caching.

Hi @matdmiller,

More clever caching did the trick 🙂. Thanks for steering me clear of a silly idea.

Cheers,
Simon

Hi everyone,

I also have the opportunity to build a system around 2x RTX 3090 Ti Suprim X.
I have seen a few setups documented here, namely from @atao, @balnazzar, @shreeyak and @SimonBiggs, and I still have doubts about which hardware to pick.
What case, mobo and PSU would you choose today? I feel the GPUs are massive, and having them in an arrangement like
GPU GPU GPU EMPTY GPU GPU GPU
or
GPU GPU GPU EMPTY GPU GPU GPU EMPTY
to allow good air circulation, attached directly to the motherboard and without extra paraphernalia like PCIe risers, is a real challenge (or am I missing something here?).
We also want to test an NVLink bridge. Which type should I choose for these 2 GPUs: a 3-slot or a 4-slot bridge?

Cheers

Hello,
I’ll keep it short:
I have 2 Gigabyte 3090 OC cards. It is very hard to find good waterblocks for them, so I am stuck air cooling them.
I wanted to build a PC with one card vertically mounted and the other horizontal.
I don’t have the budget for a CPU with a lot of PCIe lanes, so I will get one of the new Intel 13th-gen CPUs (the 16/20 core one).

Question is:

  1. Is it a good idea to do the one-horizontal, one-vertical setup?
  2. If yes, what case should I get?

Thanks!

@bnascimento My experience with my dual 3090 system:

  • It’s pretty critical that you have a motherboard that allows 4-slot spacing, so that the GPU in the middle, which gets hot air blown on it, has some space to pull in cool air. For me the best value was the ASUS X299 Sage, but that’s probably outdated now. I strongly favored many PCIe slots over some PCIe extender contraption. Just simpler.
  • For me personally, having a fullsize case was really really useful to provide space to pull in more air (more fans) and to just work inside the system.
  • I got an AIO cooler to provide more space for air to flow around. It’s been really great.
  • In deep learning you’re often dataloader-limited, so get a fast CPU. I have a CPU with 24 threads (i9 10940), so 12 threads/GPU, which is good. Having 8 worker threads per GPU is a good baseline. I’m sure there are much better choices than my i9 today, but I’m really glad I got a workstation-class CPU; it’s been bulletproof running multiple simultaneous workloads.
  • Don’t cheap out on memory. I got 64GB and am glad, since you will often want to run training and inference simultaneously, plus other stuff. Running ffmpeg in the shadow of training jobs can eat up memory quickly.
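To illustrate the worker-count point, here’s a tiny stdlib-only sketch (all names are made up; real PyTorch DataLoaders use worker *processes* via `num_workers` — threads are used here only to keep the example self-contained): a heuristic for splitting CPU threads across GPUs, plus a pool of workers decoding a batch in parallel.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def workers_per_gpu(n_gpus, reserve=2):
    """Split CPU threads across GPUs, keeping a couple free for the main process."""
    threads = os.cpu_count() or 1
    return max(1, (threads - reserve) // max(1, n_gpus))

def decode(item):
    """Stand-in for an expensive per-sample decode (JPEG decompress, resample, ...)."""
    return item * 2

def load_batch(items, n_workers):
    """Decode a batch in parallel, the way DataLoader workers would."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(decode, items))
```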

I used this case in a build last year and was pretty happy with it because it’s large, which makes it easy to work in, and the backside (towards the front), bottom and top are all available for fans. Lambda Labs uses this case for 3x 3090 air-cooled builds. I am using 2x 3090 hybrid cards and a hybrid cooler for the CPU, so I had to 3D print a front panel to be able to mount all of my radiators and still have the bottom available for air intake. Putting radiators below the cards is bad practice anyway, because air can get trapped in the pump if the radiator sits below it. PC-O11D XL - E-ATX/ATX Full Tower Gaming Computer Case

You definitely want at least 1 empty slot between your cards; they will run hotter regardless, but it should work. If you can get 1 (or preferably more) empty slots between your cards (if the motherboard allows it), then I would not worry about vertical mounting, as it adds complexity and cost without much benefit. If you cannot get at least 1 slot between your cards, then vertically mounting one of them is probably needed to keep them from overheating.

This is the build I’m referring to. Sorry the picture quality is kind of crappy. I am using 2x 2-slot hybrid (water) cooled cards plus an air-cooled 1080 Ti at the bottom. This motherboard has 7 PCIe 4.0 x16 slots, so there was plenty of room between the cards. As you can see, this case is quite large, and you can buy a vertical-mount GPU bracket if required.

@matdmiller
Thank you for the response!
Thing is, I am worried about water cooling the GPUs since it is my first time doing it, and my first time water cooling in general.
What is the maintenance of a water-cooled dual RTX 3090 setup like? Is it manageable? And is it safe to water cool GPUs for years? I am planning on keeping this setup for many years, so longevity is a factor here.

I am using ‘hybrid’ cards, which come fully assembled from the factory with flexible hoses and sealed radiators. These cards are much easier to set up and are probably less prone to leaks than a custom water loop. They require zero maintenance. Since your cards are already air cooled, I doubt it’s worth trying to convert them, and I would expect that with sufficient spacing and case airflow you will be fine. If it were me, I would not try to convert them to water cooling; I would just pick a motherboard that allows 4 (or more) slot spacing (2 empty slots between the cards) and a case with good airflow.

The reason I went with hybrid cards is that I think it’s easier to keep them cool: they generally run cooler than air-cooled cards (when you have multiple cards, at least), and they take up less space (2 slots instead of 3+ for the 3090), so you can pack more in and have a little more flexibility in motherboard selection. I recently took my ~5-year-old 1080 Ti hybrid out of service and replaced it with a 3090, and I never had any issues with leaks. The only issue I did have was overheating at one point, which I fixed by re-pasting, but I do not believe this was due to it being a hybrid card. I’m guessing the paste going bad was just due to a lot of hot/cold cycles and age.

Hi there guys, I was wondering if it is possible to mix two types of cards, say:

  • 2080 and a 3090.
  • 3080 and a 3090.
  • 3090 and a 4090.
  • 2080 and a 4090.

Or should the cards be twins?

If you’re trying to use both GPUs to train a single model in parallel, then it’s best if they match. If you want to run separate experiments on each independently, then mixing GPUs is totally fine. Training separate models on each GPU independently is what most people do and is generally recommended, though there are some cases where using 2 GPUs to train a single model is warranted.

Well, I’m more interested in learning/using DistributedDataParallel in particular, and maybe other ways to do parallel training (not sure how many ways there are), but I don’t think I will have 2 GPUs that match.

You’ll be able to use DistributedDataParallel, but you’ll be limited to the performance of your least powerful GPU.
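To see why, here’s a toy single-process sketch of what DistributedDataParallel does each step (made-up names, plain Python floats instead of tensors): every replica computes a gradient on its own data shard, the gradients are averaged (the all-reduce), and every replica applies the identical update. Because the all-reduce synchronizes all replicas, each step takes as long as the slowest GPU, so a mismatched pair runs at the weaker card’s pace.

```python
def local_grad(w, shard):
    """Per-worker gradient of 0.5*(w*x - y)^2 averaged over its data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def ddp_step(w, shards, lr=0.1):
    """One DDP-style step: local grads, then all-reduce average, then update."""
    grads = [local_grad(w, s) for s in shards]   # each "GPU" on its own shard
    avg = sum(grads) / len(grads)                # all-reduce (average); this sync
                                                 # waits for the slowest replica
    return w - lr * avg                          # identical update on every replica
```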

Hi @SimonBiggs , Thanks for your kind words. I missed your message probably since it was posted back in the middle of Aug when I was on vacation.

I can offer generic advice, since it’s always a tough call to diagnose bottlenecks without experimenting 1st-person with them…

  1. If you work with images, bear in mind that Intel CPUs still have a substantial advantage due to their MKL optimizations. AMD is closing the gap but is not quite there yet.
    On eBay you’ll find plenty of 24c/32c 2nd-gen Scalable Xeons sold at exceptionally low prices. I bought a 24c 3.1 GHz part two years ago for 350 EUR. Big cache, 48 threads, support for 2 TB of ECC RDIMM memory.

  2. A modern NVMe disk can output 7 GB/s sequential. No AI-related use case whatsoever will saturate that bandwidth, period.
    A totally different side of the question is the random performance of these SSDs. They are pretty weak at random access. If you get 70 MB/s (yes, megabytes) at 4KQ1T1, you are lucky. Generally, you stay below 50 MB/s.
    Optane-class drives (SLC), on the other hand, shine at exactly this (hence their price premium). I use two Samsung Z-Drives, but currently the P5800X by Intel is the fastest, baddest, meanest SSD on Earth.
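If you want a ballpark for where your own drive lands on that 4KQ1T1 scale, something like this stdlib-only sketch works (a hypothetical helper, not a proper benchmark; fio is the real tool, and the OS page cache will flatter the number unless the file is much larger than RAM):

```python
import os
import random
import time

def random_read_mb_s(path, block=4096, n_reads=2048):
    """Rough 4K, queue-depth-1 random-read throughput (MB/s) on an existing file."""
    size = os.path.getsize(path)
    rng = random.Random(0)                      # fixed seed: repeatable offsets
    t0 = time.perf_counter()
    with open(path, "rb", buffering=0) as f:    # unbuffered: one syscall per read
        for _ in range(n_reads):
            f.seek(rng.randrange(0, max(1, size - block)))
            f.read(block)
    dt = time.perf_counter() - t0
    return (n_reads * block) / dt / 1e6
```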

Hope this helps.

Seems like these days a researcher could purchase either:

  • a used or discounted RTX A6000: ~$3,500, ~300 W
  • two RTX 3090s: ~$2,000, ~700 W, ~1.8-2x faster than a single A6000
  • one undervolted 4090: ~$1,600, ~450 W, ~1.3-1.4x faster than a 3090

The total watts per training run using an undervolted 4090 may make it the ideal low-cost, low-cooling-effort solution for development work before scaling up, considering the cooling on these cards is much more substantial than on the 3000 series.

If, and only if, 24 GB of VRAM is sufficient for your present and near-future usage scenarios.

The really cool thing about the A6000 is not the speed but the huge amount of VRAM (without the PITA of parallelizing two 3090/4090s, which is not even always feasible).

1 Like

Yeah, I hear you. I’m still struggling to justify the initial $4k cash outflow. And I’m honestly not super familiar with this solution, but is it viable to use a package such as GitHub - tensorflow/mesh (Mesh TensorFlow: Model Parallelism Made Easier) to scale up to another 4090 at a future date if you need the pooled 48 GB of memory? Seems like model parallelism is becoming more standardized.

I’m currently looking at used (mining or otherwise) 3090 Tis for a dual(+) setup, the reason being the pooling option with NVLink. From what I understood, it’s possible to have an effective 48 GB from 2x 3090s, but the same cannot be done on the 4090s, since NVLink was dropped with the new generation.

Have I blundered in my understanding here? What are your thoughts?

It’s my understanding that NVLink never pooled memory for deep learning directly. I think the criticism of NVIDIA dropping NVLink comes mainly from graphics professionals who were using it to pool 3090 memory for certain rendering programs.

In your example, I think you would still see two separate 24 GB GPUs with two 3090 Tis; they would probably be able to communicate faster over NVLink than over PCIe 4.0, but the memory would not appear pooled. If you want to use their combined memory to train a larger model, then you would have to use something to help with that (ZeRO, Colossal-AI) or code your own solution. Here’s a good link: Efficient Training on Multiple GPUs
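The idea behind ZeRO stage 3 (and similar approaches like PyTorch FSDP) can be shown with a toy stdlib sketch (made-up names, plain lists standing in for parameter tensors): each device permanently stores only its shard of the parameters, and the full set is gathered just-in-time when a layer needs it. Per-device memory scales down with the number of devices, even though no memory is literally pooled.

```python
def shard_params(params, n_devices):
    """Split a flat parameter list into contiguous per-device shards (ZeRO-3 idea)."""
    k = (len(params) + n_devices - 1) // n_devices   # shard size, rounded up
    return [params[i * k:(i + 1) * k] for i in range(n_devices)]

def all_gather(shards):
    """Reassemble the full parameter list just-in-time for a layer's forward pass."""
    return [p for shard in shards for p in shard]
```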