For anything more than 2 GPUs, PCIe lanes become an important factor.
I have also experimented a bit with different numbers of GPUs on Google Cloud (GCP), and anything more than 2 GPUs running in parallel seemed handicapped by PCIe lane speed on the 8x V100 instances. The NVLink topology there was laid out in a way that did not help scale to 4 GPUs in parallel. Of course, I could still run 4 separate models on the 8 GPUs with better speedup (4 models, each using 2 GPUs).
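If you want to reproduce that kind of comparison, a minimal sketch along these lines (the toy model, batch size, and device IDs are just placeholders, not my actual training setup) shows how throughput changes when you hand `nn.DataParallel` more GPUs:

```python
import time
import torch
import torch.nn as nn

def benchmark(device_ids, iters=50, batch=256):
    # Toy model and data just to keep the GPUs busy; swap in your real workload.
    model = nn.Sequential(
        nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)
    ).cuda(device_ids[0])
    model = nn.DataParallel(model, device_ids=device_ids)
    x = torch.randn(batch, 4096, device=f"cuda:{device_ids[0]}")
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x).sum().backward()  # forward + backward to exercise gradient traffic
    torch.cuda.synchronize()
    return iters * batch / (time.time() - start)  # samples/sec

for ids in ([0, 1], [0, 1, 2, 3]):
    print(ids, f"{benchmark(ids):.0f} samples/sec")
```

If the setup is PCIe-bound, the samples/sec figure grows noticeably less than linearly once you go past 2 GPUs.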
I don’t know what the NVLink topology of the AWS V100 instances looks like, but I guess there are better topologies tuned for DL training. Specifically, the DGX-2 topology seems better than GCP’s.
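To check what a given instance actually gives you, `nvidia-smi topo -m` prints the interconnect matrix (NV# links vs. PIX/PHB/SYS over PCIe), and you can also ask PyTorch which GPU pairs have peer-to-peer access; a quick sketch:

```python
import torch

n = torch.cuda.device_count()
print(f"{n} GPUs visible")
# Which GPU pairs can do direct peer-to-peer transfers (NVLink or PCIe P2P)?
# Pairs without P2P have to stage copies through host memory, which hurts scaling.
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: {'P2P' if ok else 'no P2P'}")
```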
Here are my 3 posts with the details of my analysis of GCP's 8-GPU setup.
I wonder how you connected 6 GPUs. What motherboard and CPU do you have, and how many PCIe lanes?