Lesson 1 training after data augmentation seems extremely slow on threadripper. could someone compare training times?

Hey, thanks a lot

I was able to debug this further, and it seems to be due to the use of MKL in the Docker build I took from here, with this Dockerfile

What I found is that just having MKL installed makes it default to 32 “threads”. I air-quote that because a single MKL thread somehow uses 8 Threadripper cores, so the default of 32 threads is causing that congestion.

At least that’s the behavior I observed: if I manually set it to 1 thread via mkl.set_num_threads(1) (from https://docs.anaconda.com/mkl-service/), it uses 8 logical Threadripper cores, which also brings it down to your 47 seconds per epoch. 2 threads => 16 cores slows it down slightly as well.
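In case it helps anyone else hitting this, here is a minimal sketch of how I cap the thread count. It assumes the mkl-service package (conda install mkl-service) is available, and falls back to the standard environment variables when it isn’t; the env vars only take effect if set before the MKL runtime is loaded (i.e. before importing numpy/torch):

```python
import os

try:
    import mkl
    # 1 MKL thread still fanned out to 8 Threadripper cores in my runs
    mkl.set_num_threads(1)
    threads = mkl.get_max_threads()
except ImportError:
    # Fallback: must be set before numpy/torch load the MKL runtime
    os.environ["MKL_NUM_THREADS"] = "1"
    os.environ["OMP_NUM_THREADS"] = "1"
    threads = 1

print(threads)
```

Note that MKL_NUM_THREADS is ignored once the library has already initialized, which is why the mkl-service call is the more reliable option inside a notebook.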

Confusing. The performance with MKL is still better than without, but I have no idea where that 1 thread => 8 cores mapping comes from, or better yet, how to predict it. Do you have MKL enabled? I guess I could remove it completely, but it does bring a small speed gain.

Using CUDA 10 on Ubuntu 18.04.