Same thing.
Python/PyTorch allocates the GPU memory, but doesn't use the GPU itself for compute. Instead, it runs on the CPU.
On CPU, learn.fit_one_cycle(4) from the first lesson took about 1:30 per cycle, with delays; with num_workers = 0, about 3 minutes.
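For anyone debugging the same thing, here's a quick way to double-check the placement (plain PyTorch, nothing fastai-specific; `learn` is the Learner from the lesson notebook):

```python
import torch

# CUDA is visible and the card is detected...
print(torch.cuda.is_available())              # True
print(torch.cuda.get_device_name(0))          # GeForce GTX 1080 Ti

# ...and the model parameters sit on the GPU (hence the dedicated memory),
# yet training still runs on the CPU.
print(next(learn.model.parameters()).device)  # cuda:0
```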
What I’ve tested so far:
- A totally clean Windows 10 install.
- Latest Nvidia driver.
- Both CUDA 10.1 and 10.0.
- Different Python environments.
- num_workers = 0.
- torch.cuda.is_available() returns True, but it makes no difference.
The result is the same.
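A raw matmul timing is an easy way to confirm where the compute actually runs. Something like this (sizes are arbitrary) should show the GPU winning by a wide margin on a healthy setup:

```python
import time
import torch

a = torch.randn(4000, 4000)
b = torch.randn(4000, 4000)

# CPU baseline
t0 = time.time()
_ = a @ b
print(f'cpu: {time.time() - t0:.3f}s')

# Same matmul on the GPU
a, b = a.cuda(), b.cuda()
_ = a @ b                   # warm-up: the first CUDA call pays init costs
torch.cuda.synchronize()
t0 = time.time()
_ = a @ b
torch.cuda.synchronize()    # CUDA ops are async, so sync before stopping the clock
print(f'gpu: {time.time() - t0:.3f}s')
```

If the two numbers come out about the same, the card isn't doing the work.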
I have a Threadripper CPU and a GTX 1080 Ti GPU.
I ended up installing Kubuntu alongside Windows. After setting everything up, one cycle takes 17 seconds.
Totally worth it.