Training a Language Model is a lot slower in v1 (must be a stupid mistake)

Hi,
I’m currently training a Language Model from Wikipedia with fast.ai v1. I’m using 50k tokens, just as I did with v0.7, and everything else is as similar as I could make it. The main difference is that I’m now using a batch size of 64 instead of the 32 I used before, but that should only speed things up, not slow them down. One epoch takes an estimated 15h, whereas before it took about 2h40m. I feel like I must be missing something very obvious, so any help is appreciated. I’ve checked that the GPU is actually being used:


torch.cuda.current_device()
0

torch.cuda.device(0)
<torch.cuda.device at 0x2aa1d1857b8>

torch.cuda.device_count()
1

torch.cuda.get_device_name(0)
'Quadro P6000'
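
In case it helps anyone else debug a similar setup, the individual checks above can be bundled into one helper. This is just a sketch (the function name `cuda_sanity_check` is my own), using only standard `torch.cuda` calls:

```python
import torch

def cuda_sanity_check():
    """Summarize CUDA availability and the active GPU in one dict."""
    info = {"cuda_available": torch.cuda.is_available()}
    if info["cuda_available"]:
        idx = torch.cuda.current_device()
        info["current_device"] = idx
        info["device_count"] = torch.cuda.device_count()
        info["device_name"] = torch.cuda.get_device_name(idx)
    return info

print(cuda_sanity_check())
```

On my machine this reports one device, a Quadro P6000, as shown above.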

Tue Mar 12 14:33:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.96       Driver Version: 418.96       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000       WDDM  | 00000000:65:00.0  On |                  Off |
| 36%   71C    P0    78W / 250W |   6376MiB / 24576MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+

Edit: here’s the nvidia-smi dmon output during training (sm goes up to ~90%, so everything seems OK):

# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0   207    52     -    36    24     0     0  4513  1569
    0   220    51     -    66    43     0     0  4513  1632
    0    68    52     -    63    43     0     0  4513  1569
    0   185    53     -    43    27     0     0  4513  1569
    0   183    53     -    68    47     0     0  4513  1645
    0    69    54     -    52    39     0     0  4513  1569
    0   209    53     -    53    32     0     0  4513  1632
    0   135    53     -    78    48     0     0  4513  1544
    0   156    54     -    36    26     0     0  4513  1544
    0   136    54     -    74    50     0     0  4513  1632
    0    67    54     -    54    39     0     0  4513  1544
    0   203    55     -    41    22     0     0  4513  1556
    0   179    55     -    37    27     0     0  4513  1556

It’s also >15h on another machine (GTX 1080). I also changed the chunksize, but that should have no influence (imo). I’ve attached the notebook in case there’s a super-obvious error I’m missing (change the extension from .pdf to .ipynb to view it, as that extension wasn’t allowed).

lm_de.pdf (54.0 KB)