Method 'dump' of '_pickle.Pickler' uses 80%+ of runtime in v1

I noticed extremely slow run times on my local machine (3+ minutes) compared to the same notebook on Colab (7 seconds). This is running the example tabular.ipynb model (https://github.com/fastai/fastai/blob/master/examples/tabular.ipynb).

I exported the notebook to a script and profiled it with cProfile, and something odd is going on with the pickler: it accounts for over 80% of the total run time, and the run time comes almost entirely from the learn.fit() call:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       32  176.021    5.501  176.036    5.501 {method 'dump' of '_pickle.Pickler' objects}
      131   13.982    0.107   13.983    0.107 {built-in method _imp.create_dynamic}
     5360    5.150    0.001    5.150    0.001 {built-in method nt.stat}
      505    2.435    0.005    2.435    0.005 {method 'run_backward' of 'torch._C._EngineBase' objects}
     1555    2.264    0.001    2.715    0.002 {method 'to' of 'torch._C._TensorBase' objects}
     1039    1.941    0.002    2.091    0.002 <frozen importlib._bootstrap_external>:830(get_data)
       73    1.736    0.024    1.736    0.024 {built-in method _winapi.WaitForSingleObject}
    28792    1.552    0.000    1.552    0.000 {method 'mul_' of 'torch._C._TensorBase' objects}
     1527    1.182    0.001    1.182    0.001 {built-in method batch_norm}
    19197    0.954    0.000    0.954    0.000 {method 'add_' of 'torch._C._TensorBase' objects}
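
For reference, this is roughly how I collected the profile above (the script name is just a placeholder for my exported notebook):

    # dump profile stats to a file:
    #   python -m cProfile -o profile.out tabular_example.py
    # then inspect the hot spots:
    import pstats
    stats = pstats.Stats('profile.out')
    stats.sort_stats('tottime').print_stats(10)  # top 10 by internal time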

Has anyone else experienced similarly slow training times due to the pickler?

Here is my environment (Windows 10 with a Quadro M1000M GPU):
beautifulsoup4==4.7.1
Bottleneck==1.2.1
certifi==2019.3.9
chardet==3.0.4
cycler==0.10.0
cymem==2.0.2
cytoolz==0.9.0.1
dataclasses==0.6
dill==0.2.9
fastai==1.0.48
fastprogress==0.1.20
idna==2.8
kiwisolver==1.0.1
matplotlib==3.0.3
msgpack==0.5.6
msgpack-numpy==0.4.3.2
murmurhash==1.0.2
numexpr==2.6.9
numpy==1.16.2
nvidia-ml-py3==7.352.0
packaging==19.0
pandas==0.24.1
Pillow==5.4.1
plac==0.9.6
preshed==2.0.1
pyparsing==2.3.1
python-dateutil==2.8.0
pytz==2018.9
PyYAML==5.1
regex==2018.1.10
requests==2.21.0
scipy==1.2.1
six==1.12.0
soupsieve==1.8
spacy==2.0.18
thinc==6.12.1
toolz==0.9.0
torch==1.0.1
torchvision==0.2.2.post3
tqdm==4.31.1
typing==3.6.6
ujson==1.35
urllib3==1.24.1
wincertstore==0.2
wrapt==1.10.11

Windows is the problem. I also had very slow run times on Windows 10. Switching to Ubuntu 18.04 sped things up by a factor of up to 25, which also fits your numbers.

The main problem on Windows is process creation: there is no fork, so every DataLoader worker has to be spawned, and spawning means re-importing everything and pickling the dataset to hand it to the child process. That is where the '_pickle.Pickler' time in your profile comes from. You can set the number of workers to 1, so that almost no parallel processing is done on the CPU side; this at least speeds things up (see the sketch below). Nevertheless, Ubuntu is on average still 4 times faster than Windows running with no parallel workers. And it seems that this is not easily fixable :frowning:
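
A minimal sketch of where that knob lives, assuming the same pipeline as the tabular example notebook; num_workers is forwarded through fastai v1 to the underlying PyTorch DataLoader:

    from fastai.tabular import *

    path = untar_data(URLs.ADULT_SAMPLE)
    df = pd.read_csv(path/'adult.csv')
    dep_var = 'salary'
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
    procs = [FillMissing, Categorify, Normalize]

    data = (TabularList.from_df(df, path=path, cat_names=cat_names, procs=procs)
            .split_by_idx(list(range(800, 1000)))
            .label_from_df(cols=dep_var)
            # num_workers=1 keeps a single spawned worker; num_workers=0 loads
            # data in the main process and avoids spawning/pickling entirely
            .databunch(num_workers=1))

    learn = tabular_learner(data, layers=[200, 100], metrics=accuracy)
    learn.fit(1, 1e-2)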