PyTorch v1.0 stable is now working on Windows, but fastai v1 needs some tweaks to get it to work on Windows

hwasiti · December 8, 2018, 2:33am

PyTorch v1.0.0 stable has been released, and the Windows binaries are ready too. I have installed it on my Windows 10 machine.

Here is how to install PyTorch 1.0.0 stable on Windows:

Create a new conda env:

conda create -n fastai-v1
conda activate fastai-v1

Install PyTorch v1.0.0 stable: choose Windows, your Python version and your CUDA version (for me Python 3.6, CUDA 9.2) at:
https://pytorch.org/get-started/locally/

pip is equivalent to pip3 in conda, so you can install torchvision with:
pip install torchvision
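
Once both packages are installed, a quick sanity check from Python can confirm that the environment picked up the intended build. This is only a sketch: `describe_torch_install` is a hypothetical helper name, and it relies on nothing beyond `torch.__version__` and `torch.cuda.is_available()`.

```python
import importlib.util


def describe_torch_install():
    """Return a one-line summary of the PyTorch install in the active env."""
    # find_spec avoids an ImportError when torch is missing from this env.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch
    return f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"


print(describe_torch_install())
```

If CUDA shows as unavailable, the usual suspects are a CPU-only build or a driver that does not match the CUDA version chosen on the install page.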

Install fastai v1:
conda install -c fastai fastai

Clone the course-v3 notebooks from the fastai repo (first navigate to the working folder you want to clone them into):

cd E:\projects
git clone https://github.com/fastai/course-v3.git

I tried lesson6-pets-more notebook and learn.fit_one_cycle(3, slice(1e-2), pct_start=0.8) gave me this error:

---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-14-e7e742ed4478> in <module>
----> 1 learn.fit_one_cycle(3, slice(1e-2), pct_start=0.8)

~\Anaconda3\envs\fastai-v1\lib\site-packages\fastai\train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     18 callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     19 pct_start=pct_start, **kwargs))
---> 20 learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     21
     22 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~\Anaconda3\envs\fastai-v1\lib\site-packages\fastai\basic_train.py in fit(self, epochs, lr, wd, callbacks)
    160 callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    161 fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 162 callbacks=self.callbacks+callbacks)
    163
    164 def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~\Anaconda3\envs\fastai-v1\lib\site-packages\fastai\basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92 except Exception as e:
     93 exception = e
---> 94 raise e
     95 finally: cb_handler.on_train_end(exception)
     96

~\Anaconda3\envs\fastai-v1\lib\site-packages\fastai\basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     80 cb_handler.on_epoch_begin()
     81
---> 82 for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83 xb, yb = cb_handler.on_batch_begin(xb, yb)
     84 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)

~\Anaconda3\envs\fastai-v1\lib\site-packages\fastprogress\fastprogress.py in __iter__(self)
     63 self.update(0)
     64 try:
---> 65 for i,o in enumerate(self._gen):
     66 yield o
     67 if self.auto_update: self.update(i+1)

~\Anaconda3\envs\fastai-v1\lib\site-packages\fastai\basic_data.py in __iter__(self)
     46 "Process and returns items from `DataLoader`."
     47 assert not self.skip_size1 or self.batch_size > 1, "Batch size cannot be one if skip_size1 is set to True"
---> 48 for b in self.dl:
     49 y = b[1][0] if is_listy(b[1]) else b[1]
     50 if not self.skip_size1 or y.size(0) != 1: yield self.proc_batch(b)

~\Anaconda3\envs\fastai-v1\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
    817
    818 def __iter__(self):
--> 819 return _DataLoaderIter(self)
    820
    821 def __len__(self):

~\Anaconda3\envs\fastai-v1\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
    558 # before it starts, and __del__ tries to join but will get:
    559 # AssertionError: can only join a started process.
--> 560 w.start()
    561 self.index_queues.append(index_queue)
    562 self.workers.append(w)

~\Anaconda3\envs\fastai-v1\lib\multiprocessing\process.py in start(self)
    110 'daemonic processes are not allowed to have children'
    111 _cleanup()
--> 112 self._popen = self._Popen(self)
    113 self._sentinel = self._popen.sentinel
    114 # Avoid a refcycle if the target function holds an indirect

~\Anaconda3\envs\fastai-v1\lib\multiprocessing\context.py in _Popen(process_obj)
    221 @staticmethod
    222 def _Popen(process_obj):
--> 223 return _default_context.get_context().Process._Popen(process_obj)
    224
    225 class DefaultContext(BaseContext):

~\Anaconda3\envs\fastai-v1\lib\multiprocessing\context.py in _Popen(process_obj)
    320 def _Popen(process_obj):
    321 from .popen_spawn_win32 import Popen
--> 322 return Popen(process_obj)
    323
    324 class SpawnContext(BaseContext):

~\Anaconda3\envs\fastai-v1\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     63 try:
     64 reduction.dump(prep_data, to_child)
---> 65 reduction.dump(process_obj, to_child)
     66 finally:
     67 set_spawning_popen(None)

~\Anaconda3\envs\fastai-v1\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59 '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60 ForkingPickler(file, protocol).dump(obj)
     61
     62 #

PicklingError: Can't pickle <function crop_pad at 0x000001F60EEF3D08>: it's not the same object as fastai.vision.transform.crop_pad
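
The message is easier to read once you know that pickle stores a plain function as a reference (module plus name) and, when dumping, checks that looking the name up again returns the very same object. On Windows the DataLoader workers are started through `popen_spawn_win32.py`, as the traceback shows, so the worker process object is pickled before it starts. Below is a minimal standalone sketch of one way such a mismatch can arise. The `Wrapper` class and the local `crop_pad` function are made up for illustration and are not fastai's actual code.

```python
import pickle


class Wrapper:
    """Hypothetical stand-in for an object that wraps a plain function."""

    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        return self.func(*args, **kwargs)


def crop_pad(x):
    """Illustrative function only; it just returns its input."""
    return x


original = crop_pad            # keep a handle on the plain function
crop_pad = Wrapper(crop_pad)   # the module-level name now points at the wrapper


def try_pickle(obj):
    """Return None on success, or the PicklingError message on failure."""
    try:
        pickle.dumps(obj)
        return None
    except pickle.PicklingError as err:
        return str(err)


# pickle looks up the name "crop_pad" in the defining module, does not get
# `original` back, and therefore refuses to serialize the function.
print(try_pickle(original))
```

With `num_workers=0` no worker process is started, so nothing has to be pickled and this check is never reached.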

bsalita · December 9, 2018, 9:01pm

The PicklingError on Windows is discussed here: https://forums.fast.ai/t/pytorch-1-0-is-officially-released/32212

sgugger · December 9, 2018, 11:28pm

Windows support will be added in the next few weeks. We have a big change to make behind the scenes for the pickling error mentioned so we’ll focus on this after the last course. For now, there’s still no support of fastai v1 for Windows.

jeremy · December 10, 2018, 7:52pm

I’ve just pushed support for Windows that handles everything except for 3 transforms with num_workers>0:

pad
crop
crop_pad

(note that default get_transforms uses these, so you’ll need a custom transforms list.)

You’ll need github master for this BTW.

larcat · December 10, 2018, 8:11pm

You’re a boss

jeremy · December 10, 2018, 8:42pm

Thanks to @313V this limitation is now removed in master

balnazzar · December 11, 2018, 7:40pm

@hwasiti

Having a fastai v1 environment on Windows is great. You should be aware, however, that with v0.7 I always found the same notebooks ran substantially slower on Windows (50-80% slower). The same held for TensorFlow/Keras.

I don’t know why, maybe a matter of OS primitives.

hwasiti · December 13, 2018, 1:35am

I remember a tweet from Jeremy saying that the performance on Windows was good with PyTorch 0.3. Wait, I will search for it.

Here it is:

larcat · December 13, 2018, 3:06pm

Anyone have a brief sketch of how to build the lib from source?

I’ve done it in R (my ‘native’ language), but never in Python/Anaconda…

Excited about this because I need to be on windows during the workday, and my GPU is basically sitting idle…

bsalita · December 20, 2018, 1:29pm

I’ve been testing lesson1-pets.ipynb on Windows 10 and comparing against Ubuntu 18.04. Windows 10 seems to give the same results as Ubuntu. Unfortunately, Windows is 5x slower than Ubuntu (60m vs 14m) with identical hardware (GPU) and software. CPU-only is impossibly slow on both systems. Windows monitors show that the hardware is heavily underutilized (20%), whereas on Ubuntu it is often at 100%.

Is anyone else experiencing 5x slowness? Is it due to pickle or another library not running multi-threaded?

Bronzi88 · December 29, 2018, 5:05pm

I’m able to reproduce the slowness compared to Ubuntu on my machine too (10:12 vs 1:54). Most of the time is consumed between two epochs or before validation starts, so it is either the data preprocessing or the initial transfer of the data to the GPU. In the next few days, I will test whether TensorFlow shows the same slowdown compared to Ubuntu.

Interogativ · January 15, 2019, 8:16pm

I was able to confirm this as well. Note that to run the lesson1-pets notebook on Windows, the regular expression for parsing the filenames must be changed from:
pat = re.compile(r'/([^/]+)_\d+.jpg$')

to

pat = re.compile(r'\\([^\\]+)_\d+.jpg$')

because Linux uses forward slashes to separate path components and Windows uses backslashes.
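
If the same notebook has to run on both systems, one option is a pattern that accepts either separator instead of editing it per platform. This is a sketch rather than the notebook's own code: `label_from_path` is a hypothetical helper, the sample file names are invented, and the dot before `jpg` is escaped here so that it only matches a literal dot.

```python
import re

# Accept either "/" or "\" as the separator in front of the file name.
pat = re.compile(r'[\\/]([^\\/]+)_\d+\.jpg$')


def label_from_path(path_str):
    """Return the label part of a file name, or None if the pattern misses."""
    match = pat.search(str(path_str))
    return match.group(1) if match else None


print(label_from_path(r'C:\data\images\great_pyrenees_12.jpg'))
print(label_from_path('/data/images/great_pyrenees_12.jpg'))
```

Returning None instead of calling `.group(1)` unconditionally also makes a non-matching file name easier to spot than an AttributeError deep inside the library.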

Interogativ · January 16, 2019, 11:18pm

Not sure if this issue is the reason: “in windows, DataLoader with num_workers > 0 is extremely slow (50 times slower)”. It would appear to be about process creation.

bsalita · January 18, 2019, 9:23am

I agree, though I haven’t looked at the code. The solution will be a vastly improved process/thread lifecycle manager with changes to process/thread communications, similar to a service architecture.

sgtpepper · January 27, 2019, 3:26am

I was able to install fastai v1 and pytorch on Windows 10 using conda. I’m using python 3.7.1. However, when I run through lesson1-pets I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-15e5d1d9602d> in <module>
----> 1 data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=224, bs=bs
      2                                   ).normalize(imagenet_stats)

C:\anaconda\envs\v3\lib\site-packages\fastai\vision\data.py in from_name_re(cls, path, fnames, pat, valid_pct, **kwargs)
    153         pat = re.compile(pat)
    154         def _get_label(fn): return pat.search(str(fn)).group(1)
--> 155         return cls.from_name_func(path, fnames, _get_label, valid_pct=valid_pct, **kwargs)
    156 
    157     @staticmethod

C:\anaconda\envs\v3\lib\site-packages\fastai\vision\data.py in from_name_func(cls, path, fnames, label_func, valid_pct, **kwargs)
    146         "Create from list of `fnames` in `path` with `label_func`."
    147         src = ImageItemList(fnames, path=path).random_split_by_pct(valid_pct)
--> 148         return cls.create_from_ll(src.label_from_func(label_func), **kwargs)
    149 
    150     @classmethod

C:\anaconda\envs\v3\lib\site-packages\fastai\data_block.py in _inner(*args, **kwargs)
    386         assert isinstance(fv, Callable)
    387         def _inner(*args, **kwargs):
--> 388             self.train = ft(*args, **kwargs)
    389             assert isinstance(self.train, LabelList)
    390             kwargs['label_cls'] = self.train.y.__class__

C:\anaconda\envs\v3\lib\site-packages\fastai\data_block.py in label_from_func(self, func, **kwargs)
    240     def label_from_func(self, func:Callable, **kwargs)->'LabelList':
    241         "Apply `func` to every input to get its label."
--> 242         return self.label_from_list([func(o) for o in self.items], **kwargs)
    243 
    244     def label_from_folder(self, **kwargs)->'LabelList':

C:\anaconda\envs\v3\lib\site-packages\fastai\data_block.py in <listcomp>(.0)
    240     def label_from_func(self, func:Callable, **kwargs)->'LabelList':
    241         "Apply `func` to every input to get its label."
--> 242         return self.label_from_list([func(o) for o in self.items], **kwargs)
    243 
    244     def label_from_folder(self, **kwargs)->'LabelList':

C:\anaconda\envs\v3\lib\site-packages\fastai\vision\data.py in _get_label(fn)
    152         "Create from list of `fnames` in `path` with re expression `pat`."
    153         pat = re.compile(pat)
--> 154         def _get_label(fn): return pat.search(str(fn)).group(1)
    155         return cls.from_name_func(path, fnames, _get_label, valid_pct=valid_pct, **kwargs)
    156 

AttributeError: 'NoneType' object has no attribute 'group'

Interogativ · January 27, 2019, 6:17pm

See my post above about the regular expression used to parse the filenames

mgorecki · January 27, 2019, 10:27pm

I’ve completed lesson one on Windows 10 (CUDA 10, GTX 1070), but the performance is indeed very poor. Most of the time is spent on the CPU. I ran a profiler, and most of the time is spent in grid_sampler. The GPU usage is very low.

command:
%prun learn.fit_one_cycle(2, max_lr=slice(1e-6,1e-4))

result:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    14732   55.815    0.004   55.815    0.004 {built-in method grid_sampler}
    32074   41.458    0.001   41.458    0.001 {method 'decode' of 'ImagingDecoder' objects}
      234   37.108    0.159   37.108    0.159 {method 'cpu' of 'torch._C._TensorBase' objects}
    14732   29.377    0.002   29.377    0.002 {method 'clone' of 'torch._C._TensorBase' objects}
    20436   17.061    0.001   17.061    0.001 {built-in method addmm}
    11010   15.233    0.001   15.233    0.001 {method 'sigmoid' of 'torch._C._TensorBase' objects}
    29696   12.904    0.000   12.904    0.000 {method 'contiguous' of 'torch._C._TensorBase' objects}
    49653   12.548    0.000   12.548    0.000 {method 'mul' of 'torch._C._TensorBase' objects}
    14732   10.539    0.001   10.539    0.001 {method 'astype' of 'numpy.ndarray' objects}
      941    7.764    0.008    7.764    0.008 {built-in method torch._C._nn.adaptive_avg_pool2d}
    14732    7.158    0.000    8.627    0.001 image.py:517(affine_grid)
    30006    5.989    0.000    5.989    0.000 {method 'add' of 'torch._C._TensorBase' objects}
    10672    3.987    0.000    3.987    0.000 {method 'zero' of 'torch._C._TensorBase' objects}
    55206    3.357    0.000  104.106    0.002 image.py:116(refresh)
      118    3.313    0.028    3.313    0.028 {built-in method stack}
    14732    2.889    0.000    2.889    0.000 {method 'max' of 'torch._C._TensorBase' objects}
    14732    2.883    0.000    2.883    0.000 {method 'min' of 'torch._C._TensorBase' objects}
    14732    2.773    0.000  255.430    0.017 data_block.py:481(__getitem__)
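
`%prun` is an IPython magic, so it only works inside a notebook or IPython session. Outside of one, a similar table sorted by `tottime` can be produced with the standard library's `cProfile` and `pstats`. The sketch below is self-contained; `busy_work` is a made-up stand-in workload and `profile_call` is a hypothetical helper, so substitute the training call you actually want to measure.

```python
import cProfile
import io
import pstats


def busy_work(n=20000):
    """Stand-in workload; replace with the call you want to profile."""
    return sum(i * i for i in range(n))


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return its result and a tottime-sorted report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "tottime" is time spent inside each function itself, excluding callees.
    stats.sort_stats("tottime").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(busy_work)
print(report)
```

Comparing `tottime` with `cumtime` helps separate functions that are slow themselves from functions that merely call something slow.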

KarlH · January 28, 2019, 2:44am

CPU usage is likely loading and augmenting the images.

As for the GPU usage, did you run this after unfreezing the model? If the model is still frozen, you’re only training the last linear layer, which is computationally light. Try unfreezing the model and increasing the batch size to the largest your GPU can hold.

mgorecki · January 28, 2019, 9:33am

I tried both the frozen and the unfrozen model and the behaviour is consistent: about 35-40% CPU usage (i7-8700K) and some spikes of GPU usage. The input data are on an SSD.

I’m getting results consistent with the previous comments. It takes over 10 minutes, not 1.5 minutes as in the video. Adjusting the batch size doesn’t change much.

mgorecki · January 28, 2019, 11:17am

It seems the slowdown is a result of running the data loader with num_workers=0. GPU usage goes up when resnet50 is used. I’ve spent a long time googling it, and it seems we have to live with it until PyTorch is fixed for the Windows platform.
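
For scripts shared between machines, the worker count can be chosen per platform instead of being hard-coded. The following is only a sketch under that assumption: `pick_num_workers` is a hypothetical helper (not part of fastai or PyTorch), and its result is meant to be passed as the `num_workers` argument of a PyTorch `DataLoader`.

```python
import os
import platform


def pick_num_workers(system=None, cpu_count=None):
    """Return 0 on Windows, otherwise the number of CPUs (at least 1)."""
    system = platform.system() if system is None else system
    cpu_count = (os.cpu_count() or 1) if cpu_count is None else cpu_count
    if system == "Windows":
        # Load batches in the main process; no worker processes are started.
        return 0
    return cpu_count


print(pick_num_workers())
```

With 0 workers the batches are loaded in the main process, which avoids starting worker processes on Windows at the cost of losing parallel loading.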