Broken pipe: how to troubleshoot fastai?

I would like someone who understands fastai to address this. I am using Visual Studio Code and was able to run the code in 01_intro.ipynb on my computer until yesterday, when I got the error below (Traceback included).

I would like to know if there’s a fastai resource that provides information about errors like this, so that I could gain enough understanding to troubleshoot them myself. In this case the most recent call, which I assume is the call that actually failed, is at the very bottom of the traceback, but I have copied it here for convenience:

---> 60 ForkingPickler(file, protocol).dump(obj)

Any suggestions about how to troubleshoot fastai would be greatly appreciated, as this code worked and then simply stopped working. Is there troubleshooting documentation available?

The Traceback is as follows:


BrokenPipeError Traceback (most recent call last)
<ipython-input> in <module>
11
12 learn = cnn_learner(dls, resnet34, metrics=error_rate)
---> 13 learn.fine_tune(1)

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\callback\schedule.py in fine_tune(self, epochs, base_lr, freeze_epochs, lr_mult, pct_start, div, **kwargs)
155 "Fine tune with freeze for freeze_epochs then with unfreeze from epochs using discriminative LR"
156 self.freeze()
--> 157 self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
158 base_lr /= 2
159 self.unfreeze()

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\callback\schedule.py in fit_one_cycle(self, n_epoch, lr_max, div, div_final, pct_start, wd, moms, cbs, reset_opt)
110 scheds = {'lr': combined_cos(pct_start, lr_max/div, lr_max, lr_max/div_final),
111 'mom': combined_cos(pct_start, *(self.moms if moms is None else moms))}
--> 112 self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
113
114 # Cell

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
203 self.opt.set_hypers(lr=self.lr if lr is None else lr)
204 self.n_epoch = n_epoch
--> 205 self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
206
207 def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\learner.py in _with_events(self, f, event_type, ex, final)
152
153 def _with_events(self, f, event_type, ex, final=noop):
--> 154 try: self(f'before_{event_type}') ;f()
155 except ex: self(f'after_cancel_{event_type}')
156 finally: self(f'after_{event_type}') ;final()

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\learner.py in _do_fit(self)
194 for epoch in range(self.n_epoch):
195 self.epoch=epoch
--> 196 self._with_events(self._do_epoch, 'epoch', CancelEpochException)
197
198 def fit(self, n_epoch, lr=None, wd=None, cbs=None, reset_opt=False):

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\learner.py in _with_events(self, f, event_type, ex, final)
152
153 def _with_events(self, f, event_type, ex, final=noop):
--> 154 try: self(f'before_{event_type}') ;f()
155 except ex: self(f'after_cancel_{event_type}')
156 finally: self(f'after_{event_type}') ;final()

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\learner.py in _do_epoch(self)
188
189 def _do_epoch(self):
--> 190 self._do_epoch_train()
191 self._do_epoch_validate()
192

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\learner.py in _do_epoch_train(self)
180 def _do_epoch_train(self):
181 self.dl = self.dls.train
--> 182 self._with_events(self.all_batches, 'train', CancelTrainException)
183
184 def _do_epoch_validate(self, ds_idx=1, dl=None):

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\learner.py in _with_events(self, f, event_type, ex, final)
152
153 def _with_events(self, f, event_type, ex, final=noop):
--> 154 try: self(f'before_{event_type}') ;f()
155 except ex: self(f'after_cancel_{event_type}')
156 finally: self(f'after_{event_type}') ;final()

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\learner.py in all_batches(self)
158 def all_batches(self):
159 self.n_iter = len(self.dl)
--> 160 for o in enumerate(self.dl): self.one_batch(*o)
161
162 def _do_one_batch(self):

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\data\load.py in __iter__(self)
99 self.before_iter()
100 self.__idxs=self.get_idxs() # called in context of main process (not workers/subprocesses)
--> 101 for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
102 if self.device is not None: b = to_device(b, self.device)
103 yield self.after_batch(b)

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
799 # before it starts, and __del__ tries to join but will get:
800 # AssertionError: can only join a started process.
--> 801 w.start()
802 self._index_queues.append(index_queue)
803 self._workers.append(w)

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\multiprocessing\process.py in start(self)
103 'daemonic processes are not allowed to have children'
104 _cleanup()
--> 105 self._popen = self._Popen(self)
106 self._sentinel = self._popen.sentinel
107 _children.add(self)

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py in _Popen(process_obj)
221 @staticmethod
222 def _Popen(process_obj):
--> 223 return _default_context.get_context().Process._Popen(process_obj)
224
225 class DefaultContext(BaseContext):

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py in _Popen(process_obj)
320 def _Popen(process_obj):
321 from .popen_spawn_win32 import Popen
--> 322 return Popen(process_obj)
323
324 class SpawnContext(BaseContext):

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
63 try:
64 reduction.dump(prep_data, to_child)
---> 65 reduction.dump(process_obj, to_child)
66 finally:
67 set_spawning_popen(None)

C:\Users\jbiss\AppData\Local\Programs\Python\Python36\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
58 def dump(obj, file, protocol=None):
59 '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60 ForkingPickler(file, protocol).dump(obj)
61
62 #

BrokenPipeError: [Errno 32] Broken pipe
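From what I can tell, the bottom frames are not fastai code at all: they are Python's multiprocessing starting a DataLoader worker. On Windows, new processes use the spawn start method, and the parent pickles state to the child over a pipe (the ForkingPickler(file, protocol).dump(obj) call above); if the child dies during that handoff, the parent sees a BrokenPipeError. Here is a minimal sketch of the same mechanism, independent of fastai (my own illustration, not from the notebook):

import multiprocessing as mp

def square(i):
    return i * i

# Windows has no fork(); 'spawn' re-imports this module in each child and
# sends the work to it over a pipe via pickle -- the ForkingPickler.dump(obj)
# call in the traceback above. Without this __main__ guard, the child would
# re-run the top-level code on import and the pipe can break.
if __name__ == '__main__':
    with mp.Pool(processes=2) as pool:
        print(pool.map(square, range(4)))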

Sorry, I don’t know what’s wrong, but fastai is not supported on Windows. The best approach would be to use WSL / Linux.

Florian,

I didn’t know that, but the fastai notebook code ran successfully in my Windows 10 Visual Studio Code notebook until it didn’t! I’ll use my Paperspace notebook from now on.

However, going forward and on a supported OS: when I read Learner.fine_tune I see nothing that implies pipes, so it seems the problem lies with Visual Studio Code or the Jupyter notebook rather than with the fastai code, correct?

Jeff

When using Windows with fastai (which is supported), you need to set num_workers=0 in your DataLoaders, because multiprocessing doesn’t work there at the moment. Can you try that and see if it fixes the pipe error?
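For reference, num_workers is an explicit parameter on fastai's low-level DataLoader, which is the class the traceback passes through. A minimal sketch, assuming fastai v2's fastai.data.load.DataLoader:

from fastai.data.load import DataLoader

# num_workers=0 keeps all batch loading in the main process, so Windows never
# has to spawn (and pickle state across to) worker subprocesses.
dl = DataLoader(list(range(8)), bs=4, num_workers=0)
for b in dl: print(b)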

Thanks for your reply. I don’t see num_workers in ImageDataLoaders.from_name_func so I tried it in the code as follows:

# CLICK ME
from fastai.vision.all import *
print(__name__)
if __name__ == '__main__':
    path = untar_data(URLs.PETS)/'images'
    num_workers=0
    def is_cat(x): return x[0].isupper()
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(224))

    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    learn.fine_tune(1)

but this doesn’t resolve the problem.

I see this discussed in DataLoader, but that is the DataLoader class, not ImageDataLoaders as used in the 01_intro.ipynb code. What am I not understanding here? Is ImageDataLoaders a child of DataLoader to which I somehow need to pass the num_workers value?

dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(224), num_workers=0)

Try this ^

That did it! It seems so obvious now that you’ve pointed it out. I now assume that ImageDataLoaders is derived from DataLoader, since the DataLoader class shows num_workers as a parameter, and so all of the parameters shown there apply to ImageDataLoaders as well, correct?

Sorry for not thinking along those lines from the start, and thanks for your time!


Yes, you are correct.
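If it helps to see it concretely: the factory methods accept extra keyword arguments and forward them to the loaders they build, which is why num_workers=0 works even though it does not appear in from_name_func's signature. A quick check, reusing the same pets setup as above (fake_l is the internal loader object visible in the traceback):

from fastai.vision.all import *

def is_cat(x): return x[0].isupper()

if __name__ == '__main__':
    path = untar_data(URLs.PETS)/'images'
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(224), num_workers=0)
    # The kwarg reaches the underlying loaders: both lines should print 0.
    print(dls.train.fake_l.num_workers)
    print(dls.valid.fake_l.num_workers)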

Hello, when I use your solution the training speed is slow:

|epoch|train_loss|valid_loss|error_rate|time|
|---|---|---|---|---|
|0|0.160340|0.023027|0.007442|09:57|

|epoch|train_loss|valid_loss|error_rate|time|
|---|---|---|---|---|
|0|0.053785|0.014080|0.004736|12:20|

My CPU: AMD Ryzen 3600
My GPU: RTX 2060
How can I resolve it? Thank you!