Runtime error while running prediction on the planet dataset

I am getting the following runtime error when running learn.TTA(is_test=True). The only thing I am doing differently from the lesson2 notebook is loading the model with:

learn.set_data(get_data(sz))
learn.load('256')
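
(For reference, my get_data is essentially the helper from the lesson2 planet notebook; the snippet below is reproduced from memory, so the exact paths and variable names are my assumption.)

from fastai.conv_learner import *   # fastai 0.7 course library

PATH = 'data/planet/'               # assumed data location
label_csv = f'{PATH}train_v2.csv'
n = len(list(open(label_csv))) - 1
val_idxs = get_cv_idxs(n)
f_model = resnet34

def get_data(sz):
    # planet-style augmentations: top-down flips/rotations plus a small zoom
    tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_top_down, max_zoom=1.05)
    # test_name is what makes data.test_dl available for TTA(is_test=True)
    return ImageClassifierData.from_csv(PATH, 'train-jpg', label_csv, tfms=tfms,
                                        suffix='.jpg', val_idxs=val_idxs,
                                        test_name='test-jpg')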

Here is the stack trace:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-30-e855bad3db73> in <module>()
----> 1 log_preds, y = learn.TTA(is_test=True)

~/DL/fastai/courses/dl1/fastai/learner.py in TTA(self, n_aug, is_test)
    148         dl1 = self.data.test_dl     if is_test else self.data.val_dl
    149         dl2 = self.data.test_aug_dl if is_test else self.data.aug_dl
--> 150         preds1,targs = predict_with_targs(self.model, dl1)
    151         preds1 = [preds1]*math.ceil(n_aug/4)
    152         preds2 = [predict_with_targs(self.model, dl2)[0] for i in range(n_aug)]

~/DL/fastai/courses/dl1/fastai/model.py in predict_with_targs(m, dl)
    115     if hasattr(m, 'reset'): m.reset()
    116     preda,targa = zip(*[(get_prediction(m(*VV(x))),y)
--> 117                         for *x,y in iter(dl)])
    118     return to_np(torch.cat(preda)), to_np(torch.cat(targa))
    119 

~/DL/fastai/courses/dl1/fastai/model.py in <listcomp>(.0)
    114     m.eval()
    115     if hasattr(m, 'reset'): m.reset()
--> 116     preda,targa = zip(*[(get_prediction(m(*VV(x))),y)
    117                         for *x,y in iter(dl)])
    118     return to_np(torch.cat(preda)), to_np(torch.cat(targa))

~/DL/fastai/courses/dl1/fastai/dataset.py in __next__(self)
    226         if self.i>=len(self.dl): raise StopIteration
    227         self.i+=1
--> 228         return next(self.it)
    229 
    230     @property

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    193         while True:
    194             assert (not self.shutdown and self.batches_outstanding > 0)
--> 195             idx, batch = self.data_queue.get()
    196             self.batches_outstanding -= 1
    197             if idx != self.rcvd_idx:

~/anaconda3/lib/python3.6/multiprocessing/queues.py in get(self)
    335             res = self._reader.recv_bytes()
    336         # unserialize the data after having released the lock
--> 337         return _ForkingPickler.loads(res)
    338 
    339     def put(self, obj):

~/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py in rebuild_storage_fd(cls, df, size)
     68         fd = multiprocessing.reduction.rebuild_handle(df)
     69     else:
---> 70         fd = df.detach()
     71     try:
     72         storage = storage_from_cache(cls, fd_id(fd))

~/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py in detach(self)
     56             '''Get the fd.  This should only be called once.'''
     57             with _resource_sharer.get_connection(self._id) as conn:
---> 58                 return reduction.recv_handle(conn)
     59 
     60 

~/anaconda3/lib/python3.6/multiprocessing/reduction.py in recv_handle(conn)
    180         '''Receive a handle over a local connection.'''
    181         with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
--> 182             return recvfds(s, 1)[0]
    183 
    184     def DupFd(fd):

~/anaconda3/lib/python3.6/multiprocessing/reduction.py in recvfds(sock, size)
    159             if len(ancdata) != 1:
    160                 raise RuntimeError('received %d items of ancdata' %
--> 161                                    len(ancdata))
    162             cmsg_level, cmsg_type, cmsg_data = ancdata[0]
    163             if (cmsg_level == socket.SOL_SOCKET and

RuntimeError: received 0 items of ancdata

Thank you in advance.

Yeah, many of us had this issue this week. Take a look at https://github.com/fastai/fastai/issues/23

Try the fix I suggested at the end of that thread to increase the ulimit, and post back if it works for you.
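
From memory, the gist is to raise the open-file limit of the notebook process before creating the data loaders. One way to do it from inside the notebook is something like this (4096 is an arbitrary choice, not a magic number):

import resource

# check the current (soft, hard) limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# raise the soft limit; it must not exceed the hard limit
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))

Running ulimit -n 4096 in the shell before starting Jupyter achieves the same thing for that shell session.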

@ramesh thanks :slight_smile: it solved the error I was having. I will try to look at the code to find out why so many file descriptors are kept open. Thanks again.

Got a runtime error while running prediction on the planet dataset on Paperspace too (machine specs: 30 GB RAM, P4000). The process just kept getting killed without leaving any stack trace.
After reading the suggestions at https://github.com/fastai/fastai/issues/23, I tried decreasing num_workers in ImageClassifierData to 2 (down from the default of 8), and it worked. It took about 15 minutes to run, though.
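
Concretely, I just passed num_workers when building the data object. A sketch, assuming the lesson2-style from_csv call (PATH, label_csv, val_idxs, tfms as in that notebook; num_workers is the only thing I changed):

data = ImageClassifierData.from_csv(PATH, 'train-jpg', label_csv, tfms=tfms,
                                    suffix='.jpg', val_idxs=val_idxs,
                                    test_name='test-jpg',
                                    num_workers=2)  # fewer loader processes than the default of 8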

I’m assuming that, unlike training, testing doesn’t take place in batches, hence the memory scarcity. Is that correct? If so, what might be the reason?