I'm setting up to run the NIH malaria dataset (26K images) and it constantly bails with this EOFError after the first few batches:
(link to dataset: https://ceb.nlm.nih.gov/proj/malaria/cell_images.zip)
```
EOFError                                  Traceback (most recent call last)
in <module>
----> 1 learn.fit(1)

/usr/local/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    195         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    196         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
--> 197         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    198
    199     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
     97             cb_handler.set_dl(learn.data.train_dl)
     98             cb_handler.on_epoch_begin()
---> 99             for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
    100                 xb, yb = cb_handler.on_batch_begin(xb, yb)
    101                 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)

/usr/local/lib/python3.6/site-packages/fastprogress/fastprogress.py in __iter__(self)
     70         self.update(0)
     71         try:
---> 72             for i,o in enumerate(self._gen):
     73                 if i >= self.total: break
     74                 yield o

/usr/local/lib/python3.6/site-packages/fastai/basic_data.py in __iter__(self)
     73     def __iter__(self):
     74         "Process and returns items from `DataLoader`."
---> 75         for b in self.dl: yield self.proc_batch(b)
     76
     77     @classmethod

/usr/local/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    635                 self.reorder_dict[idx] = batch
    636                 continue
--> 637             return self._process_next_batch(batch)
    638
    639     next = __next__  # Python 2 compatibility

/usr/local/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
    656         self._put_indices()
    657         if isinstance(batch, ExceptionWrapper):
--> 658             raise batch.exc_type(batch.exc_msg)
    659         return batch
    660

EOFError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python3.6/site-packages/fastai/data_block.py", line 648, in __getitem__
    if self.item is None: x,y = self.x[idxs],self.y[idxs]
  File "/usr/local/lib/python3.6/site-packages/fastai/data_block.py", line 118, in __getitem__
    if isinstance(idxs, Integral): return self.get(idxs)
  File "/usr/local/lib/python3.6/site-packages/fastai/vision/data.py", line 271, in get
    res = self.open(fn)
  File "/usr/local/lib/python3.6/site-packages/fastai/vision/data.py", line 267, in open
    return open_image(fn, convert_mode=self.convert_mode, after_open=self.after_open)
  File "/usr/local/lib/python3.6/site-packages/fastai/vision/image.py", line 393, in open_image
    x = PIL.Image.open(fn).convert(convert_mode)
  File "/usr/local/lib/python3.6/site-packages/PIL/Image.py", line 915, in convert
    self.load()
  File "/usr/local/lib/python3.6/site-packages/PIL/ImageFile.py", line 250, in load
    self.load_end()
  File "/usr/local/lib/python3.6/site-packages/PIL/PngImagePlugin.py", line 677, in load_end
    self.png.call(cid, pos, length)
  File "/usr/local/lib/python3.6/site-packages/PIL/PngImagePlugin.py", line 140, in call
    return getattr(self, "chunk_" + cid.decode('ascii'))(pos, length)
  File "/usr/local/lib/python3.6/site-packages/PIL/PngImagePlugin.py", line 356, in chunk_IDAT
    raise EOFError
EOFError
```
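From what I can tell, `chunk_IDAT` raising EOFError means PIL ran out of bytes mid-image, i.e. at least one PNG on disk is truncated. For reference, here's a quick stdlib-only scan I put together that should flag any PNG that lost its tail (the `cell_images` path is just a placeholder for wherever the dataset was extracted):

```python
import os

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
IEND_TAIL = b"IEND\xaeB`\x82"  # "IEND" chunk type followed by its fixed CRC

def png_looks_complete(path):
    """Cheap sanity check: PNG signature at the front, IEND chunk at the end."""
    if os.path.getsize(path) < len(PNG_SIGNATURE) + 12:  # too small to be a real PNG
        return False
    with open(path, "rb") as f:
        if f.read(len(PNG_SIGNATURE)) != PNG_SIGNATURE:
            return False                 # not a PNG at all
        f.seek(-len(IEND_TAIL), os.SEEK_END)
        return f.read() == IEND_TAIL     # a complete PNG always ends with IEND + its CRC

def find_truncated_pngs(root):
    """Walk a directory tree and return the paths of suspect .png files."""
    bad = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".png"):
                path = os.path.join(dirpath, name)
                if not png_looks_complete(path):
                    bad.append(path)
    return bad

if __name__ == "__main__":
    for path in find_truncated_pngs("cell_images"):
        print("truncated:", path)
```

If I remember right, fastai v1 also ships a `verify_images` helper that fully decodes each file (and can delete the broken ones), which would catch corruption this cheap tail check can't.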
There’s nothing special going on here - 128x128 files, batch size of 90.
My ImageDataBunch displays fine (i.e. show_batch), etc.
But once I start training, it gets a few batches in and blows up.
Things I tried to fix:
1 - Completely scrubbed the training directories in case of file corruption. Re-uploaded everything.
2 - There was a Thumbs.db in each category dir…removed those in case the loader was somehow trying to read them.
3 - Upgraded fastai just to be safe.
4 - Tried two different nets in case that was the issue (it was not).
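One thing I haven't ruled out yet is the download/unzip step itself. The stdlib's zipfile module can CRC-check every member of the archive before extracting again; a minimal sketch (assuming cell_images.zip is sitting in the working dir):

```python
import zipfile

def check_archive(path):
    """CRC-check every member of a zip; return the first corrupt name, or None if clean."""
    with zipfile.ZipFile(path) as zf:
        return zf.testzip()  # reads each entry and verifies its stored CRC
```

So `check_archive("cell_images.zip")` returning a filename would mean the archive itself is damaged and re-downloading (not re-uploading the extracted files) is the fix.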
The main difference I can see is that this dataset is quite large (26K files)…but surely others have worked with much larger sets and not hit this…
Anyway, I've spent hours on this, so if anyone has any insight, that would be great!