EOF error with malaria dataset after the first few batches… any ideas?

I’m setting up to run the NIH malaria dataset (26K images) and it consistently fails with this EOFError after the first few batches:
(link to dataset: https://ceb.nlm.nih.gov/proj/malaria/cell_images.zip)

EOFError                                  Traceback (most recent call last)

----> 1 learn.fit(1)

/usr/local/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    195         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    196         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
--> 197         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    199     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
     97     cb_handler.set_dl(learn.data.train_dl)
     98     cb_handler.on_epoch_begin()
---> 99     for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
    100         xb, yb = cb_handler.on_batch_begin(xb, yb)
    101         loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)

/usr/local/lib/python3.6/site-packages/fastprogress/fastprogress.py in __iter__(self)
     70         self.update(0)
     71         try:
---> 72             for i,o in enumerate(self._gen):
     73                 if i >= self.total: break
     74                 yield o

/usr/local/lib/python3.6/site-packages/fastai/basic_data.py in __iter__(self)
     73     def __iter__(self):
     74         "Process and returns items from `DataLoader`."
---> 75         for b in self.dl: yield self.proc_batch(b)
     77     @classmethod

/usr/local/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    635                 self.reorder_dict[idx] = batch
    636                 continue
--> 637             return self._process_next_batch(batch)
    639     next = __next__  # Python 2 compatibility

/usr/local/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
    656         self._put_indices()
    657         if isinstance(batch, ExceptionWrapper):
--> 658             raise batch.exc_type(batch.exc_msg)
    659         return batch

EOFError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python3.6/site-packages/fastai/data_block.py", line 648, in __getitem__
    if self.item is None: x,y = self.x[idxs],self.y[idxs]
  File "/usr/local/lib/python3.6/site-packages/fastai/data_block.py", line 118, in __getitem__
    if isinstance(idxs, Integral): return self.get(idxs)
  File "/usr/local/lib/python3.6/site-packages/fastai/vision/data.py", line 271, in get
    res = self.open(fn)
  File "/usr/local/lib/python3.6/site-packages/fastai/vision/data.py", line 267, in open
    return open_image(fn, convert_mode=self.convert_mode, after_open=self.after_open)
  File "/usr/local/lib/python3.6/site-packages/fastai/vision/image.py", line 393, in open_image
    x = PIL.Image.open(fn).convert(convert_mode)
  File "/usr/local/lib/python3.6/site-packages/PIL/Image.py", line 915, in convert
  File "/usr/local/lib/python3.6/site-packages/PIL/ImageFile.py", line 250, in load
  File "/usr/local/lib/python3.6/site-packages/PIL/PngImagePlugin.py", line 677, in load_end
    self.png.call(cid, pos, length)
  File "/usr/local/lib/python3.6/site-packages/PIL/PngImagePlugin.py", line 140, in call
    return getattr(self, "chunk_" + cid.decode("ascii"))(pos, length)
  File "/usr/local/lib/python3.6/site-packages/PIL/PngImagePlugin.py", line 356, in chunk_IDAT
    raise EOFError

There’s nothing special going on here: 128x128 files, batch size of 90.
My ImageDataBunch displays fine (e.g. show_batch works), etc.
But once I start training, it gets a few batches in and blows up.

Things I tried to fix:
1 - Completely scrubbed the training directories in case of file corruption and re-uploaded everything.
2 - There was a thumbs.db in each category dir; removed those in case it was somehow trying to load that.
3 - Upgraded fastai just to be safe.
4 - Tried two different nets in case that was the issue (it was not).

The main difference I can see is that this dataset is quite large (26K files), but surely others have worked with much larger datasets without hitting this…

Anyway, I’ve spent hours on this so if anyone has any insight, that would be great!

And a view of the ImageDataBunch:

Transforms =  2

['Parasitized', 'Uninfected']

Train: LabelList (22047 items)
x: ImageList
Image (3, 128, 128),Image (3, 128, 128),Image (3, 128, 128),Image (3, 128, 128),Image (3, 128, 128)
y: CategoryList
Path: data/train;

Valid: LabelList (5511 items)
x: ImageList
Image (3, 128, 128),Image (3, 128, 128),Image (3, 128, 128),Image (3, 128, 128),Image (3, 128, 128)
y: CategoryList
Path: data/train;

Test: None

I was not able to repeat this issue in Google Colab. I did the following:

!wget https://ceb.nlm.nih.gov/proj/malaria/cell_images.zip
!unzip cell_images.zip

from fastai.vision import *

path = Path('cell_images')
data = ImageDataBunch.from_folder(path, valid_pct=0.2, size=128)

learn = cnn_learner(data, models.resnet34, metrics=accuracy)

I also tried with just learn.fit() and did not see your error either. How are you extracting the file? And how are you setting up your databunch, etc.?

A couple of things you could do as a pre-check when dealing with a large number of images, before you actually start training:

1. Run a quick script to check that all the images are readable and no files are corrupt. This script prints an exception if it’s not able to read (and fully decode) any of the files, so you can isolate the bad one:

import os
from PIL import Image

DIR = '../images/Uninfected/'
for filename in os.listdir(DIR):
	if filename.endswith('.png'):
		try:
			img = Image.open(DIR + filename)
			img.load()  # Image.open is lazy; force a full decode to catch truncated files
		except Exception as e1:
			print("Exception: " + str(e1) + " -- " + filename)

2. Allow PIL to load partial/truncated images (I’m not sure if this is relevant here).
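For point 2, the relevant knob is PIL’s ImageFile.LOAD_TRUNCATED_IMAGES flag. A minimal sketch; note this papers over bad files rather than fixing them, so I’d treat it as a diagnostic only:

```python
from PIL import ImageFile

# When True, PIL pads out truncated image data instead of raising
# EOFError/OSError during decode. Handy to confirm that truncation is
# the problem, but it silently accepts damaged images.
ImageFile.LOAD_TRUNCATED_IMAGES = True
```

Set it before the DataLoader workers start opening images, and remember to turn it back off once you’ve identified the bad files.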


In your case you could:
a) Try reproducing the error with a smaller subset (instead of the entire dataset of ~27K images) split into train and valid.
b) Try a different batch size. I ran with a batch size of 64 and it worked; I see you mentioned your batch size is 90 (though this shouldn’t be a problem, I’m not sure).
Just out of curiosity, I tried reproducing it, running with a high lr. It works fine on my system (looks like a download issue?? Not sure again):

learn.fit_one_cycle(1,max_lr = 1e-2)


Train: LabelList (23015 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
Path: /media/***/New Volume/data/malaria;

Valid: LabelList (4543 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
Path: /media/***/New Volume/data/malaria;

Test: None

Thanks @muellerzr and @piby4 for the help here! At least I now know it’s something specific to my platform and not generic to fastai or the dataset!

1 - I ran @piby4’s file-checker code through all the images with no issue, so I can at least confirm it’s not a file corruption issue. (Thanks for the code!)

2 - I mimicked exactly the simple setup (no transforms, etc.) @muellerzr used, to see if it could be a transform issue… no luck.

3 - I tried smaller batch sizes all the way down to 5… no luck.

At least the error is easy to reproduce. With all of the above eliminated, I’m starting to wonder if it’s something in the OS/platform interaction (I’m running on FloydHub), though I’ve never had an issue like this before and have used 10K-image sets there.
The one difference I can see is that I’ve never split a folder by pct before; it’s always been CSV by percent or ImageNet-style folders.
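For anyone curious what a valid_pct-style split actually does, it’s conceptually just a shuffle and slice over the file paths. An illustrative sketch (my own stand-in code, not fastai’s actual implementation):

```python
import random

# Stand-in file list; in practice these would be the image paths on disk.
files = [f'img_{i:05d}.png' for i in range(100)]

random.seed(0)            # just so the demo is repeatable
shuffled = files[:]
random.shuffle(shuffled)

cut = int(len(shuffled) * 0.2)            # valid_pct=0.2
valid, train = shuffled[:cut], shuffled[cut:]
print(len(train), len(valid))             # 80 20
```

The split itself never opens the image files; it only partitions the path list.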

I’m going to try @piby4’s truncated-images suggestion next, then try reworking with a different split setup, and finally set up on my slower laptop with a subset of images.

I’ll post with any news - thanks again!

Well, I’m throwing in the towel on this after 5+ hours. I ran the pets notebook quickly without changing anything as a basic test, and it loaded and ran as expected.
It’s clearly (to me) some issue with loading by pct on the FloydHub platform.

I re-downloaded one more time directly from NIH to FloydHub as a separate dataset and tested with that: exact same issue.
Basically, the first epoch always gets 8–25% of the way through and then blows up with the EOF issue.
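For reference, the EOFError in the trace above is what PIL raises when a PNG’s compressed data ends early. The failure mode can be seen in isolation with a deliberately truncated file; a standalone sketch, nothing FloydHub-specific:

```python
import io
from PIL import Image

# Build a small valid PNG in memory, then chop off its tail to simulate
# a partially written / corrupted file.
buf = io.BytesIO()
Image.new('RGB', (64, 64), color=(200, 30, 30)).save(buf, format='PNG')
data = buf.getvalue()

try:
    img = Image.open(io.BytesIO(data[:len(data) // 2]))  # open() only reads the header
    img.load()  # the actual decode fails here, deep in the PNG chunk reader
except (EOFError, OSError) as e:
    print('decode failed:', repr(e))
```

Note that Image.open succeeds even on the truncated data, since it only parses the header; the error surfaces at load time, which is exactly why training dies mid-epoch rather than when the databunch is built.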

I’ll go look for a new platform (Salamander, I think) and rerun.

Well, presto chango: ran on Salamander = 100% no EOF issue.
So, clearly some issue with the OS or similar on FloydHub and this dataset.

Of note, the CPU on Salamander seems way faster: on FloydHub, just unpacking the dataset files took about 4 minutes vs about 30 seconds on Salamander.
So it’s possible this is simply the CPU not being able to keep up with loading images and serving the next batch?

Anyway, thanks for the help here! At least I can move forward with working on the malaria dataset now!