Training data before and after processing with get_data have different lengths

I am having a weird issue. When I get the training and validation images using get_batches, the output is what I expect and is shown below. As expected, there are 22980 images (11490 per class) in the train directory and 2000 images (1000 per class) in the valid directory.

Found 22980 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.

However, after calling get_data on train_batches and val_batches, the shapes of train_data and val_data are not what I expect; they are shown below. As you can see, there are only 22976 and 1984 images in train_data and val_data respectively. Any idea what's going wrong? I am using get_data so that I can save the processed arrays with bcolz and avoid reprocessing the images from disk every time.

(22976, 3, 224, 224)
(1984, 3, 224, 224)

I have included the relevant code below for more details.

def get_batches(dirname, gen=image.ImageDataGenerator(), shuffle=True,
                batch_size=4, class_mode='categorical'):
    return gen.flow_from_directory(dirname, target_size=(224,224),
                                   class_mode=class_mode, shuffle=shuffle, batch_size=batch_size)

DATA_DIR = "data/dogs-vs-cats-redux-kernels-edition/"

batch_size = 64
train_batches = get_batches(DATA_DIR + 'train', batch_size = batch_size)
val_batches = get_batches(DATA_DIR + 'valid', batch_size = batch_size)

val_data = get_data(val_batches)
train_data = get_data(train_batches)

print(train_data.shape) ## Prints (22976, 3, 224, 224)
print(val_data.shape) ## Prints (1984, 3, 224, 224)

val_classes = val_batches.classes
train_classes = train_batches.classes
val_labels = onehot(val_classes)
train_labels = onehot(train_classes)

print(train_labels.shape) ## Prints (22980, 2)
print(val_labels.shape)  ## Prints (2000, 2)
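
For reference, the two helpers used above that aren't shown here, onehot and the bcolz save/load functions, look roughly like this in my setup. This is a sketch of my local versions, not necessarily the exact code in utils.py:

import bcolz
from keras.utils.np_utils import to_categorical

def onehot(x):
    # Turn integer class indices (e.g. [0, 1, 1, 0]) into one-hot rows.
    return to_categorical(x)

def save_array(fname, arr):
    # Persist a numpy array to disk as a compressed bcolz carray.
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

def load_array(fname):
    # Read the whole carray back into memory as a numpy array.
    return bcolz.open(fname)[:]

After a run I save_array the processed arrays and load_array them in later sessions instead of decoding the jpgs again.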

Here’s some more info:

train_batches.nb_sample # 22980 as expected

I verified the number of jpg files in the train and valid directories. There are a total of 22980 and 2000 images in them. Is it possible that some of the images are corrupted and hence we are unable to process them?

So I tried the exact same steps on a completely different dataset, this time the statefarm images, and I still see the same issue.

The number of images in the train directory is 20824, but after saving them as processed arrays using bcolz the shape of the array is (20800, 3, 224, 224). So this is not specific to one dataset; it seems to be consistent across others as well.

@jeremy: Any idea what's going on?

You should grab the updated version of get_data from platform.ai. It takes a path instead of a generator. The problem is that the get_data I originally wrote (the one you're using) truncates the data if the number of items is not a multiple of the batch size, so the new version enforces a batch size of 1.
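
That truncation explains the numbers you're seeing: 22980 rounds down to 359 × 64 = 22976, and 2000 rounds down to 31 × 64 = 1984. The updated version just walks the directory one image at a time, roughly like this (a sketch of the idea, not the exact code):

import numpy as np

def get_data(path):
    # With batch_size=1 every image is read exactly once, so nothing is
    # dropped when the image count isn't a multiple of the batch size.
    batches = get_batches(path, shuffle=False, batch_size=1, class_mode=None)
    return np.concatenate([batches.next() for i in range(batches.nb_sample)])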


@jeremy: Thanks! That did the trick.

I did face a minor issue though. In utils.py the cv2 library is imported but never used. I did not have OpenCV installed, so importing utils failed. I commented out the import statement since it isn't needed. Maybe we should update utils.zip with this change?
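
If you'd rather keep the import around, another option would be to guard it so utils stays importable without OpenCV, something like this (just a suggestion, not what's in utils.py now):

try:
    import cv2  # only needed by the OpenCV-based helpers
except ImportError:
    cv2 = None  # utils still imports when OpenCV isn't installed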

Thanks for the suggestion - we'll probably be using it soon, so I might just leave it there. You should be able to 'conda install opencv' BTW if you want to try it.