Reading serialized arrays instead of images

echon · March 25, 2018, 11:38am

I’d like to train an image classifier with a dataset made up of serialized numpy arrays that can be read in with
numpy.load(path)
The arrays live in the same kind of directory tree as the ImageClassifierData.from_paths expects.
I would like to get the same functionality as provided by “from_paths”, but with the files read through numpy.load instead of the current open_image (which uses cv2).

What would be the best/simplest way to achieve such functionality? Copy/modify the source code? Inherit from ImageClassifierData?

radek · March 25, 2018, 11:53am

Depending on the size of your dataset and amount of available RAM, you might get a lot of mileage out of:

    def from_arrays(cls, path, trn, val, bs=64, tfms=(None,None), classes=None, num_workers=4, test=None):
        """ Read in images and their labels given as numpy arrays

which is a class method on ImageClassifierData: first load the arrays into memory, then feed them to ImageClassifierData.
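A minimal sketch of the "load first, then feed" step, assuming every array has the same shape and the files sit in one sub-folder per class. The helper name load_array_folder is made up for illustration, and the commented-out from_arrays call assumes trn and val are (x, y) tuples, which is worth checking against the docstring of your fastai version.

```python
import os
import numpy as np

def load_array_folder(root):
    """Load every .npy file under root/<class_name>/ into memory.

    Returns (x, y, classes): the stacked arrays, integer labels, and the
    sorted class names. Assumes all arrays share one shape.
    """
    classes = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))
    xs, ys = [], []
    for label, cls_name in enumerate(classes):
        cls_dir = os.path.join(root, cls_name)
        for fname in sorted(os.listdir(cls_dir)):
            if fname.endswith('.npy'):
                xs.append(np.load(os.path.join(cls_dir, fname)))
                ys.append(label)
    return np.stack(xs), np.array(ys), classes

# Hypothetical usage (PATH and tfms are placeholders, not defined here):
# trn_x, trn_y, classes = load_array_folder(os.path.join(PATH, 'train'))
# val_x, val_y, _ = load_array_folder(os.path.join(PATH, 'valid'))
# data = ImageClassifierData.from_arrays(PATH, (trn_x, trn_y), (val_x, val_y),
#                                        bs=64, tfms=tfms, classes=classes)
```

This only makes sense if the whole dataset fits in RAM, which is the caveat mentioned above.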

I am sure you could hack the ImageClassifierData object or replace the datasets to use np.load or do something along those lines, but having gone down this path myself I think the best use of your time is seeing if you can conform the shape of your data to work with the functionality out of the box. I would probably go as far as writing the data to png files and loading them using ImageClassifierData.from_paths before I would attempt to mess with the internals.

If you would like to use np.load nonetheless, I can’t confirm this right now, but I believe all you would need to change are the datasets that the ImageClassifierData object references, something like icd.trn_ds = NewDataset(). IIRC you would not need to touch the dataloaders at all, but maybe this theory is rubbish. Nonetheless, I am 100% convinced that looking at the datasets would be a good starting point.

echon · March 25, 2018, 8:27pm

Thank you for your answer. I’ll check the datasets.
My data are represented as floating-point numbers, which is why it is inconvenient to convert the arrays to png.
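To illustrate the inconvenience, here is a rough sketch of what a round trip through 8-bit values (the usual png depth) does to float data. The helper names are made up, and the min/max rescaling is just one possible mapping, not something fastai does for you.

```python
import numpy as np

def quantize_to_uint8(arr):
    """Rescale a float array to 0..255 and round, as an 8-bit png would need."""
    lo, hi = float(arr.min()), float(arr.max())
    scale = (hi - lo) or 1.0  # avoid dividing by zero for constant arrays
    q = np.round((arr - lo) / scale * 255).astype(np.uint8)
    return q, lo, hi

def dequantize(q, lo, hi):
    """Map the 8-bit values back to the original float range."""
    return q.astype(np.float32) / 255 * (hi - lo) + lo

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8)).astype(np.float32)

q, lo, hi = quantize_to_uint8(x)
x_back = dequantize(q, lo, hi)

# Largest absolute difference between the original and the round-tripped data.
max_err = float(np.abs(x - x_back).max())
print(max_err)
```

On top of the rounding error, lo and hi would have to be stored somewhere per file to undo the scaling, so keeping the arrays as .npy is simpler.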

echon · April 3, 2018, 5:00pm

Update: I did take a look at the datasets and that’s indeed the way to go. I’ve managed to load my arrays by inheriting from FilesIndexArrayDataset and overriding its get_x method. Then I inherited from ImageClassifierData and overrode the from_paths method, essentially copying the original definition but using my derived FilesIndexArrayDataset class. After that, you can import the new classifier class and use it instead of the usual ImageClassifierData.
Example of use:


from serializedArrayData import SerializedArrayClassifierData
data = SerializedArrayClassifierData.from_paths(PATH, bs, tfms)

Here is the code for anyone interested


#File serializedArrayData.py

import numpy as np
import os
from fastai.dataset import FilesIndexArrayDataset, ImageClassifierData, folder_source, read_dir

class FilesSerializedArrays(FilesIndexArrayDataset):
    # Load each sample with np.load instead of the default open_image.
    def get_x(self, i): return np.load(os.path.join(self.path, self.fnames[i]))

class SerializedArrayClassifierData(ImageClassifierData):
    @classmethod
    def from_paths(cls, path, bs=64, tfms=(None,None), trn_name='train', val_name='valid', test_name=None, test_with_labels=False, num_workers=8):
        """ Read in serialized numpy arrays and their labels given as sub-folder names

        Arguments:
            path: a root path of the data (used for storing trained models, precomputed values, etc)
            bs: batch size
            tfms: transformations (for data augmentations). e.g. output of `tfms_from_model`
            trn_name: a name of the folder that contains training images.
            val_name:  a name of the folder that contains validation images.
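            test_name:  a name of the folder that contains test images.
            test_with_labels: whether the test folder is organised into labelled sub-folders like the train and validation folders.
            num_workers: number of workers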

        Returns:
            SerializedArrayClassifierData
        """
        assert not(tfms[0] is None or tfms[1] is None), "please provide transformations for your train and validation sets"
        trn,val = [folder_source(path, o) for o in (trn_name, val_name)]
        if test_name:
            test = folder_source(path, test_name) if test_with_labels else read_dir(path, test_name)
        else: test = None
        datasets = cls.get_ds(FilesSerializedArrays, trn, val, tfms, path=path, test=test)
        return cls(path, datasets, bs, num_workers, classes=trn[2])
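
For anyone who wants to smoke-test the class without real data, here is a small sketch that writes a toy tree of .npy files in the train/valid sub-folder layout. The helper name make_toy_array_tree is made up. The float32, height x width x 3, values in [0, 1] format is my assumption about what the transforms expect (roughly what open_image returns), so adjust it to your data.

```python
import os
import numpy as np

def make_toy_array_tree(root, classes=('a', 'b'), n_per_class=4,
                        shape=(32, 32, 3), seed=0):
    """Write random float32 arrays to root/{train,valid}/<class>/<i>.npy."""
    rng = np.random.default_rng(seed)
    for split in ('train', 'valid'):
        for cls_name in classes:
            cls_dir = os.path.join(root, split, cls_name)
            os.makedirs(cls_dir, exist_ok=True)
            for i in range(n_per_class):
                # Values are uniform in [0, 1), stored as float32.
                arr = rng.random(shape, dtype=np.float32)
                np.save(os.path.join(cls_dir, '%d.npy' % i), arr)
    return root

# Hypothetical usage (tfms is a placeholder, not defined here):
# PATH = make_toy_array_tree('toy_arrays')
# data = SerializedArrayClassifierData.from_paths(PATH, bs=2, tfms=tfms)
```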