How to create an ImageList from images stored in a CSV file

HansLee · June 9, 2019, 9:34am

I am a beginner in fastai, who is trying to finish Kaggle’s digit recognizer competition.

But I got stuck as soon as I tried to load the image data into an ImageList.

The data provided is a csv file with pixels and labels of the images, which looks like this:

label, pixel1, pixel2, ... , pixel 784
1, 0, 0, ... , 0
...

It seems that the ImageList API only works with image files that are stored on disk and organized by label, which does not fit images stored as rows in a csv file.

I have found some solutions using ImageClassifierData, but I can’t find it in the documentation and it seems to have been superseded.

I have also found suggestions that a custom dataset would help, but that would mean I couldn’t use the built-in fastai functions (validation split, show bunch, etc.). It might be too hard for me.

What should I do? Any suggestions or example code would be appreciated.
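
For the csv layout described above, a minimal sketch of how each row maps back to a 28x28 image, using only pandas and NumPy and no fastai. The two-row in-memory csv is made up for illustration, and the column names follow the header quoted in the question.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-row sample in the layout from the question:
# a label column followed by 784 pixel columns.
header = "label," + ",".join(f"pixel{i}" for i in range(1, 785))
row_a = "1," + ",".join(["0"] * 784)      # an all-black image labelled 1
row_b = "7," + ",".join(["255"] * 784)    # an all-white image labelled 7
csv_text = "\n".join([header, row_a, row_b])

df = pd.read_csv(io.StringIO(csv_text))

labels = df["label"].to_numpy()
pixels = df.drop(columns="label").to_numpy(dtype=np.uint8)
images = pixels.reshape(-1, 28, 28)       # one 28x28 array per csv row

print(labels, images.shape, images.dtype)
```

Each entry of images is then an ordinary greyscale array that can be plotted or wrapped in whatever image type the library expects.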

navidpanchi · June 9, 2019, 12:37pm

Check out the tabular loader; https://docs.fast.ai/tabular.data.html might help.

HansLee · June 9, 2019, 12:37pm

Finally I managed to build a custom dataset myself.

Although it’s a little bit hard for me to understand how a custom dataset works, it also means I will learn more during this experience.

Here is my solution.

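from fastai.vision import *
import numpy as np
import pandas as pd

class NumpyImageList(ImageList):
    def open(self, fn):
        # Each item is a flat array of 784 pixel values, not a file name.
        img = fn.reshape(28,28,1)
        return Image(pil2tensor(img, dtype=np.float32))

    @classmethod
    def from_csv(cls, path:PathOrStr, csv:str, **kwargs)->'ItemList':
        df = pd.read_csv(Path(path)/csv, header='infer')
        # Build the list from the full DataFrame first, so that
        # label_from_df can still read the label column later.
        res = super().from_df(df, path=path, cols=0, **kwargs)
        if 'label' in df.columns:
            df = df.drop('label', axis=1)
        # Scale pixels to [0, 1], then standardise with the global mean and std.
        pixels = np.array(df)/255.
        mean = pixels.mean()
        std = pixels.std()
        res.items = (pixels-mean)/std
        return res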

And we can use it just like a regular ImageList:

test = NumpyImageList.from_csv('../input/', 'test.csv')
tfms = get_transforms(do_flip=False)
data = (NumpyImageList.from_csv('../input/', 'train.csv')
        .split_by_rand_pct(.1)
        .label_from_df(cols='label')
        .add_test(test, label=0)
        .transform(tfms)
        .databunch(bs=128, num_workers=0)
        .normalize(imagenet_stats))
data

# Output:
# ImageDataBunch;
# 
# Train: LabelList (37800 items)
# x: NumpyImageList
# Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28)
# y: CategoryList
# 1,0,1,4,0
# Path: ../input;
# 
# Valid: LabelList (4200 items)
# x: NumpyImageList
# Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28)
# y: CategoryList
# 4,6,6,8,2
# Path: ../input;
# 
# Test: LabelList (28000 items)
# x: NumpyImageList
# Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28)
# y: CategoryList
# 0,0,0,0,0
# Path: ../input
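
The preprocessing that from_csv and open perform can be checked on its own, without fastai. This is only a sketch: the random pixel array stands in for the csv contents, and the variable names are made up for illustration.

```python
import numpy as np

# Stand-in for the pixel columns of the csv: 4 fake images of 784 pixels each.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(4, 784))

scaled = pixels / 255.0                    # map 0..255 to 0..1
mean, std = scaled.mean(), scaled.std()    # global statistics over all pixels
items = (scaled - mean) / std              # the array stored in res.items

# What open() does to a single item, minus the fastai Image wrapper:
img = items[0].reshape(28, 28, 1)          # height x width x channel

print(img.shape, abs(items.mean()) < 1e-9, abs(items.std() - 1) < 1e-9)
```

After this step the items have roughly zero mean and unit standard deviation, so the .normalize(imagenet_stats) call in the pipeline applies a second normalisation on top of it.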

abyaadrafid · June 10, 2019, 9:04am

I have done the same thing with a slightly different approach here, without overriding the from_csv method. A step-by-step explanation is also there.
Hope this helps.

HansLee · June 10, 2019, 10:27am

Thanks.

I have had a brief look at your kernel. I think we are actually doing the same thing and differ only in code style.

It’s a pity that I didn’t search further yesterday, so I missed it.

Here is my kernel ( just for fun )

https://www.kaggle.com/hanslee01/digit-recognizer-with-cnn-fastai

abyaadrafid · June 10, 2019, 10:35am

Ah, I had just made it public like an hour or two ago.