How to create an ImageList from images stored in a CSV file

I am a fastai beginner trying to complete Kaggle’s Digit Recognizer competition.

But I got stuck immediately when trying to load the image data into an ImageList.

The data provided is a csv file with pixels and labels of the images, which looks like this:

label, pixel1, pixel2, ... , pixel 784
1, 0, 0, ... , 0
...
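
For readers unfamiliar with this layout, here is a minimal sketch of turning such rows into image arrays. It uses only NumPy, a made-up in-memory CSV with 4 pixels per row (2x2) instead of 784 (28x28), and a hypothetical helper name `load_pixel_csv`; none of these come from the competition files or from fastai.

```python
import io
import numpy as np

# Tiny stand-in for train.csv: a label column followed by pixel columns.
# Real rows have 784 pixels (28x28); 4 pixels (2x2) keep the example short.
csv_text = "label,pixel0,pixel1,pixel2,pixel3\n1,0,255,0,255\n7,255,255,0,0\n"

def load_pixel_csv(source, side):
    """Hypothetical helper: return (labels, images), images shaped (n, side, side)."""
    # skiprows=1 drops the header line; ndmin=2 keeps a 2-D table even for one row
    table = np.loadtxt(source, delimiter=",", skiprows=1, ndmin=2)
    labels = table[:, 0].astype(int)               # first column is the label
    images = table[:, 1:].reshape(-1, side, side)  # remaining columns are pixels
    return labels, images

labels, images = load_pixel_csv(io.StringIO(csv_text), side=2)
print(labels.shape, images.shape)
```

For the real data, `source` would be the path to train.csv and `side` would be 28.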

It seems that the ImageList API only works with image files that are stored on disk and organized by label, which does not fit images stored in a CSV file.

I have found some solutions using ImageClassifierData, but I can’t find it in the documentation and it seems to have been superseded.

I have also found suggestions that a custom dataset would help, but that would mean I couldn’t use the built-in fastai functions (validation split, show_batch, etc.). It might be too hard for me.

What should I do? Any suggestions or example code would be appreciated.


Check out the tabular data loader; https://docs.fast.ai/tabular.data.html might help.

I finally managed to build a custom dataset myself.

It was a little hard for me to understand how a custom dataset works, but that also means I learned more from the experience.

Here is my solution.

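from fastai.vision import *  # fastai v1; also brings in np, pd, Path and PathOrStr

class NumpyImageList(ImageList):
    "ImageList whose items are rows of pixel values rather than file paths."
    def open(self, fn):
        # `fn` is one row of 784 pixel values, not a filename
        img = fn.reshape(28,28,1)
        return Image(pil2tensor(img, dtype=np.float32))

    @classmethod
    def from_csv(cls, path:PathOrStr, csv:str, **kwargs)->'ItemList':
        df = pd.read_csv(Path(path)/csv, header='infer')
        # Build the list from the DataFrame first so label_from_df can still find the labels
        res = super().from_df(df, path=path, cols=0, **kwargs)
        # train.csv has a 'label' column, test.csv does not
        if 'label' in df.columns:
            df = df.drop('label', axis=1)
        # Scale pixels to [0, 1], then standardize with this file's own mean and std
        df = np.array(df)/255.
        mean = df.mean()
        std = df.std()
        # Replace the placeholder items with the standardized pixel rows
        res.items = (df-mean)/std
        return res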
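
The scaling step inside from_csv can be looked at in isolation. The sketch below uses only NumPy and a hypothetical helper name `standardize`. It also shows one possible variation that is not part of the class above: passing the training mean and std in when processing the test set, so both sets share the same statistics instead of each file using its own.

```python
import numpy as np

def standardize(pixels, mean=None, std=None):
    """Hypothetical helper: scale 0-255 pixels to [0, 1], then standardize.

    If mean/std are not given, they are computed from `pixels` itself,
    which mirrors what from_csv does for each CSV file.
    """
    scaled = np.asarray(pixels, dtype=np.float32) / 255.0
    if mean is None:
        mean = scaled.mean()
    if std is None:
        std = scaled.std()
    return (scaled - mean) / std, mean, std

# Made-up pixel rows, 4 pixels each, purely for illustration
train = np.array([[0, 255, 0, 255], [255, 255, 0, 0]])
test = np.array([[0, 0, 0, 255]])

train_std, mean, std = standardize(train)
# Variation (an assumption, not in the original class): reuse the training stats
test_std, _, _ = standardize(test, mean, std)
print(train_std.mean(), train_std.std())
```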

And we can use it just like a regular ImageList:

test = NumpyImageList.from_csv('../input/', 'test.csv')
tfms = get_transforms(do_flip=False)
data = (NumpyImageList.from_csv('../input/', 'train.csv')
        .split_by_rand_pct(.1)
        .label_from_df(cols='label')
        .add_test(test, label=0)
        .transform(tfms)
        .databunch(bs=128, num_workers=0)
        .normalize(imagenet_stats))
data

# Output:
# ImageDataBunch;
# 
# Train: LabelList (37800 items)
# x: NumpyImageList
# Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28)
# y: CategoryList
# 1,0,1,4,0
# Path: ../input;
# 
# Valid: LabelList (4200 items)
# x: NumpyImageList
# Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28)
# y: CategoryList
# 4,6,6,8,2
# Path: ../input;
# 
# Test: LabelList (28000 items)
# x: NumpyImageList
# Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28),Image (1, 28, 28)
# y: CategoryList
# 0,0,0,0,0
# Path: ../input
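
The 37800 / 4200 sizes in the output follow from holding out 10% of the 42000 training rows. As a rough illustration of what a random percentage split does (a NumPy sketch with a hypothetical function name, not fastai's actual implementation):

```python
import numpy as np

def random_split_indices(n_items, valid_pct, seed=None):
    """Hypothetical helper: shuffle indices and hold out the first valid_pct for validation."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_items)
    cut = int(valid_pct * n_items)         # number of validation items
    return shuffled[cut:], shuffled[:cut]  # (train indices, validation indices)

train_idx, valid_idx = random_split_indices(42000, 0.1, seed=42)
print(len(train_idx), len(valid_idx))  # 37800 4200
```
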

I have done the same thing with a slightly different approach here, without overriding the from_csv method. Step-by-step details are also there.
Hope this helps :)

Thanks.

I have had a brief look at your kernel. I think we are actually doing the same thing and differ only in code style.

It’s a pity I didn’t search further yesterday, so I missed it.

Here is my kernel ( just for fun )

https://www.kaggle.com/hanslee01/digit-recognizer-with-cnn-fastai


Ah, I only made it public an hour or two ago.