How do you create a DataBunch from a CSV? (Images and Labels in same CSV)

Hi everyone,

Probably a simple question. When I download the MNIST dataset directly instead of using untar_data, it comes as a CSV file.

Each row in the CSV contains the pixel values for one image, except for the first value in the row, which is the label for that image.

My question: How can I get data which is stored in that way into a DataBunch?

I imagine there has to be a way to access rows/columns of the CSV individually, especially since it seems to be fairly common to store data in such a format.
Using ImageDataBunch.from_csv() doesn’t work: the dataloader for the labels seems to expect a separate CSV instead of a single value out of the same CSV.

So far I have been stumped. Any kind of hints would be greatly appreciated.

I’m not sure exactly what format you’re describing, but if you need more flexibility than from_csv can provide, you might want to load the CSV into a pandas dataframe and then use ImageDataBunch.from_df to get it into your databunch.
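For reference, here is a minimal sketch of the pandas part only, assuming the layout described in the question (a 'label' column followed by flattened pixel columns). The tiny inline CSV and its 2x2 "images" are made up for illustration; real MNIST rows would have 784 pixel values. It does not create a DataBunch, it only separates labels from pixels.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in for the MNIST CSV: a 'label' column followed by pixel columns.
# Real MNIST rows have 784 pixel values (28 x 28); 4 values (2 x 2) keep this short.
csv_text = "label,pixel0,pixel1,pixel2,pixel3\n7,0,255,128,0\n3,10,20,30,40\n"

df = pd.read_csv(io.StringIO(csv_text))

labels = df["label"].to_numpy()                # first column: one label per row
pixels = df.drop(columns="label").to_numpy()   # remaining columns: flattened pixels

side = int(np.sqrt(pixels.shape[1]))           # 2 here, 28 for MNIST
images = pixels.reshape(-1, side, side).astype(np.uint8)

print(labels)        # [7 3]
print(images.shape)  # (2, 2, 2)
```

With a real file, the io.StringIO(...) argument would simply be replaced by the path to the CSV.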

Thank you. I will try that.

I also found the function ImageDataBunch.from_lists() which might work.
It’s very flexible but also seems to require more manual work than other functions.

This got me one step further:

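from fastai.vision import *  # fastai v1: provides ImageDataBunch, get_transforms
import pandas as pd

df = pd.read_csv('data/cleaned.csv')
data = ImageDataBunch.from_df(path=Path('data'), df=df, ds_tfms=get_transforms(), size=224, bs=32)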

I am not sure why path is needed though…


Thanks for your answer. I have also been experimenting with dataframes. Ultimately, though, I came to the conclusion that it doesn’t seem possible to use the CSV directly, at least not if you want to use the functions fastai gives you.

Apparently you first have to convert the pixel values from the CSV into actual pictures and then set up an appropriate folder structure to load them into a DataBunch.
I mainly used code from here to do that:

Here is my own code:

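from pathlib import Path
import numpy as np
import pandas as pd
from fastai import *
from fastai.vision import *  # module name missing in the post; fastai.vision assumed
import imageio

path = Path('C:\\Users\\Me\\Desktop\\Folder\\MNIST')
train = pd.read_csv(path/'train.csv')

def to_img_shape(data_X, data_y=()):
    # reshape each flat row of 784 pixels into a 28x28 image
    data_X = np.array(data_X).reshape(-1,28,28)
    # repeat the grayscale channel 3 times to get an RGB-shaped array
    data_X = np.stack((data_X,)*3, axis=-1)
    data_y = np.array(data_y)
    return data_X,data_y

# the pixel columns hold the image, the 'label' column holds the class
data_X, data_y = train.loc[:,'pixel0':'pixel783'], train['label']

train_X,train_y = to_img_shape(data_X, data_y)

def save_imgs(path:Path, data, labels):
    path.mkdir(parents=True, exist_ok=True)
    # create one sub-folder per label
    for label in np.unique(labels):
        (path/str(label)).mkdir(parents=True, exist_ok=True)
    for i in range(len(data)):
        newImage = data[i].astype(np.uint8)
        if len(labels) != 0:
            # labelled data goes into the folder named after its label
            imageio.imsave( str( path/str(labels[i])/(str(i)+'.jpg') ), newImage )
        else:
            # unlabelled data (e.g. a test set) goes directly into path
            imageio.imsave( str( path/(str(i)+'.jpg') ), newImage )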


With this you will get one folder for each label. For example, the folder ‘0’ will contain all the pictures with the label 0.
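As a quick sanity check of that layout, something like the following sketch could walk the tree and read each label back from its folder name. It uses only the standard library; list_labelled_files is a hypothetical helper, and the temporary tree with empty placeholder files is made up so the snippet runs without any real images.

```python
import tempfile
from pathlib import Path


def list_labelled_files(root: Path):
    """Return sorted (relative file name, label) pairs; the label is the parent folder name."""
    pairs = []
    for img in root.glob("*/*.jpg"):
        pairs.append((img.relative_to(root).as_posix(), img.parent.name))
    return sorted(pairs)


# Build a throwaway tree that mimics the described layout: <root>/<label>/<index>.jpg
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, label in [("0.jpg", "5"), ("1.jpg", "0"), ("2.jpg", "0")]:
        (root / label).mkdir(exist_ok=True)
        (root / label / name).touch()  # empty placeholder, not a real image
    pairs = list_labelled_files(root)

print(pairs)  # [('0/1.jpg', '0'), ('0/2.jpg', '0'), ('5/0.jpg', '5')]
```

A label-per-folder tree like this is the kind of layout that fastai v1's ImageDataBunch.from_folder is meant to read, so the pairs above are mainly useful for checking that every image ended up where you expect.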

The function from_csv(...) does work; it just needs a separate CSV file with the labels (labels.csv by default).
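A rough sketch of how such a labels file could be derived from the pixel CSV, assuming it should hold one row per image file with the file name first and the label second. The header names 'name' and 'label' and the inline sample data are assumptions for illustration; the <index>.jpg file names follow the naming used in save_imgs above.

```python
import csv
import io

# Made-up stand-in for the pixel CSV: label first, pixel values after it.
pixel_csv = io.StringIO("label,pixel0,pixel1\n7,0,255\n3,10,20\n")

# In-memory output; use open('labels.csv', 'w', newline='') to write a real file.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["name", "label"])  # header names are an assumption

for i, row in enumerate(csv.DictReader(pixel_csv)):
    # Row i of the pixel CSV corresponds to the image saved as '<i>.jpg'.
    writer.writerow([f"{i}.jpg", row["label"]])

labels_csv = out.getvalue()
print(labels_csv)
```

The images themselves would still have to be written to disk first, since the labels file only refers to them by name.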

See more here:

My issue is that I don’t fully understand how to use cleaned.csv, which is the file ImageCleaner outputs when I clean my dataset.

Anyway, it seems like you are on your way with your own task, good luck!
