Hey all, new user here. I am trying to load a dataset from the 2015 APTOS challenge from Kaggle.
Another user has provided the dataset. The image labels in the .csv do not exactly match the file names, but they are ordered. I have been rummaging around the data_block API but I think something must be going over my head.
Images are in /images. Labels are in labels/trainlabels.csv
I would like to load the images from folder, and apply the labels from the csv by index. Is this possible? Is it a bad idea? Should I structure the data some other way?
I don’t understand why you would want to label by index. What is a typical image filename, and what is a typical label in the .csv? I am sure there is a simple way to match the id in your label file to the image id.
A typical image filename is 10001_left.jpg
In the csv it looks like 01_left.jpg
Should I iterate through the df and add the leading digits?
Yes, in your place I would probably save a new csv with the right filenames; that would probably be the easiest. Is it always 100 preceding the filename, or is there at least some logic so that you can automate it within a dataframe?
It seems that it is padded, with a 1 at the beginning for some reason. So the 1000th left eye image is named 11000_left.jpg
I am not sure that that holds though, as it’s a little strange to me to begin with. That is why I was considering using the index.
That is indeed strange. If you try creating a new csv with a function like:
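import pandas as pd

def change_csv(old, new):
    # Read the original labels; expects columns named 'image' and 'label'
    df = pd.read_csv(old)
    new_df = pd.DataFrame(columns=['image', 'label'])
    for row in df.itertuples():
        image = row.image
        label = row.label
        # Prefix '1' and zero-pad the number before the '_' to 4 digits
        new_df.loc[row.Index] = ['1' + '0' * (4 - len(image.split('_')[0])) + image, label]
    # Save the corrected filenames to the new csv
    new_df.to_csv(new, index=False)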
And then try to load your images with it, does it work? Are all images there?
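To check whether the padding rule really holds for every row, something like the sketch below might help. It assumes the csv has an 'image' column and that the rule is "a leading 1, then the number zero-padded to 4 digits", as described above. The helper names padded_name and missing_images are made up for illustration and are not part of any library.

```python
from pathlib import Path

import pandas as pd


def padded_name(csv_name: str) -> str:
    """Map a csv name like '01_left.jpg' to '10001_left.jpg' (assumed rule)."""
    number, rest = csv_name.split("_", 1)
    # Leading '1', then the number zero-padded to 4 digits
    return "1" + number.zfill(4) + "_" + rest


def missing_images(csv_path, image_dir, column="image"):
    """Return renamed csv entries that have no matching file in image_dir."""
    df = pd.read_csv(csv_path)
    # Names of everything actually present in the image folder
    on_disk = {p.name for p in Path(image_dir).iterdir()}
    expected = df[column].map(padded_name)
    return sorted(set(expected) - on_disk)
```

If missing_images("labels/trainlabels.csv", "images") comes back empty, every label row has a matching file and the rule probably holds. Anything it returns is a row worth looking at by hand before trusting the new csv.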