NicWick
(Nicholas Wickman)
July 17, 2019, 3:39am
1
Hey guys, greenhorn here.
I am trying to use the data block API for a Kaggle competition. It seems to me that many Kaggle datasets ship their images in zip archives.
It would be extremely convenient if I could use the API with these archives directly, something like:
data = (ImageList.from_csv(PATH, 'train.csv', folder='images.zip', suffix='.png')
        .split_by_folder(train='train_images.zip', valid='valid_images.zip')
        .label_from_df()
        .transform(size=224)
        .databunch())
This gives me the error below, which I take to mean that nothing is being loaded:
IndexError: index 0 is out of bounds for axis 0 with size 0
Is there a method for loading zip archives into the data block API? Or should I unzip them first and then load the extracted folders?
muellerzr
(Zachary Mueller)
July 17, 2019, 3:50am
2
Interesting thought. I don’t believe we can currently, but I’m sure a modification could get it working. Here are some ideas:
python, zipfile
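For the unzip-first route, a minimal sketch using only the standard library zipfile module might look like the following. The helper name extract_archive is hypothetical, and the file names in the usage note are simply the ones from the snippet in the first post.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Hypothetical helper: extract zip_path into dest_dir and return dest_dir as a Path."""
    dest = Path(dest_dir)
    # Create the destination folder if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        # Write every member of the archive under dest.
        zf.extractall(dest)
    return dest
```

Something like extract_archive('train_images.zip', 'train_images') should then leave a regular folder that can be passed to the data block API in place of the archive. This costs extra disk space, since both the archive and the extracted files exist afterwards.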
NicWick
(Nicholas Wickman)
July 17, 2019, 3:54am
3
Thank you for the resource Zachary!
I realized my error was coming from elsewhere in that code: I had made the mistake of conflating my test and validation sets.
As it turns out, Kaggle will handle loading the zip archives as folders. My code runs when I rename train_images.zip to train_images.
Sweet!
muellerzr
(Zachary Mueller)
July 17, 2019, 3:55am
4
Interesting. So you were able to do it after extracting the zip? Or without extraction?
NicWick
(Nicholas Wickman)
July 17, 2019, 3:56am
5
Without. My Kaggle workspace lists the data as zip archives, but it appears to serve them to the notebook as regular folders.
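One way to check what the notebook actually sees is to test the path before building the data. This is only a sketch with the standard library; describe_input is a hypothetical helper name, and any path passed to it is whatever your own environment uses.

```python
import zipfile
from pathlib import Path


def describe_input(path):
    """Hypothetical helper: report whether path is a folder, a zip archive, or neither."""
    p = Path(path)
    if p.is_dir():
        return "folder"
    # is_zipfile inspects the file contents, not just the extension.
    if p.is_file() and zipfile.is_zipfile(str(p)):
        return "zip archive"
    return "missing or other"
```

If this reports "folder" for the training data path, the images can be read as ordinary files and no extraction step is needed.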
muellerzr
(Zachary Mueller)
July 17, 2019, 4:04am
6
Got it. I may still look into zip support as a fun project. It would certainly help save disk space.
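As a rough starting point for that idea, the standard library can read members straight out of an archive without extracting them to disk. This is a sketch only: iter_zip_images is a hypothetical name, the default suffix is borrowed from the first post, and it yields raw bytes, so decoding the images and wiring them into a custom item list is not shown here.

```python
import zipfile


def iter_zip_images(zip_path, suffix=".png"):
    """Hypothetical helper: yield (member_name, raw_bytes) for matching files in a zip."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Keep only members whose name ends with the wanted suffix.
            if name.lower().endswith(suffix):
                # Read the member into memory; nothing is written to disk.
                yield name, zf.read(name)
```

The trade-off is that every read decompresses the member again, so this saves space at the cost of some CPU time per image.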