Can we load zip archives with data_block?

NicWick · July 17, 2019, 3:39am

Hey guys, greenhorn here.

I am trying to use the data block API for a Kaggle competition. It seems that many Kaggle datasets ship their images in zip archives.

It would be extremely convenient if I could use the API with these archives directly, something like:

data = (ImageList.from_csv(PATH, 'train.csv', folder='images.zip', suffix='.png')
        .split_by_folder(train='train_images.zip', valid='valid_images.zip')
        .label_from_df()
        .transform(size=224)
        .databunch())

This gives me a basic error which I interpret to mean I am not loading anything successfully.
IndexError: index 0 is out of bounds for axis 0 with size 0

Is there a method for loading zip archives into the data_block? Or should I unzip them some other way first, and then load the extracted images into the data_block?
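
For the "unzip them first" route, here is a minimal sketch using only Python's standard `zipfile` module. The helper name `extract_archive` and the example paths are hypothetical, and nothing here is part of the fastai API; it only prepares an ordinary folder that the data block could then be pointed at.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return dest_dir as a Path."""
    dest = Path(dest_dir)
    # Create the destination folder (and any parents) if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest
```

Calling something like `extract_archive('train_images.zip', 'train_images')` would leave a plain `train_images` folder on disk, which could then be used wherever the code above expects a folder name.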

muellerzr · July 17, 2019, 3:50am

Interesting thought. I don’t believe we can currently, but I’m sure a modification could get it working. Here are some ideas:

NicWick · July 17, 2019, 3:54am

Thank you for the resource Zachary!

I realized my error was coming from elsewhere in that code: I made the mistake of conflating my test and validation sets.

As it turns out, Kaggle will handle loading the zip archives as folders. My code runs when I rename train_images.zip to train_images.

Sweet!

muellerzr · July 17, 2019, 3:55am

Interesting. So you were able to do it after extracting the zip? Or without extraction?

NicWick · July 17, 2019, 3:56am

Without. My Kaggle workspace lists the data as zip archives, but it appears to serve them to the notebook as regular folders.
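
A quick way to check what a notebook actually sees at a given path is sketched below, using only the standard library. The helper name `describe_path` is made up for illustration, and how any particular Kaggle dataset is mounted is not something this code assumes.

```python
import zipfile
from pathlib import Path


def describe_path(path):
    """Return 'missing', 'folder', 'zip', or 'other' for the given path."""
    p = Path(path)
    if not p.exists():
        return "missing"
    if p.is_dir():
        return "folder"
    # is_zipfile inspects the file contents, so the file extension does not matter.
    if zipfile.is_zipfile(str(p)):
        return "zip"
    return "other"
```

Running it on both the name with `.zip` and the name without it shows which one exists in the notebook and whether it is a regular folder.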

muellerzr · July 17, 2019, 4:04am

Got it. I may still look into the zip as a fun project. It certainly would help save space.
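
As a rough starting point for that idea, the sketch below reads image files straight out of an archive without extracting it to disk. It uses only the standard `zipfile` module; the function names `list_images` and `read_image_bytes` are hypothetical, and wiring this into the data block would still need a custom item list, which is not shown here.

```python
import zipfile


def list_images(zip_path, suffix=".png"):
    """Return the sorted member names in the archive that end with suffix."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(n for n in archive.namelist() if n.lower().endswith(suffix))


def read_image_bytes(zip_path, member):
    """Return the raw bytes of one member, read directly from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(member)
```

The returned bytes are still encoded (PNG, JPEG, and so on), so an image library would be needed to decode them. This version reopens the archive on every read for simplicity; keeping it open would likely be faster for a real training loop.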