Can we load zip archives with data_block?

Hey guys, greenhorn here.

I am trying to use the data_block API for a Kaggle competition. It seems to me that many Kaggle datasets ship the images as zip archives.

It would be extremely convenient if I could point the API at these archives directly, something like this:

from fastai.vision import *  # fastai v1 data_block API

data = (ImageList.from_csv(PATH, 'train.csv', folder='images.zip', suffix='.png')
        .split_by_folder(train='train_images.zip', valid='valid_images.zip')
        .label_from_df()
        .transform(size=224)
        .databunch())

This just gives me an error which I interpret to mean that nothing is being loaded successfully:
IndexError: index 0 is out of bounds for axis 0 with size 0

Is there a method for loading zip archives into the data_block? Or should I unzip them with some other method and then load them into the data_block in some other way?
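
If extraction is the way to go, I'm assuming something like this with Python's built-in zipfile module would do it (the paths are just placeholders for my setup):

import zipfile
from pathlib import Path

PATH = Path('data')  # placeholder for the competition folder

# extract each archive once, into a folder named after the archive minus '.zip'
for name in ['train_images.zip', 'valid_images.zip']:
    with zipfile.ZipFile(PATH / name) as zf:
        zf.extractall(PATH / name[:-len('.zip')])

After that, the data block call above should presumably work with the plain folder names instead of the *.zip ones.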

Interesting thought. Currently I don’t believe we can, but I’m sure a modification could get it working. Here are some ideas:

Thank you for the resource, Zachary!

I realized my error was coming from elsewhere in that code: I made the mistake of conflating my test and validation sets.

As it turns out, Kaggle handles the zip archives as folders. My code runs when I change train_images.zip to train_images.
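
For reference, what worked for me looks roughly like this (the folder='images' argument is just illustrative of my layout):

# same pipeline as before, just pointing at the folder names Kaggle exposes
data = (ImageList.from_csv(PATH, 'train.csv', folder='images', suffix='.png')
        .split_by_folder(train='train_images', valid='valid_images')
        .label_from_df()
        .transform(size=224)
        .databunch())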

Sweet!

Interesting. So you were able to do it after extracting the zip? Or without extraction?

Without. My Kaggle workspace lists the data as zip archives, but it appears to serve them to the notebook as regular folders.

Got it. I may still look into reading from the zip directly as a fun project. It would certainly help with disk space.
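
If I do get around to it, my rough plan (untested; the ZipImageList name and zip_path argument are my own, not part of fastai) would be to subclass ImageList and override open so it reads the bytes straight out of the archive:

import io
import zipfile
import numpy as np
import PIL
from fastai.vision import ImageList, Image, pil2tensor

class ZipImageList(ImageList):
    "Sketch: serve images straight from a zip archive, no extraction."
    def __init__(self, *args, zip_path=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.zip_path = zip_path
        self.copy_new.append('zip_path')  # keep the attribute through split/label

    def open(self, fn):
        # here fn is a member name inside the archive, not a path on disk
        with zipfile.ZipFile(self.zip_path) as zf:
            img = PIL.Image.open(io.BytesIO(zf.read(str(fn)))).convert('RGB')
        return Image(pil2tensor(img, np.float32).div_(255))

# usage idea: build the item list from the archive's own file listing
# zp = PATH/'train_images.zip'
# names = [n for n in zipfile.ZipFile(zp).namelist() if n.endswith('.png')]
# items = ZipImageList(names, path=PATH, zip_path=zp)

Reopening the archive on every call keeps the sketch simple; for real training (especially with multiple workers) you'd probably want to cache the ZipFile handle.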