Generalizing untar_data to also work with zips

I'd like to suggest replacing untar_data with a more general uncompress_data.

uncompress_data would handle everything untar_data handles, and zip files as well.

Here are the code changes required to make this work:

#export
def uncompress_data(url, fname=None, dest=None, c_key=ConfigKey.Data, force_download=False):
    "Download `url` to `fname` if `dest` doesn't exist, and extract it (tar or zip) to folder `dest`."
    default_dest = _url2path(url, c_key=c_key).with_suffix('')
    dest = default_dest if dest is None else Path(dest)/default_dest.name
    fname = Path(fname or _url2path(url))
    if fname.exists() and _get_check(url) and _check_file(fname) != _get_check(url):
        print("A new version of this is available, downloading...")
        force_download = True
    if force_download:
        if fname.exists(): os.remove(fname)
        if dest.exists(): shutil.rmtree(dest)
    if not dest.exists():
        fname = download_data(url, fname=fname, c_key=c_key)
        if _get_check(url) and _check_file(fname) != _get_check(url):
            print(f"File downloaded is broken. Remove {fname} and try again.")
        if tarfile.is_tarfile(fname):
            # 'r:*' auto-detects the compression (plain, gzip, bz2, xz) instead of assuming gzip
            tarfile.open(fname, 'r:*').extractall(dest.parent)
        elif zipfile.is_zipfile(fname):
            zipfile.ZipFile(fname, 'r').extractall(dest.parent)
        else:
            print(f'{fname.suffix} is not yet supported for decompressing')
    return dest

Let me know if this is an ok change and I will create a PR for it.

The other change required is that you have to import zipfile in imports.py

Thanks for the suggestion. Since all fast.ai datasets are tar/gz, and since we don’t want to support all possible compression formats, I think we’ll leave it as is. If you think functionality to provide more general decompression would be helpful, it might even make a nice little project you could create! :slight_smile:

And if you create a python package that provides that functionality easily we can then use it in fastai to have a more general function. But we don’t feel it should be inside the fastai code as it’s not really DL-related.

Ok, I think I can put together a small library that is modeled off of untar_data that will take in a generic file and decompress it to a defined dest.

I'm currently thinking of supporting just .zip, .tar, and .tgz files. Are there any other obvious ones I should add?

Here is a list of archive formats according to wikipedia: https://en.wikipedia.org/wiki/List_of_archive_formats

I see bzip2 used quite often, and 7zip sometimes.
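For reference, Python's standard library can report which archive formats it is able to unpack on a given installation. This is a minimal sketch using only `shutil`; the helper name `supported_unpack_formats` is made up for illustration. Support for bzip2 and xz depends on how Python was built, and 7z is not covered by the standard library.

```python
import shutil

def supported_unpack_formats():
    """Map each registered unpack format name to its list of file extensions.

    Hypothetical helper for illustration; it only wraps shutil.get_unpack_formats().
    """
    return {name: exts for name, exts, _description in shutil.get_unpack_formats()}

formats = supported_unpack_formats()
for name, exts in sorted(formats.items()):
    # e.g. the 'zip' format is registered with the '.zip' extension
    print(f"{name}: {', '.join(exts)}")
```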

Is it acceptable to use the code that is generated in nb 04_data_external? I will definitely rewrite it if you would rather I not use it, but it would speed up my development quite a bit if I could use it and just add credit in the readme and in the .py file. I'm not really sure what the protocol is for something like this.

Actually I don’t think I will need to use much, just this piece:

fname = Path(fname or _url2path(url))
if tarfile.is_tarfile(fname):
    # 'r:*' auto-detects the compression instead of assuming gzip
    tarfile.open(fname, 'r:*').extractall(dest.parent)
elif zipfile.is_zipfile(fname):
    zipfile.ZipFile(fname, 'r').extractall(dest.parent)
else:
    # report the unsupported extension, not the file's stem
    print(f'{fname.suffix} is not yet supported for decompressing')

That's because decompress won't do any downloading. It will only take an fname, decompress it, and put the result in dest.
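A function along those lines might look roughly like the sketch below. It is an illustration only, not the code of any published package: the name `decompress` and its signature are assumptions, and it raises an error for unsupported files where the snippet above prints a message.

```python
import tarfile
import zipfile
from pathlib import Path

def decompress(fname, dest):
    """Extract the archive `fname` into the folder `dest` and return `dest`.

    Hypothetical helper for illustration, not the API of any real package.
    """
    fname, dest = Path(fname), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if tarfile.is_tarfile(fname):
        # 'r:*' lets tarfile detect the compression (none, gzip, bz2, xz)
        with tarfile.open(fname, 'r:*') as tf:
            tf.extractall(dest)
    elif zipfile.is_zipfile(fname):
        with zipfile.ZipFile(fname, 'r') as zf:
            zf.extractall(dest)
    else:
        raise ValueError(f"{fname.suffix!r} is not yet supported for decompressing")
    return dest
```

Detecting the format from the file contents (`is_tarfile`, `is_zipfile`) rather than from the extension means a misnamed archive is still handled. As with any extraction code, it should only be pointed at archives from a trusted source.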

Use whatever you like! :slight_smile:

Looks like this might be helpful

It looks like it never was completed:


    def _decompress(self, data, **kwargs):
        raise NotImplementedError

Good to know others have identified it as a problem though. I am working on putting a package together now:

I believe I have now created this and published it as a pip package. Would you want it used inside the untar_data function, or would you rather create a new uncompress_data (or decompress_data)? Otherwise, if you just want to try the tool out for yourself and make the changes, it is available via pip install decompress==0.0.5

Actually, while I was looking into a certain file type, I found another tool, already built out, that I think might fit our needs. It’s called pyunpack and works like so:

from pyunpack import Archive

# path_to_archive is the file to decompress; path_to_extract_location is the output folder
Archive(path_to_archive).extractall(path_to_extract_location)

When I tested it, it seemed to work really well.

I tested 7z, tar, gz, tgz, and zip. They all worked well.

This is my new proposed solution using pyunpack:

def decompress_data(url, fname=None, dest=None, c_key=ConfigKey.Data, force_download=False):
    "Download `url` to `fname` if `dest` doesn't exist, and extract it to folder `dest`."
    default_dest = _url2path(url, c_key=c_key).with_suffix('')
    dest = default_dest if dest is None else Path(dest)/default_dest.name
    fname = Path(fname or _url2path(url))
    if fname.exists() and _get_check(url) and _check_file(fname) != _get_check(url):
        print("A new version of this is available, downloading...")
        force_download = True
    if force_download:
        if fname.exists(): os.remove(fname)
        if dest.exists(): shutil.rmtree(dest)
    if not dest.exists():
        fname = download_data(url, fname=fname, c_key=c_key)
        if _get_check(url) and _check_file(fname) != _get_check(url):
            print(f"File downloaded is broken. Remove {fname} and try again.")
        Archive(fname).extractall(dest.parent)
    return dest

Also need to add this line to imports.py:

from pyunpack import Archive

And this to environment.yml in the pip section:

  - pyunpack
  - patool

This would allow many more archive types to be decompressed, which I think would be good for more general use of the fastai library, since there are a lot of zip files in the wild.

Thanks @KevinB. I’ve now added an extract_func param so you can easily use any function you like, including the one you found (thanks for digging that up!). Just create this func:

def arc_extract(fname, dest): Archive(fname).extractall(dest)

Then pass extract_func=arc_extract as a param and it should all work! :slight_smile:

This way we don’t need any extra deps, and can allow users to use whatever packages they want to extract files.
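The idea of passing in an extraction callback can be sketched in isolation. The code below is not fastai's implementation. `extract_to` and `default_extract` are hypothetical names, and the only thing taken from the thread is the `extract_func(fname, dest)` calling convention.

```python
import shutil
from pathlib import Path
from zipfile import ZipFile

def default_extract(fname, dest):
    "Default extractor: let the stdlib pick the format from the file extension."
    shutil.unpack_archive(str(fname), str(dest))

def extract_to(fname, dest, extract_func=default_extract):
    """Extract `fname` into `dest` using a caller-supplied `extract_func(fname, dest)`.

    Hypothetical stand-in for the pattern described above, not fastai's code.
    """
    dest = Path(dest)
    if not dest.exists():  # skip the work if it was already extracted
        extract_func(Path(fname), dest)
    return dest

def zip_extract(fname, dest):
    "Example of a user-defined extractor with the same (fname, dest) signature."
    with ZipFile(fname, mode='r') as zf:
        zf.extractall(dest)
```

A caller would then write something like `extract_to('data.zip', 'data', extract_func=zip_extract)`, where the file names are placeholders. Keeping the extractor as a plain function argument is what lets the library avoid depending on any particular archive package.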

Here is what I am currently using when I want to unzip something:

from zipfile import ZipFile

url = "https://github.com/karoldvl/ESC-50/archive/master.zip"

def zip_extract(fname, dest):
    # ZipFile is imported directly above, so call it without the module prefix
    ZipFile(fname, mode='r').extractall(dest)

path = untar_data(url, extract_func=zip_extract)

@KevinB and Jeremy thank you for doing this!!! I’m currently working with the COCO dataset for human keypoints and this makes my life so much easier. I can’t thank you both enough :slight_smile:

I realize this thread is from 2 years ago but I was just now looking for how to do this and…was briefly happy at having found an answer. It seems that extract_func is no longer included, so @KevinB’s remedy is no longer viable. Any recommendations for what to do instead?

(E.g., I see download_data is also gone. I can run wget or curl manually, but I imagine there’s a more fastai-y way.)

Check out https://fastdownload.fast.ai/. This is the replacement for download_data. Does that fill your needs?

Thanks. I looked there but it still seems to be expecting a tar file:

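One workaround that does not depend on any fastai or fastdownload API is to download and unpack with the standard library alone. The sketch below is only an illustration under that assumption. `download_and_extract` and its parameters are hypothetical names, and it does no checksum verification or progress reporting.

```python
import shutil
import urllib.request
from pathlib import Path

def download_and_extract(url, archive_dir, data_dir):
    """Download `url` into `archive_dir` (if missing) and unpack it under `data_dir`.

    Hypothetical helper built only on the standard library.
    """
    archive_dir, data_dir = Path(archive_dir), Path(data_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    data_dir.mkdir(parents=True, exist_ok=True)
    fname = archive_dir / url.split('/')[-1]
    if not fname.exists():
        urllib.request.urlretrieve(url, str(fname))
    # Name the output folder after the archive, minus everything from the first dot
    dest = data_dir / fname.name.split('.')[0]
    if not dest.exists():
        # unpack_archive picks zip, tar, tar.gz, etc. from the file extension
        shutil.unpack_archive(str(fname), str(dest))
    return dest
```

For the zip URL mentioned earlier in the thread, a call might look like `download_and_extract("https://github.com/karoldvl/ESC-50/archive/master.zip", "archive", "data")`, where the two folder names are placeholders.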