Generalizing untar data to also work with zips

KevinB · September 1, 2019, 6:01pm

I suggest we get rid of untar_data and replace it with uncompress_data.

The difference is that uncompress_data handles everything untar_data handles, but can also handle zip files.

Here are the code changes required to make this work:

#export
def uncompress_data(url, fname=None, dest=None, c_key=ConfigKey.Data, force_download=False):
    "Download `url` to `fname` if `dest` doesn't exist, and un-tgz to folder `dest`."
    default_dest = _url2path(url, c_key=c_key).with_suffix('')
    dest = default_dest if dest is None else Path(dest)/default_dest.name
    fname = Path(fname or _url2path(url))
    if fname.exists() and _get_check(url) and _check_file(fname) != _get_check(url):
        print("A new version of this is available, downloading...")
        force_download = True
    if force_download:
        if fname.exists(): os.remove(fname)
        if dest.exists(): shutil.rmtree(dest)
    if not dest.exists():
        fname = download_data(url, fname=fname, c_key=c_key)
        if _get_check(url) and _check_file(fname) != _get_check(url):
            print(f"File downloaded is broken. Remove {fname} and try again.")
        if tarfile.is_tarfile(fname):
            tarfile.open(fname, 'r:*').extractall(dest.parent)  # 'r:*' auto-detects compression, so plain .tar works too
        elif zipfile.is_zipfile(fname):
            zipfile.ZipFile(fname, 'r').extractall(dest.parent)
        else:
            print(f'{fname.suffix} is not yet supported for decompressing')
    return dest

Let me know if this is an ok change and I will create a PR for it.

The other change required is that you have to import zipfile in imports.py

jeremy · September 1, 2019, 6:13pm

Thanks for the suggestion. Since all fast.ai datasets are tar/gz, and since we don’t want to support all possible compression formats, I think we’ll leave it as is. If you think functionality to provide more general decompression would be helpful, it might even make a nice little project you could create!

sgugger · September 1, 2019, 6:15pm

And if you create a python package that provides that functionality easily we can then use it in fastai to have a more general function. But we don’t feel it should be inside the fastai code as it’s not really DL-related.

KevinB · September 1, 2019, 6:19pm

Ok, I think I can put together a small library that is modeled off of untar_data that will take in a generic file and decompress it to a defined dest.

I’m currently thinking of supporting just .zip, .tar, and .tgz files. Are there any other obvious ones I should add?

Here is a list of archive formats according to Wikipedia: https://en.wikipedia.org/wiki/List_of_archive_formats

jeremy · September 1, 2019, 6:21pm

I see bzip2 used quite often, and 7zip sometimes.
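For the bzip2 case, Python's standard library may already be enough. The sketch below assumes `shutil.unpack_archive`, which picks the format from the file extension and covers zip plus plain, gzip, bzip2 and xz tar archives. 7z is not among the formats shutil registers by default, so it would still need a third-party tool. The helper names `make_sample_tar_bz2` and `unpack_any` are made up for illustration.

```python
import shutil
import tarfile
from pathlib import Path

def make_sample_tar_bz2(workdir):
    """Create a tiny .tar.bz2 archive holding one text file (demo data only)."""
    workdir = Path(workdir)
    src = workdir / "hello.txt"
    src.write_text("hello")
    archive = workdir / "sample.tar.bz2"
    with tarfile.open(archive, "w:bz2") as tf:
        tf.add(src, arcname="hello.txt")
    return archive

def unpack_any(fname, dest):
    """Extract `fname` into `dest`; shutil infers the format from the extension."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(fname), str(dest))
    return dest

# Names of the unpack formats registered on this interpreter
supported = [name for name, _, _ in shutil.get_unpack_formats()]
```

Printing `supported` shows which formats a given Python build can unpack, which is a quick way to see what would still need an external package.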

KevinB · September 1, 2019, 6:37pm

Is it acceptable to use the code that is generated in nb 04_data_external? I will definitely rewrite it if you would rather I not use it, but it would speed up my development quite a bit if I could use it and just add credit in the readme and in the .py file. I’m not really sure what the protocol is on something like this.

KevinB · September 1, 2019, 6:44pm

Actually I don’t think I will need to use much, just this piece:

fname = Path(fname or _url2path(url))
if tarfile.is_tarfile(fname):
    tarfile.open(fname, 'r:*').extractall(dest.parent)
elif zipfile.is_zipfile(fname):
    zipfile.ZipFile(fname, 'r').extractall(dest.parent)
else:
    print(f'{fname.suffix} is not yet supported for decompressing')

Because I am not going to do any of the downloading or anything with decompress. It is only going to take an fname in, decompress it and put it in dest.
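A standalone helper of the kind described here could be sketched as follows. This is only an illustration using the standard library, not the code of the package that was eventually published. The function name `decompress` and the choice to raise `ValueError` instead of printing are assumptions.

```python
import tarfile
import zipfile
from pathlib import Path

def decompress(fname, dest):
    """Extract the tar or zip archive `fname` into the directory `dest`."""
    fname, dest = Path(fname), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if tarfile.is_tarfile(fname):
        # 'r:*' lets tarfile detect the compression (none, gzip, bz2, xz)
        with tarfile.open(fname, "r:*") as tf:
            tf.extractall(dest)
    elif zipfile.is_zipfile(fname):
        with zipfile.ZipFile(fname, "r") as zf:
            zf.extractall(dest)
    else:
        # Assumption: fail loudly rather than print and carry on
        raise ValueError(f"{fname.suffix} is not yet supported for decompressing")
    return dest
```

Raising an error makes an unsupported file visible to the caller instead of returning a `dest` that was never populated. As with any `extractall` call, this is best used on archives from a source you trust.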

jeremy · September 1, 2019, 7:44pm

Use whatever you like!

jeremy · September 1, 2019, 7:45pm

Looks like this might be helpful

KevinB · September 1, 2019, 8:15pm

It looks like it never was completed:


    def _decompress(self, data, **kwargs):
        raise NotImplementedError

Good to know others have identified it as a problem though. I am working on putting a package together now:

KevinB · September 1, 2019, 11:46pm

I believe I have created this now and put it into a pip package. Would you want it to be put into the untar_data function, or would you rather create a new uncompress_data (or decompress_data)? Otherwise, if you just want to try the tool out for yourself and make the changes, it is available via pip install decompress==0.0.5

Actually, while I was looking into a certain file type, I found another tool, already built out, that I think might fit our needs. It’s called pyunpack and works like this:

from pyunpack import Archive

Archive(path_to_archive).extractall(path_to_extract_location)

When I tested it, it seemed to work really well.

I tested 7z, tar, gz, tgz, and zip. They all worked well.

KevinB · September 2, 2019, 1:46am

This is my new proposed solution using pyunpack:

def decompress_data(url, fname=None, dest=None, c_key=ConfigKey.Data, force_download=False):
    "Download `url` to `fname` if `dest` doesn't exist, and un-tgz to folder `dest`."
    default_dest = _url2path(url, c_key=c_key).with_suffix('')
    dest = default_dest if dest is None else Path(dest)/default_dest.name
    fname = Path(fname or _url2path(url))
    if fname.exists() and _get_check(url) and _check_file(fname) != _get_check(url):
        print("A new version of this is available, downloading...")
        force_download = True
    if force_download:
        if fname.exists(): os.remove(fname)
        if dest.exists(): shutil.rmtree(dest)
    if not dest.exists():
        fname = download_data(url, fname=fname, c_key=c_key)
        if _get_check(url) and _check_file(fname) != _get_check(url):
            print(f"File downloaded is broken. Remove {fname} and try again.")
        Archive(fname).extractall(dest.parent)
    return dest

Also need to add this line to imports.py:

from pyunpack import Archive

And this to environment.yml in the pip section:

  - pyunpack
  - patool

This would allow many more archive formats to be extracted, which I think would be good for more general use of the fastai library, since there are a lot of zip files in the wild.

jeremy · September 2, 2019, 2:27pm

Thanks @KevinB. I’ve added an extract_func param now so you can easily use any function you like, including the one you found (thanks for digging that up!). Just create this func:

def arc_extract(fname, dest): Archive(fname).extractall(dest)

Then pass extract_func=arc_extract as a param and it should all work!

This way we don’t need any extra deps, and can allow users to use whatever packages they want to extract files.
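The idea of a pluggable extractor can be sketched independently of fastai. The snippet below is not the library's implementation. `extract_to` is a hypothetical helper that only shows how a default tar extractor can be swapped for any callable taking `(fname, dest)`.

```python
import tarfile
import zipfile
from pathlib import Path

def tar_extract(fname, dest):
    """Default extractor: tar archives, compression auto-detected."""
    with tarfile.open(fname, "r:*") as tf:
        tf.extractall(dest)

def zip_extract(fname, dest):
    """Alternative extractor for zip archives."""
    with zipfile.ZipFile(fname, "r") as zf:
        zf.extractall(dest)

def extract_to(fname, dest, extract_func=tar_extract):
    """Hypothetical helper: create `dest`, then delegate to `extract_func`."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    extract_func(str(fname), str(dest))
    return dest
```

Because the extractor is just a function argument, the calling code needs no knowledge of archive formats and no extra dependencies. A caller who wants 7z or rar support passes in a function backed by whatever package they prefer.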

KevinB · November 14, 2019, 4:07am

Here is what I am currently using when I want to unzip something:

import zipfile

url = "https://github.com/karoldvl/ESC-50/archive/master.zip"

def zip_extract(fname, dest):
    zipfile.ZipFile(fname, mode='r').extractall(dest)

path = untar_data(url, extract_func=zip_extract)

muellerzr · December 21, 2019, 7:27pm

@KevinB and Jeremy thank you for doing this!!! I’m currently working with the COCO dataset for human keypoints and this makes my life so much easier. I can’t thank you both enough

drscotthawley · October 23, 2021, 7:19pm

I realize this thread is from 2 years ago but I was just now looking for how to do this and…was briefly happy at having found an answer. Seems that extract_func is no longer included so that @KevinB’s remedy is no longer viable. Any recommendations for what to do instead?

(e.g., I see download_data is also gone. I can run wget or curl manually but…just imagining there’s a more fastai-y way.)

KevinB · October 23, 2021, 8:33pm

Check out https://fastdownload.fast.ai/. This is the replacement for download_data. Does that fill your needs?

drscotthawley · October 23, 2021, 8:35pm

Thanks. I looked there but it still seems to be expecting a tar file:

github.com

fastai/fastdownload/blob/770ef9e92498889105ad3f8960fcd8e4bad529c5/fastdownload/core.py#L114

    def update(self, url):
        "Store the hash and size in `download_checks.py`"
        update_checks(urldest(url, self.arch_path()), url, self.module)

    def extract(self, url, extract_key='data', force=False):
        "Extract archive already downloaded from `url`, overwriting existing if `force`"
        arch = urldest(url, self.arch_path())
        if not arch.exists(): raise Exception(f'{arch} does not exist')
        dest = self.data_path(extract_key)
        dest.mkdir(exist_ok=True, parents=True)
        return untar_dir(arch, dest, rename=True, overwrite=force)

    def get(self, url, extract_key='data', force=False):
        "Download and extract `url`, overwriting existing if `force`"
        if not force:
            data = self.data_path(extract_key, urldest(url, self.arch_path()))
            if data.exists(): return data
        self.download(url, force=force)
        return self.extract(url, extract_key=extract_key, force=force)
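
If no library helper fits, one workaround for zip downloads uses only the standard library. This is a rough sketch, not a fastai or fastdownload feature. `fetch_and_unpack` and its parameters are hypothetical names, and it does no checksum verification.

```python
import shutil
import urllib.request
from pathlib import Path

def fetch_and_unpack(url, archive_path, dest):
    """Download `url` to `archive_path` (skipped if present), then unpack into `dest`."""
    archive_path, dest = Path(archive_path), Path(dest)
    if not archive_path.exists():
        archive_path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(archive_path))
    dest.mkdir(parents=True, exist_ok=True)
    # The format is inferred from the archive's extension (.zip, .tar.gz, ...)
    shutil.unpack_archive(str(archive_path), str(dest))
    return dest
```

With the ESC-50 URL from earlier in the thread, a call might look like fetch_and_unpack(url, "ESC-50-master.zip", "data"), assuming that URL is still live. The local archive name must carry the right extension, since that is how shutil chooses the format.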