I have a suggestion: we should get rid of untar_data and replace it with uncompress_data. The difference is that uncompress_data handles everything untar_data handled, but can also handle zip files.
Here are the code changes required to make this work:
#export
def uncompress_data(url, fname=None, dest=None, c_key=ConfigKey.Data, force_download=False):
    "Download `url` to `fname` if `dest` doesn't exist, and extract it (tar/tgz or zip) to folder `dest`."
    default_dest = _url2path(url, c_key=c_key).with_suffix('')
    dest = default_dest if dest is None else Path(dest)/default_dest.name
    fname = Path(fname or _url2path(url))
    # Force a fresh download if the file's checksum no longer matches the registered one
    if fname.exists() and _get_check(url) and _check_file(fname) != _get_check(url):
        print("A new version of this is available, downloading...")
        force_download = True
    if force_download:
        if fname.exists(): os.remove(fname)
        if dest.exists(): shutil.rmtree(dest)
    if not dest.exists():
        fname = download_data(url, fname=fname, c_key=c_key)
        if _get_check(url) and _check_file(fname) != _get_check(url):
            print(f"File downloaded is broken. Remove {fname} and try again.")
        if tarfile.is_tarfile(fname):
            tarfile.open(fname, 'r:*').extractall(dest.parent)  # 'r:*' auto-detects the compression
        elif zipfile.is_zipfile(fname):
            zipfile.ZipFile(fname, 'r').extractall(dest.parent)
        else:
            print(f'{fname.suffix} is not yet supported for decompressing')
    return dest
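For example, a call would look just like untar_data does today (the zip URL below is a placeholder, not a real fastai dataset):

path = uncompress_data(URLs.MNIST_SAMPLE)                       # tgz, handled by the tarfile branch
path = uncompress_data('https://example.com/some_dataset.zip')  # zip, handled by the new zipfile branch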
Let me know if this is an ok change and I will create a PR for it.
The other change required is to import zipfile in imports.py.
Thanks for the suggestion. Since all fast.ai datasets are tar/gz, and since we don’t want to support all possible compression formats, I think we’ll leave it as is. If you think functionality to provide more general decompression would be helpful, it might even make a nice little project you could create!
And if you create a python package that provides that functionality easily we can then use it in fastai to have a more general function. But we don’t feel it should be inside the fastai code as it’s not really DL-related.
Is it acceptable to use the code that is generated in nb 04_data_external? I will definitely rewrite it if you'd rather I didn't use it, but it would speed up my development quite a bit if I could use it and just add credit in the readme and in the .py file. Not really sure what the protocol is on something like this.
Actually I don’t think I will need to use much, just this piece:
fname = Path(fname or _url2path(url))
if tarfile.is_tarfile(fname):
    tarfile.open(fname, 'r:*').extractall(dest.parent)  # 'r:*' auto-detects the compression
elif zipfile.is_zipfile(fname):
    zipfile.ZipFile(fname, 'r').extractall(dest.parent)
else:
    print(f'{fname.suffix} is not yet supported for decompressing')
Because I am not going to do any of the downloading or anything else with decompress. It is only going to take an fname in, decompress it, and put the result in dest.
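For reference, here's a minimal sketch of what that standalone function could look like (the names decompress, fname, and dest are mine, not necessarily what the package uses):

import tarfile, zipfile
from pathlib import Path

def decompress(fname, dest):
    "Extract the archive at `fname` into the folder `dest` (tar/tgz and zip only in this sketch)."
    fname, dest = Path(fname), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)      # make sure the destination folder exists
    if tarfile.is_tarfile(fname):   tarfile.open(fname, 'r:*').extractall(dest)
    elif zipfile.is_zipfile(fname): zipfile.ZipFile(fname, 'r').extractall(dest)
    else: print(f'{fname.suffix} is not yet supported for decompressing')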
I believe I have created this now and put it into a pip package. Would you want it to be put into the untar_data function, or would you want to create a new uncompress_data (or decompress_data)? Otherwise, if you just want to try the tool out for yourself and make the changes, it is available via pip install decompress==0.0.5
Actually, while I was looking into a certain file type, I found another tool that I think might fit our needs and is already built out. It's called pyunpack and works like so:
from pyunpack import Archive
Archive('file_that_needs_decompressing').extractall('path_to_extract_location')
When I tested it, it seemed to work really well: I tried 7z, tar, gz, tgz, and zip files, and they all extracted correctly.
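Here's a self-contained example of what that testing looked like (the file names are hypothetical; note that pyunpack shells out to patool, so patool must be installed, plus a 7z binary for .7z archives):

import os
from pyunpack import Archive

out = 'data/coco_sample'
os.makedirs(out, exist_ok=True)  # pyunpack expects the destination directory to exist
Archive('coco_sample.zip').extractall(out)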
def decompress_data(url, fname=None, dest=None, c_key=ConfigKey.Data, force_download=False):
    "Download `url` to `fname` if `dest` doesn't exist, and extract it to folder `dest`."
    default_dest = _url2path(url, c_key=c_key).with_suffix('')
    dest = default_dest if dest is None else Path(dest)/default_dest.name
    fname = Path(fname or _url2path(url))
    # Force a fresh download if the file's checksum no longer matches the registered one
    if fname.exists() and _get_check(url) and _check_file(fname) != _get_check(url):
        print("A new version of this is available, downloading...")
        force_download = True
    if force_download:
        if fname.exists(): os.remove(fname)
        if dest.exists(): shutil.rmtree(dest)
    if not dest.exists():
        fname = download_data(url, fname=fname, c_key=c_key)
        if _get_check(url) and _check_file(fname) != _get_check(url):
            print(f"File downloaded is broken. Remove {fname} and try again.")
        Archive(fname).extractall(dest.parent)  # pyunpack dispatches on the archive type
    return dest
Also need to add this line to imports.py:
from pyunpack import Archive
And this to environment.yml in the pip section:
- pyunpack
- patool
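With those changes in place, usage would be the same as untar_data; here's a hypothetical call (the URL is a placeholder):

path = decompress_data('https://example.com/some_dataset.7z')
# `path` is the extracted folder, e.g. ~/.fastai/data/some_dataset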
This would enable a lot more compression types to be decompressed, which I think would be good for more general use of the fastai library, since there are a lot of zip files in the wild.
Thanks @KevinB. I've added an extract_func param now so you can easily use any function you like, including the one you found (thanks for digging that up!). Just create this func:
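(The snippet that originally followed isn't preserved in this thread; below is a minimal sketch of what such a func could look like, assuming extract_func receives the archive path and the directory to extract into:)

from pyunpack import Archive

def pyunpack_extract(fname, dest):
    "Extract `fname` into `dest` using pyunpack/patool."
    Archive(str(fname)).extractall(str(dest))

# then, hypothetically:
# path = untar_data(url, extract_func=pyunpack_extract)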
@KevinB and Jeremy thank you for doing this!!! I’m currently working with the COCO dataset for human keypoints and this makes my life so much easier. I can’t thank you both enough
I realize this thread is from 2 years ago, but I was just now looking for how to do this and…was briefly happy at having found an answer. It seems that extract_func is no longer included, so @KevinB's remedy is no longer viable. Any recommendations for what to do instead?
(e.g. I see download_data is also gone. I can run wget or curl manually, but…just imagining there's a more fastai-y way.)