untar_data requires a .tgz file ending

untar_data requires a download url with a .tgz file ending in order to run.
The Caltech 101 data in http://course.fast.ai/datasets is a .tar.gz file.

Python’s tarfile package can handle both .tgz and .tar.gz files, so I think the problem is just with the implementation of download_data in datasets.py, which expects a URL without a file ending and then appends .tgz.

In the short term, it’d be helpful if whoever has access to the S3 bucket would save the Caltech 101 data as a .tgz file like the rest of the image classification datasets.

Will do. And the current URLs class is pretty awkward, so if anyone wants to help make it cleaner (without breaking existing usage) that would be great!

Done. Pull request https://github.com/fastai/fastai/pull/1047: “Leave URLs already ending in .tar.gz or .tgz alone”

Modified _url2tgz(url) to check whether the URL already ends in “.tar.gz” or “.tgz” before appending one.

Found this thread because my url had the .tar.gz on the end of it!

I’m still seeing download_data() append ‘.tgz’ at the end. Has this change still not been added to the code? Is there any way to use .tar.gz URLs with these methods now?

Yes, it looks like this issue is still not fixed.

In v1.0.50, download_data has an ext parameter that you can set to '' if you don’t want anything appended.
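To illustrate the point with a toy stand-in (make_fname is hypothetical, not fastai’s code): a default extension only makes sense when the URL doesn’t already carry one.

```python
# Toy stand-in for the filename logic discussed in this thread
# (make_fname is hypothetical, not the fastai implementation).
def make_fname(url: str, ext: str = '.tgz') -> str:
    """Derive a local filename from a URL, appending a default extension."""
    return url.split('/')[-1] + ext

# With the default ext, a URL that already ends in .tar.gz gets mangled:
make_fname('http://myserver/data.tar.gz')           # 'data.tar.gz.tgz'
# Passing ext='' (as described above for v1.0.50) avoids the extra suffix:
make_fname('http://myserver/data.tar.gz', ext='')   # 'data.tar.gz'
```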

Using untar_data has been the most frustrating experience I have encountered in quite some time. Never have I encountered an API that accepts a string but then modifies it under the hood. Absolutely bonkers. This justifies a breaking change.

I’m not sure I understand your frustration. untar_data is a helper function to download datasets that are present in URLs.something, which it does. It’s not intended to magically work with any url you pass it.

Hey sgugger, my frustration is with the implementation and documentation of this function.

This is the function signature:

untar_data(url:str, fname:PathOrStr=None, dest:PathOrStr=None, data=True, force_download=False)

The only required parameter is a url, which is a string. If you pass in a valid URL as a string, e.g. a URL from https://course.fast.ai/datasets such as “https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz”, this function dies.

I am sure I am not the only person who has run into this. If you name a string parameter “url”, people will attempt to pass in URLs as strings. Moreover, there is absolutely nothing in the documentation that indicates this unusual behaviour.

There is nothing in the documentation for untar_data indicating that it is only intended for some of the hardcoded example datasets; it is not in the URLs class, and it has no type-system support to indicate that it is not intended for arbitrary untarring of files.

I ran into the same issue while trying to train a model using my own data in lesson1 and managed to get past it using python3’s urllib.request, tarfile and Path modules. Thought I would share this here in case it helps. There may be a fastai library way to do this (?).

I am on Colab and also have a local setup where I use python3.6 and fastai 1.0.5 with an NVIDIA RTX 2080 GPU.

Basically, using lesson1’s code as a reference, I did the following things differently.

First, in the notebook cell on imports:

from pathlib import Path
import urllib.request
import tarfile

Next, instead of using untar_data, I used the following code. Note that I am not at liberty to share the exact URL I used, so replace the url with your own to make this work. Note also that I don’t check whether the downloaded file is actually a tar file or whether the calls succeed; those checks can be added easily.

url = "http://myserver/output/proj1_training_data.tar.gz"
local_tgz_path = Path("/root/.fastai/data/proj1_training_data.tar.gz")
print("Downloading from %s..." % (url,))
urllib.request.urlretrieve(url, local_tgz_path)
print("Opening using tarfile from %s..." % (local_tgz_path,))
tarred_file = tarfile.open(local_tgz_path)
tarred_file.extractall(path="/root/.fastai/data/")
tarred_file.close()
path = Path("/root/.fastai/data/proj1_training_data/")

The rest of the lesson1 notebook can remain the same and it all works out. My training data is also organized the same as the Oxford Pet data used in lesson1.

I lost many hours trying unsuccessfully to download my .tar.gz dataset. It always returned the same error: ReadError: not a gzip file

Your changes worked very well for me, except that I was getting a permission denied error for the path, which I easily resolved by replacing /root/ with /home/jupyter/.

Thank you so much for sharing.

Hello! I would like to ask: the docs say that the data and models are downloaded by default into ~/.fastai. How can we change the path?

You can change it in your config file, located in ~/.fastai/config.yml
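For example, the file might look something like this (illustrative only — the paths below are placeholders, and you should check your own generated file for the exact key names):

```yaml
# ~/.fastai/config.yml (illustrative; paths are placeholders)
data_path: /home/jupyter/data
model_path: /home/jupyter/models
```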

I changed it like this:

PATH = '/home/jupyter'  # here is my path
path = untar_data(URLs.PETS, dest=PATH)
path
PosixPath('/home/jupyter/oxford-iiit-pet')
path.ls()
[PosixPath('/home/jupyter/oxford-iiit-pet/images'), PosixPath('/home/jupyter/oxford-iiit-pet/annotations')]

And all the images were downloaded there, so now I can see all the data (I could not find it anywhere with the default location) without changing the default config location for now.
Thank you very much for your time!

I appreciate the work that goes into building this, but the API could be designed in a more human friendly way. Nothing about the function’s name or its error message helps. It’s easy to burn hours on things like this rather than learning AI.

This is a good mentality for designing APIs: https://blog.codinghorror.com/falling-into-the-pit-of-success/

The first thing I did after playing with some numbers was plug in a url to a .tar.gz dataset hoping to inspect what it looked like, and ran into this error (and was led to this thread).

The helper function is introduced in the same lecture that underlines how easy and convenient the fast.ai library makes experimentation and explains the most common formats of publicly available datasets, so it’s understandable that people think they can plug in a .tar URL and start examining what they got.

-The function lists a URL as a parameter, which receives a string.
-The existing variable is “URLs.something”; my first assumption was that it is just a dict of URLs somewhere.
-The function is simply named untar_data.

It could be clearer that this isn’t a general purpose utility.

I don’t know why, but it works! I deleted the tail (such as .tgz) from the URL, and it ran successfully.

What is the meaning of untar_data?

Hi @root0439
Is it still working for you?
I did it the same way you did yesterday, but it’s not working today.
This is the error I get: (error screenshot not preserved)

Kindly help
Thanks,

Thank you for posting this @rkishore

I adapted your solution to download the data needed to set up the lesson6-rossmann jupyter notebook, which you access by running rossman_data_clean. rossman_data_clean states that you should simply untar the data and go from there. Unfortunately, the data provided (http://files.fast.ai/part2/lesson14/rossmann.tgz) was throwing a “not a gzip file” error when I tried to run untar_data. Your solution worked perfectly and I am finally able to run this step!

PATH=Config().data_path()/Path('rossmann/')
table_names = ['train', 'store', 'store_states', 'state_names', 'googletrend', 'weather', 'test']
tables = [pd.read_csv(PATH/f'{fname}.csv', low_memory=False) for fname in table_names]
train, store, store_states, state_names, googletrend, weather, test = tables
len(train),len(test)

The code that I ended up running at the start of the notebook to download & unzip the file successfully was as follows:

url = "http://files.fast.ai/part2/lesson14/rossmann.tgz"
local_tgz_path = "/home/jupyter/.fastai/data/rossmann/rossmann.tgz"
print("Downloading from %s..." % (url,))
urllib.request.urlretrieve(url, local_tgz_path)
print("Opening using tarfile from %s..." % (local_tgz_path,))
tarred_file = tarfile.open(local_tgz_path)
tarred_file.extractall(path="/home/jupyter/.fastai/data/rossmann")
tarred_file.close()
path = Path("/home/jupyter/.fastai/data/rossmann/")

For anyone wondering, I’m running my code entirely in the cloud on GCP.