Untar_data requires .tgz file ending

(Nick Switanek) #1

untar_data requires a download URL with a .tgz file ending in order to run.
The Caltech 101 data in http://course.fast.ai/datasets is a .tar.gz file.

Python’s tarfile package should be able to handle either .tgz or .tar.gz files, so I think the problem is just with the implementation of download_data in datasets.py, which expects a URL without a file ending and then appends a .tgz ending.
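For illustration, here is a simplified sketch of the suspected logic (hypothetical code, not the exact fastai source, with placeholder URLs):

# The helper assumes the URL carries no extension and always appends '.tgz'.
def _url2tgz(url):
    return url + '.tgz'

_url2tgz('https://example.com/caltech101')
# -> 'https://example.com/caltech101.tgz'         (the intended behaviour)
_url2tgz('https://example.com/caltech101.tar.gz')
# -> 'https://example.com/caltech101.tar.gz.tgz'  (no such file, so the download fails)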

In the short term, it’d be helpful if whoever has access to the S3 bucket would save the Caltech 101 data as a .tgz file like the rest of the image classification datasets.


(Jeremy Howard (Admin)) #2

Will do. And the current URLs class is pretty awkward, so if anyone wants to help make it cleaner (without breaking existing usage) that would be great!


(Scott H Hawley) #3

Done. Pull request https://github.com/fastai/fastai/pull/1047: “Leave URLs already ending in .tar.gz or .tgz alone”

Modified _url2tgz(url) to check whether there was already a “.tar.gz” or “.tgz” on the URL before adding it on.
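In sketch form, the check looks like this (simplified; the real helper also builds the local download path):

def _url2tgz(url):
    if url.endswith('.tar.gz') or url.endswith('.tgz'):
        return url          # extension already present: leave the URL alone
    return url + '.tgz'     # otherwise keep the old behaviour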

Found this thread because my URL had the .tar.gz on the end of it!


(Anirudh Sundar) #5

I’m still seeing download_data() appending ‘.tgz’ at the end. Has this change still not been added to the code? Is there any way to use .tar.gz URLs with these methods now?


#6

Yes, it looks like this issue is still not fixed.


#7

As of v1.0.50, download_data has an ext parameter that you can set to '' if you don’t want anything appended.
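For example (placeholder URL; assumes fastai v1.0.50+):

from fastai.datasets import download_data
import tarfile

# ext='' stops download_data from appending '.tgz' to the name it builds.
tgz_path = download_data('http://example.com/mydata.tar.gz', ext='')
with tarfile.open(tgz_path) as f:
    f.extractall(tgz_path.parent)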


#8

Using untar_data has been the most frustrating experience I have encountered in quite some time. Never have I encountered an API like this that accepts a string, but then modifies the string under the hood. Absolutely bonkers. This justifies a breaking change.


#9

I’m not sure I understand your frustration. untar_data is a helper function to download the datasets that are present in URLs.something, which it does. It’s not intended to magically work with any URL you pass it.
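The intended usage is with the predefined constants, e.g.:

from fastai.datasets import untar_data, URLs

# URLs.PETS is one of the dataset constants untar_data is designed around.
path = untar_data(URLs.PETS)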


#10

Hey sgugger, my frustration is with the implementation and documentation of this function.

This is the function signature:

untar_data(url:str, fname:PathOrStr=None, dest:PathOrStr=None, data=True, force_download=False)

The only required parameter is a url, which is a string. If you pass in a valid URL as a string, e.g. a URL from https://course.fast.ai/datasets such as “https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz”, this function dies.
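That is, the obvious call fails:

from fastai.datasets import untar_data

# A real dataset URL from the course page, passed as the signature suggests...
untar_data('https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz')
# ...dies, because '.tgz' is appended again under the hood -> ...csv.tgz.tgz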

I am sure I am not the only person who has run into this. If you name a string parameter “url”, people will attempt to pass in URLs as strings. Moreover, there is absolutely nothing in the documentation that indicates this unusual behaviour.

There is nothing in the documentation for untar_data indicating that it is only intended for some of the hardcoded example datasets; the function does not live in the URLs class, and it has no type-system support to indicate that it is not intended for arbitrary untarring of files.


(Kishore Iyer) #11

I ran into the same issue while trying to train a model using my own data in lesson1, and managed to get past it using Python 3’s urllib.request, tarfile and pathlib modules. Thought I would share this here in case it helps. There may be a fastai-library way to do this (?).

I am on Colab and also have a local setup where I use Python 3.6 and fastai 1.0.5 with an NVIDIA RTX 2080 GPU.

Basically, using lesson1’s code as a reference, I did the following things differently.

First, in the notebook cell on imports:

from pathlib import Path
import urllib.request
import tarfile

Next, instead of using untar_data, I used the following code. Note that I am not at liberty to share the exact URL I used, so replace the url below with your own to make this work. Note also that I don’t check whether the downloaded file is actually a tar file, or whether the calls succeed; those checks can be added easily.

url = "http://myserver/output/proj1_training_data.tar.gz"
local_tgz_path = Path("/root/.fastai/data/proj1_training_data.tar.gz")
print("Downloading from %s..." % (url,))
urllib.request.urlretrieve(url, local_tgz_path)
print("Opening using tarfile from %s..." % (local_tgz_path,))
tarred_file = tarfile.open(local_tgz_path)
tarred_file.extractall(path="/root/.fastai/data/")
tarred_file.close()
path = Path("/root/.fastai/data/proj1_training_data/")

The rest of the lesson1 notebook can remain the same, and it all works out. My training data is also organized the same way as the Oxford Pet data used in lesson1.


(Ami Aram) #12

I lost many hours trying, inefficiently, to download my .tar.gz dataset. It always returned the same error: ReadError: not a gzip file

Your changes worked very well for me, except that I was getting a permission-denied error on the path. I easily resolved it by replacing /root/ with /home/jupyter/.

Thank you so much for sharing.


(Christina Seventikidou) #13

Hello! I would like to ask: the docs say that the data and models are downloaded by default to ~/.fastai. How can we change that path?


#14

You can change it in your config file, located in ~/.fastai/config.yml
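If you prefer to script the change, here is a minimal sketch using PyYAML (the key names data_path and model_path are assumptions based on a typical fastai v1 config.yml; check your own file first):

from pathlib import Path
import yaml

cfg_file = Path.home() / '.fastai' / 'config.yml'
cfg = yaml.safe_load(cfg_file.read_text())
cfg['data_path'] = '/home/jupyter/data'   # example path, adjust to taste
cfg_file.write_text(yaml.dump(cfg))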


(Christina Seventikidou) #15

I changed it like this:

PATH = '/home/jupyter'   # here is my path
path = untar_data(URLs.PETS, dest=PATH)
path
PosixPath('/home/jupyter/oxford-iiit-pet')
path.ls()
[PosixPath('/home/jupyter/oxford-iiit-pet/images'), PosixPath('/home/jupyter/oxford-iiit-pet/annotations')]

And all the images have been downloaded there, so now I can see all the data (because I could not find it anywhere with the default location) without changing the default config location for now.
Thank you very much for your time! :slight_smile:


#16

I appreciate the work that goes into building this, but the API could be designed in a more human-friendly way. Nothing about the function’s name or its error message helps. It’s easy to burn hours on things like this rather than on learning AI.

This is a good mentality for designing APIs: https://blog.codinghorror.com/falling-into-the-pit-of-success/


(Nicholas Wickman) #17

The first thing I did after playing with some numbers was plug in a URL to a .tar.gz dataset, hoping to inspect what it looked like, and I ran into this error (which led me to this thread).

The helper function is introduced in the same lecture that stresses how easy and convenient the fast.ai library makes experimentation and that explains the most common formats of publicly available datasets, so it’s understandable that people think they can plug in a .tar URL and start examining what they got.

- The function lists a URL as a parameter which receives a string.
- The existing variable is “URLs.something”, so my first assumption was that there was just a dict of URLs somewhere.
- The function is simply named untar_data.

It could be clearer that this isn’t a general-purpose utility.


(chao wang) #18

I don’t know why, but it does work!
I deleted the tail (such as .tgz) from the URL, and it ran successfully.
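In other words (placeholder URL; this works because untar_data re-appends the extension itself):

from fastai.datasets import untar_data

url = 'http://example.com/mydata.tgz'    # placeholder
path = untar_data(url[:-len('.tgz')])    # drop the tail; it gets added back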
