Getting images into a Kaggle project from tar archives

I have image files that I want to use for a project in Kaggle. The images for each class are in a separate tar file on a server. I can’t find any documentation on how to access them from my code. I did upload them as Datasets in Kaggle, but I still can’t figure out how to access them.

The only thing I found was FastDownload, pointed at the URL on my server. But with this code (URL replaced by dots):

from fastdownload import FastDownload
d = FastDownload()
drama_path = d.get('http://.../file.tgz')

I get this error:


/opt/conda/lib/python3.7/site-packages/fastdownload/core.py in get(self, url, extract_key, force)
    119             data = self.data_path(extract_key, urldest(url, self.arch_path()))
    120             if data.exists(): return data
--> 121         self.download(url, force=force)
    122         return self.extract(url, extract_key=extract_key, force=force)

/opt/conda/lib/python3.7/site-packages/fastdownload/core.py in download(self, url, force)
     94         "Download `url` to archive path, unless exists and `self.check` fails and not `force`"
     95         self.arch_path().mkdir(exist_ok=True, parents=True)
---> 96         return download_and_check(url, urldest(url, self.arch_path()), self.module, force)
     97 
     98     def rm(self, url, rm_arch=True, rm_data=True, extract_key='data'):

/opt/conda/lib/python3.7/site-packages/fastdownload/core.py in download_and_check(url, fpath, fmod, force)
     64         else: print("Downloading a new version of this dataset...")
     65     res = download_url(url, fpath)
---> 66     if not check(fmod, url, fpath): raise Exception("Downloaded file is corrupt or not latest version")
     67     return res
     68 

/opt/conda/lib/python3.7/site-packages/fastdownload/core.py in check(fmod, url, fpath)
     47 def check(fmod, url, fpath):
     48     "Check whether size and hash of `fpath` matches stored data for `url` or data is missing"
---> 49     checks = read_checks(fmod).get(url)
     50     return not checks or path_stats(fpath)==checks
     51 

/opt/conda/lib/python3.7/site-packages/fastdownload/core.py in read_checks(fmod)
     40 def read_checks(fmod):
     41     "Evaluated contents of `download_checks.py`"
---> 42     if not fmod.exists(): return {}
     43     txt = fmod.read_text()
     44     return eval(txt) if txt else {}

AttributeError: 'dict' object has no attribute 'exists'

I don’t have an answer, only some help digging…
You can get a broader view of the code in fastdownload/core.py, in the fastai/fastdownload repository on GitHub.

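The error, together with the arrow at line 42, indicates that the “fmod” variable holds a dict rather than a path-like object.
Following fmod back up the stack in the traceback: read_checks(fmod) is called from check(fmod, url, fpath), which is called from download_and_check(url, fpath, fmod, force), which download() in turn calls with self.module as the fmod argument.

So it would seem that: fmod <-- self.module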

And that’s where my python-fu ends, sorry.
Hopefully this partial analysis will make it easier for someone else to recognise the problem.
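
To make the analysis above concrete, here is a minimal sketch that reproduces the failure in isolation. The body of read_checks is copied from the traceback; try_read_checks is a hypothetical helper added only for this illustration, and the empty dict stands in for whatever self.module held (an assumption, since the traceback only tells us it was a dict).

```python
from pathlib import Path


def read_checks(fmod):
    "Evaluated contents of `download_checks.py` (body copied from the traceback)"
    if not fmod.exists(): return {}
    txt = fmod.read_text()
    return eval(txt) if txt else {}


def try_read_checks(fmod):
    # Hypothetical helper for this sketch: return the result,
    # or the error message if read_checks raises AttributeError.
    try:
        return read_checks(fmod)
    except AttributeError as err:
        return str(err)


# A dict has no .exists() method, so this reproduces the reported error.
print(try_read_checks({}))

# A Path to a missing file takes the early-return branch and yields {}.
print(try_read_checks(Path("no_such_dir/download_checks.py")))
```

The first call fails in the same way as the traceback, while the second returns an empty dict, which suggests read_checks expects a Path-like object rather than a dict.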

Thanks, @bencoman.

For the record, I found untar_data, which calls FastDownload internally, and it worked right off the bat when called simply as path = untar_data("http://.../file.tgz")
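
For the archives already uploaded as Kaggle Datasets, here is a minimal sketch using only the standard library. It assumes the dataset is mounted under /kaggle/input/ and that the tar files are still packed there; if Kaggle has already unpacked them on upload, the images can be read directly from that folder instead. The function name extract_class_archives and the dataset folder name in the usage comment are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_class_archives(input_dir, output_dir):
    """Extract each tar archive in input_dir into output_dir/<archive name>.

    Returns a dict mapping the class name (the archive file name up to the
    first dot) to the folder its contents were extracted into.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    extracted = {}
    for archive in sorted(input_dir.iterdir()):
        # Skip sub-folders and anything that is not a readable tar archive.
        if not archive.is_file() or not tarfile.is_tarfile(archive):
            continue
        class_name = archive.name.split(".")[0]
        dest = output_dir / class_name
        dest.mkdir(parents=True, exist_ok=True)
        # Only extract archives you trust: extractall writes whatever
        # paths the archive contains.
        with tarfile.open(archive) as tar:
            tar.extractall(dest)
        extracted[class_name] = dest
    return extracted


# Hypothetical usage in a Kaggle notebook ("my-image-archives" is made up):
# folders = extract_class_archives("/kaggle/input/my-image-archives",
#                                  "/kaggle/working/images")
```

Extracting one folder per archive keeps the class labels recoverable from the folder names.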
