untar_data doesn't seem to do anything if I pass in an fname and a dest

I’m running the fast.ai library on my own machine (Ubuntu 16.04), and I have to save datasets to a different drive than the one I’m running the notebook on. But when I pass a dest with the path, an fname with a filename, or both, the function just returns without doing anything. If I don’t pass either, it works fine. My user has permission to write to that disk, so I don’t think it’s a permissions error.
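Roughly what I’m running - a sketch, where /mnt/datadrive is a placeholder for my actual second drive and URLs.PETS is just an example dataset:

from pathlib import Path
from fastai.datasets import untar_data, URLs

dest = Path('/mnt/datadrive/oxford-iiit-pet')
path = untar_data(URLs.PETS,
                  fname='/mnt/datadrive/oxford-iiit-pet.tgz',
                  dest=dest)
print(path)  # just echoes dest back; nothing gets downloaded or extracted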


I believe I am getting a similar error to yours and am unsure how to solve it.

I have also set up the fastai library on a scientific-computing cluster where I can’t tinker as freely as on a personal setup. I have fastai 1.0.18 and fastprogress 0.1.15 in my conda env.

I have tried modifying the Config to download the datasets to a different location, and also tried what you did, specifying the fname and destination to save the data. This just returned the path to the destination I specified and saved an empty “oxford-iiit-pet.tgz”.
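To confirm the archive really is empty (the path below assumes the default config location - adjust it to wherever your config points):

import os

# size of the downloaded archive - 0 bytes for me
fname = os.path.expanduser('~/.fastai/data/oxford-iiit-pet.tgz')
print(os.path.getsize(fname))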

The exact error I get is:

---------------------------------------------------------------------------
EmptyHeaderError                          Traceback (most recent call last)
/genome/scratch/Neuroinformatics/dhoward/conda_envs/fastai_v1/lib/python3.6/tarfile.py in next(self)
   2293             try:
-> 2294                 tarinfo = self.tarinfo.fromtarfile(self)
   2295             except EOFHeaderError as e:

/genome/scratch/Neuroinformatics/dhoward/conda_envs/fastai_v1/lib/python3.6/tarfile.py in fromtarfile(cls, tarfile)
   1089         buf = tarfile.fileobj.read(BLOCKSIZE)
-> 1090         obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
   1091         obj.offset = tarfile.fileobj.tell() - BLOCKSIZE

/genome/scratch/Neuroinformatics/dhoward/conda_envs/fastai_v1/lib/python3.6/tarfile.py in frombuf(cls, buf, encoding, errors)
   1025         if len(buf) == 0:
-> 1026             raise EmptyHeaderError("empty header")
   1027         if len(buf) != BLOCKSIZE:

EmptyHeaderError: empty header

During handling of the above exception, another exception occurred:

ReadError                                 Traceback (most recent call last)
<ipython-input-31-88e4f4086a4d> in <module>()
----> 1 path = untar_data(URLs.PETS)
  2 path

/genome/scratch/Neuroinformatics/dhoward/conda_envs/fastai_v1/lib/python3.6/site-packages/fastai/datasets.py in untar_data(url, fname, dest, data)
 93     if not dest.exists():
 94         fname = download_data(url, fname=fname)
---> 95         tarfile.open(fname, 'r:gz').extractall(dest.parent)
 96     return dest

/genome/scratch/Neuroinformatics/dhoward/conda_envs/fastai_v1/lib/python3.6/tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1584             else:
   1585                 raise CompressionError("unknown compression type %r" % comptype)
-> 1586             return func(name, filemode, fileobj, **kwargs)
   1587 
   1588         elif "|" in mode:

/genome/scratch/Neuroinformatics/dhoward/conda_envs/fastai_v1/lib/python3.6/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
   1638 
   1639         try:
-> 1640             t = cls.taropen(name, mode, fileobj, **kwargs)
   1641         except OSError:
   1642             fileobj.close()

/genome/scratch/Neuroinformatics/dhoward/conda_envs/fastai_v1/lib/python3.6/tarfile.py in taropen(cls, name, mode, fileobj, **kwargs)
   1614         if mode not in ("r", "a", "w", "x"):
   1615             raise ValueError("mode must be 'r', 'a', 'w' or 'x'")
-> 1616         return cls(name, mode, fileobj, **kwargs)
   1617 
   1618     @classmethod

/genome/scratch/Neuroinformatics/dhoward/conda_envs/fastai_v1/lib/python3.6/tarfile.py in __init__(self, name, mode, fileobj, format, tarinfo, dereference, ignore_zeros, encoding, errors, pax_headers, debug, errorlevel, copybufsize)
   1477             if self.mode == "r":
   1478                 self.firstmember = None
-> 1479                 self.firstmember = self.next()
   1480 
   1481             if self.mode == "a":

/genome/scratch/Neuroinformatics/dhoward/conda_envs/fastai_v1/lib/python3.6/tarfile.py in next(self)
   2307             except EmptyHeaderError:
   2308                 if self.offset == 0:
-> 2309                     raise ReadError("empty file")
   2310             except TruncatedHeaderError as e:
   2311                 if self.offset == 0:

ReadError: empty file

Any help debugging is appreciated

@stas I didn’t want to pull Jeremy Howard into this one since I’m not sure it was ever resolved, but this is the same problem I am having. Good evening - I am also trying to get untar_data to work and am running into the same trouble.

The function works when I don’t pass dest = PATH (or whatever location I want the data to go to).

The question I have is: where is the data actually stored when I use the default location? I am using a conda virtual environment, and I have searched everywhere I can think the data might be, but I couldn’t find it.

NOTE: I haven’t changed the default config location, but Deke did, and he has the same problem.


Looks like a bug - please file an Issue at https://github.com/fastai/fastai/issues

I don’t use the default location either; I symlink from where I want the datasets/models to be to ~/.fastai/.
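If you’re on the default config, by the way, everything goes under ~/.fastai/data (models under ~/.fastai/models), so a quick way to see what’s been downloaded, e.g.:

from pathlib import Path

# list whatever datasets have been downloaded so far
for p in (Path.home()/'.fastai'/'data').iterdir():
    print(p)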

Also, at some point the Config object needs to be made settable; if you need that sooner rather than later, please open an Issue at the same place.


Sounds like a plan.

Do you think changing the symlink to a specific path location would be a good idea, or should I leave it for now?

The only reason I ask is that I know I am going to run out of space with all the images and datasets I will be adding while practicing on other datasets, as Jeremy says to do for every lesson.

When you say I need to set the config, is the class below what you are talking about? If so, would I just change it to jdemlow@Jdemlow:~/fastai/course-v3/nbs/dl1/data$, since that is where I would want to be able to see or delete the data if I run out of space? The same goes for the model path. Also, if you don’t mind and have the time, could you explain why our models no longer go to a tmp folder like they used to? If they do, please disregard.

I was thinking about going in with nano, as I prefer that editing method, but I didn’t want to change the global variables without it being suggested.

class Config():
    "Creates a default config file at `~/.fastai/config.yml`"
    DEFAULT_CONFIG_PATH = '~/.fastai/config.yml'
    DEFAULT_CONFIG = {
        'data_path': '~/.fastai/data',
        'model_path': '~/.fastai/models'
    }
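In the meantime, rather than editing anything blindly, I figured I could at least read the current values - a rough sketch, assuming PyYAML is installed and the config file exists at the default path:

import os
import yaml

# print the current fastai config as a dict
with open(os.path.expanduser('~/.fastai/config.yml')) as f:
    print(yaml.safe_load(f))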

Also thank you so much for all your work and everything you guys do.

For the lesson data I allocated a largish SSD/NVMe disk, say at /mnt/nvme/fastai-data, and so I did:

ln -s /mnt/nvme/fastai-data ~/.fastai

So now any untar_data datasets and the corresponding models will go into /mnt/nvme/fastai-data.

So the large data is all at one fs location, while the slim .ipynb files can be anywhere - i.e. they don’t have to be together.

Of course, if the space is getting low, move it elsewhere and fix the symlink.
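For example - a sketch, with URLs.PETS standing in for whatever dataset you grab:

from fastai.datasets import untar_data, URLs

path = untar_data(URLs.PETS)
print(path)
# prints ~/.fastai/data/oxford-iiit-pet, which now physically
# lives under /mnt/nvme/fastai-data thanks to the symlink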

For my own projects, where I don’t use untar_data and prepare data on my own, there is no problem, since everything gets stored locally at whatever path you specify, e.g.:

$ cd myproject
$ ls -1
train.ipynb
data
$ ls -1 data
models
test
train

And you can ignore the Config object for now, since it’s not user-configurable yet.

Perfect and thank you for the help I really appreciate it.

To quickly clarify: the ln -s (location path) ~/.fastai - would I run that in the Jupyter notebook? If so, I will test it right now to confirm it works, so that the next person will be able to use this in a similar fashion. :slight_smile:

It doesn’t matter whether you run it from the console or the notebook. Unless you use some volatile online instance disk that you have to reinstall on every boot, you just need to do it once and forget about it. In my case I use my PC’s GPU card, so I don’t need to redo these things on every instance boot, which you may have to (e.g. Google Colab, etc.).

If you run it from the notebook, remember to add ! before the command.

!ln -s /mnt/nvme/fastai-data ~/.fastai

But, of course, first check that ~/.fastai isn’t already populated; if it is, you will probably want to move/delete its contents first.

So I think I might have broken this.

(base) jdemlow@Jdemlow:~$ rm -rf  ~/.fastai
(base) jdemlow@Jdemlow:~$ ln -s /course-v3/nbs/dl1/data ~/.fastai
(base) jdemlow@Jdemlow:~$ find -L ~/.fastai
/home/jdemlow/.fastai

I think I broke it completely.

I don’t want to waste your time - I know you have a lot of other people to help. Should I just reinstall at this point?

No, reinstalling won’t make any difference. The problem seems to have to do with your filesystem, not fastai.

  1. Your symlink in the notebook of course fails (cell 7 in your snapshot), since you already created a symlink just before that in the shell, as you described above.

  2. Moreover, in the notebook you’re trying to symlink to a different path - why?

  3. Why are you symlinking to /course-v3/nbs/dl1/data? Symlink to somewhere on the filesystem where you have lots of space instead. I suppose this is what you were doing in the notebook; in which case, do it in the console once and you’re done (see the sketch right after this list).

    mkdir /some/path/with/ample/space 
    mv ~/.fastai ~/.fastai-old
    ln -s /some/path/with/ample/space ~/.fastai
    

    of course, adjust it to be a real path :wink:

  4. If (3) didn’t help, please paste the full backtrace of the error, not an image.
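To check whether the symlink is dangling - which is what your find -L output suggests - here’s a quick sketch:

from pathlib import Path

p = Path.home()/'.fastai'
print(p.is_symlink())        # True means ~/.fastai is a symlink
print(p.resolve())           # the target it points to
print(p.resolve().exists())  # False means the target is missing (a dangling link)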

And unrelated, but important in general: it’s a good idea to create a conda environment to work in. It looks like you’re using the base (non-specific) environment; if you mess it up, you can’t delete it like you can an environment you create yourself. So you’d do something like:

conda create python=3.7 --name fastai
conda activate fastai

Adjust python to 3.6 if you prefer py36.


Long story short: your advice worked perfectly, you really helped me out, and now I have the data going to a separate disk.

I know I have said thank you a lot, but thank you again - it’s working now, and I have gotten a much better understanding of symlinks and more, thanks to your kind suggestions. Also, congrats on the Jupyter memory-leak work - I saw the tweet from Jeremy Howard congratulating you.

  1. That makes perfect sense.

  2. I think I misunderstood what it was actually doing. I had to read up on symlinks, and I totally get it now; I will be using them in my daily work from here on out.

  3. This is exactly what I did: I symlinked it to a separate 1TB HDD and kept space free on the SSD where fastai and Anaconda are installed.

  4. For those who might read this later: I am using an NZXT gaming computer dual-booted with Ubuntu, and I SSH into the box from my personal laptop. I am pretty sure the advice above fits every scenario, though. :slight_smile: The reason I had (base) in my prompt was that I was using the terminal inside Jupyter Notebook. I am not completely sure why it showed (base) when conda activate was active, but maybe it has to do with Jupyter Notebook itself.

I am sorry I haven’t gotten to this sooner - I was working on a sprint to deploy my model at work, and we had a ton of complications getting their Artifactory to work, but it was successful and the Pega team is now developing the GUI. I really can’t wait to see if I can get fastai into the workplace, but at work I am stuck with Windows, so we are working to get a Linux box so I can do this. The company doesn’t have GPUs yet, which will be interesting, but with PyTorch now supporting CPU-only, I will begin to experiment.


Glad to hear you got it working, @Jdemlow.

Thank you for the kind words.


Hi,

I’m struggling with using my own dataset for Lesson 1 too. I have uploaded a zip file of chest X-rays (chest.zip), obtained from Kaggle, to the courses/fast-ai/course-v3/nbs/data folder. How do I proceed to unzip this file? (untar_data does not seem to do anything.) I’m using the crestle.ai Jupyter environment.

Thank you for any suggestions!
Nivedita

Can you run unzip from the command line / Jupyter notebook? Then you can point to the appropriate paths using Python and continue with the lesson as planned.

So you should be able to do unzip chest.zip, or !unzip chest.zip if you’re running the command from Jupyter. That should work provided you have the unzip utility installed.

If you don’t have the unzip utility installed and don’t have sudo access, I’d recommend getting the 7zip binary from anaconda (something like https://anaconda.org/bioconda/p7zip) and using that.
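Failing that, Python’s standard library can unzip it too - a minimal sketch, with illustrative paths:

import zipfile

# extract chest.zip into the folder it lives in
with zipfile.ZipFile('chest.zip') as zf:
    zf.extractall('.')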

Thank you for your response. I didn’t think at the time to use the Jupyter command line for some reason! But I was successful using this instead from within the notebook:

from shutil import unpack_archive
unpack_archive('/home/nbuser/courses/fast-ai/course-v3/nbs/data/chest.zip',
               '/home/nbuser/courses/fast-ai/course-v3/nbs/data/')

Of course, the above does not make for reproducible code as this needs to be done only once and so is better achieved via the terminal. But as Jeremy said, better to be experimental and keep going!

Thank you again,
Nivedita

Everything worked for you? Sorry - work last week was brutal. Happy studies :slight_smile:

One thing I just learned is that untar_data won’t do anything if you already have a folder with that name in the data directory. For example, I had to delete my imdb folder and then run untar_data(URLs.IMDB) to download the IMDB dataset. It would be great if untar_data issued a warning saying it was ignoring the download request because the folder already exists!
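In the meantime you can check for that yourself before calling it - a sketch, assuming the default data path (adjust if you’ve moved ~/.fastai):

from pathlib import Path
from fastai.datasets import untar_data, URLs

dest = Path.home()/'.fastai'/'data'/'imdb'
if dest.exists():
    print(f'{dest} already exists - untar_data will skip the download')
path = untar_data(URLs.IMDB)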


Having the same issue here (on GCP) - I simply used !gzip and !tar xvf to work around it.

You can also simply rename the IMDB data folder.