09_tabular Kaggle API not working with Paperspace

So I’ve read through many of the threads about configuring the Kaggle API, but I’m still not able to get 09_tabular to read the data.

Here is what I have done:
- kaggle.json file is uploaded
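For reference, the Kaggle client looks for the key at ~/.kaggle/kaggle.json and warns if it is readable by other users. Here is a minimal sketch of installing an uploaded key file; the source path and helper name are assumptions, so adjust to wherever your file actually landed:

```python
import os
import shutil
from pathlib import Path

def install_kaggle_key(src: Path, kaggle_dir: Path) -> Path:
    """Copy an uploaded kaggle.json into the config dir and lock down its permissions."""
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    dest = kaggle_dir / "kaggle.json"
    shutil.copy(src, dest)
    os.chmod(dest, 0o600)  # the Kaggle client complains if the key is world-readable
    return dest

# Typical usage in a notebook (paths are assumptions):
# install_kaggle_key(Path("kaggle.json"), Path.home() / ".kaggle")
```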

When I tried to run the following code:

df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)

I get the following error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_581/390870293.py in <module>
----> 1 df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)

/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    584     kwds.update(kwds_defaults)
    585 
--> 586     return _read(filepath_or_buffer, kwds)
    587 
    588 

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    480 
    481     # Create the parser.
--> 482     parser = TextFileReader(filepath_or_buffer, **kwds)
    483 
    484     if chunksize or iterator:

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
    809             self.options["has_index_names"] = kwds["has_index_names"]
    810 
--> 811         self._engine = self._make_engine(self.engine)
    812 
    813     def close(self):

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers/readers.py in _make_engine(self, engine)
   1038             )
   1039         # error: Too many arguments for "ParserBase"
-> 1040         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1041 
   1042     def _failover_to_python(self):

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers/c_parser_wrapper.py in __init__(self, src, **kwds)
     49 
     50         # open handles
---> 51         self._open_handles(src, kwds)
     52         assert self.handles is not None
     53 

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers/base_parser.py in _open_handles(self, src, kwds)
    227             memory_map=kwds.get("memory_map", False),
    228             storage_options=kwds.get("storage_options", None),
--> 229             errors=kwds.get("encoding_errors", "strict"),
    230         )
    231 

/opt/conda/lib/python3.7/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    705                 encoding=ioargs.encoding,
    706                 errors=errors,
--> 707                 newline="",
    708             )
    709         else:

FileNotFoundError: [Errno 2] No such file or directory: '/storage/archive/bluebook/TrainAndValid.csv'




Banging my head trying to figure out how to fix this :(

What does the following show?

!ls /storage/archive/bluebook

I got the same error as before

Almost certainly, you haven’t correctly extracted the CSVs from the zip file. So when you try to read in TrainAndValid.csv, it’s not there – just the zip file you originally downloaded.

In your notebook (assuming your notebook is in the same directory as the downloaded zip file), try:

! unzip bluebook-for-bulldozers.zip
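If unzip turns out not to be available, Python’s standard-library zipfile module can do the same job from a notebook cell. A minimal sketch (the filename is the one from this thread; the destination directory and helper name are my own assumptions):

```python
import zipfile
from pathlib import Path

def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir (created if needed)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return sorted(p.name for p in dest.iterdir())

# e.g. in a notebook cell:
# extract_zip("bluebook-for-bulldozers.zip", ".")
```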


When I ran the above code, this happened.

That’s weird, because in my Kaggle kernel both zip and unzip exist at /usr/bin/.

So, if I do

!ls -l /usr/bin/unzip

it lists the file which means it exists.

EDIT: I read “kaggle” in the original post and assumed it was being run in a Kaggle notebook. My bad! You should do as Bencoman and Nick suggested. It just means your Linux OS doesn’t have that command installed.


I think you’ll gain more if I teach you how to fish…

Googling your error message: “bin bash unzip command not found”
returns this in the top five… https://command-not-found.com/unzip

Which then raises the question: how do I tell what version of Linux I’m on?
which finds this… https://www.cyberciti.biz/faq/find-linux-distribution-name-version-number/

noting that where you see a dollar-sign (“$”) shell prompt, from within a notebook shell commands are preceded with an exclamation mark (“!”).
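You can also ask Python directly which distribution you’re on: most modern distributions ship /etc/os-release, a simple KEY=value file. A small sketch of parsing it (the helper name is my own):

```python
from pathlib import Path

def parse_os_release(text):
    """Parse /etc/os-release-style KEY=value lines into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info

# On a Linux box:
# info = parse_os_release(Path("/etc/os-release").read_text())
# print(info.get("PRETTY_NAME"))
```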


Which platform are you on? If Linux, on the command line you can install the unzip utility (or via the ! prefix in the notebook):

sudo apt-get install unzip

And then you should be able to use the unzip command in the notebook


That presumes $PATH is good (which it should be). More definitive is…

!which unzip

which will report /usr/bin/unzip if it’s found there on $PATH, and nothing if it isn’t found.
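The same check can be done from Python with the standard library’s shutil.which, which searches $PATH much like the shell’s which; a small sketch (the helper name is mine):

```python
import shutil

def has_command(name):
    """True if `name` resolves to an executable somewhere on $PATH."""
    return shutil.which(name) is not None

# e.g. has_command("unzip") -- if False, install it first
# (sudo apt-get install unzip) before calling !unzip in the notebook
```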


OK, so I uploaded the zip file and ensured that my code can see it.

I’m still having a hard time figuring out exactly the correct Python code to extract it.
The concept of directory paths with Paperspace is still a somewhat hard concept for me to understand. I understand paths on my CPU but not really on my GPU via Paperspace.

@bbrown, I haven’t used Paperspace, but I need to soon, so this looks like a good opportunity for some goal-directed learning. Can you link the whole notebook you are using so that I can clone it?

I’m using the code from the fastai github

is this what you need?

There’s nothing GPU specific on Paperspace when it comes to file handling. They make a virtual machine available which runs linux. That VM just has access to a GPU for when your pytorch code needs it. All the path management happens at Linux level so the commands you’re familiar with on your local machine should work the same (assuming you’re also using a version of Linux.)

btw os.listdir() and !ls do essentially the same thing.
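A quick illustration of that equivalence (helper names are mine; fastai’s path.ls() is built on the same pathlib-style iteration):

```python
import os
from pathlib import Path

def listing(path="."):
    """Roughly what `!ls` shows, as a sorted list of names."""
    return sorted(os.listdir(path))

def listing_pathlib(path="."):
    """Same names via pathlib iteration."""
    return sorted(p.name for p in Path(path).iterdir())
```

(One small difference from plain !ls: os.listdir includes dotfiles, which ls hides by default.)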


@bencoman interested to see if you can configure the Paperspace environment to pull in Kaggle datasets via the API. Many of the solutions provided in this thread are grounded in some kind of “work-around” of the template provided in the 09_tabular fastai notebook.

I hope to do more experimenting with the Kaggle competitions and would like to develop a solution that leverages Kaggle’s API to fetch and open training and testing datasets.

Any help in developing this would be greatly appreciated!!!
~Bryon

@bbrown, I think I got it sorted. This is my first time using Paperspace, so I ran the notebook in parallel in Colab. I present a detailed action log for other newcomers following along…

  1. Signed up and created a notebook…


  2. The first two cells…
    in Colab worked okay, but
    in Paperspace failed as shown here, until green text cell-1-line-2 was deleted.

  3. Followed instructions to get kaggle api key…


  4. Key installation worked fine on both Colab and Paperspace…

  5. Now trying the API download, first on Colab. Here I diverged from you; I had this additional hurdle… (I’m yet to do my first Kaggle competition, so I didn’t know about accepting competition conditions.) [addendum: whoops! I just noticed my failure to read the instructions to do so, since I skimmed to try quickly replicating where you got up to.]


    Fixed by clicking the manual download button on the Kaggle site,
    and then accepting the conditions…

  6. Trying again, on Paperspace this time, the API download worked…


    But Colab didn’t…!!! It took me a while to work out,
    since it’s subtle, and really only because I ran Paperspace in parallel
    did I notice the difference in the output of path.ls():
    it’s an empty list [] compared to the one further up,
    meaning there are no files in the directory, hence the following failure.

Analysis

Bryon, Not sure if your situation is identical, but the symptoms are similar. Here is a breakdown of my situation…

The first call of the API failed after the path folder had already been created.
Thus subsequent runs of the cell never retry the API download.

Remediation

This would seem to be a fairly common occurrence for newbies (like myself):
they miss accepting the competition terms (in spite of the notebook instructing them to do so),
or they mess up their API key.

A more robust test would be against the actual file being loaded, which can only exist after a successful download and unpack.

[Edit:]
Jeremy, here is a PR for that change…
https://github.com/fastai/fastbook/pull/514/files
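Without restating the PR’s exact code, the idea can be sketched like this (the function and parameter names are hypothetical, and the download step is passed in as a stub rather than the real Kaggle API call):

```python
from pathlib import Path

def ensure_dataset(path, marker="TrainAndValid.csv", download=None):
    """Run `download` unless the file we actually need already exists.

    Testing `path.exists()` alone is brittle: a failed first attempt can
    leave an empty folder behind, so later runs skip the download forever.
    """
    target = Path(path) / marker
    if not target.exists():
        Path(path).mkdir(parents=True, exist_ok=True)
        if download is not None:
            download()  # e.g. the Kaggle API call plus unzip from the notebook
    return target
```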


@bencoman this was the exact issue I was running up against. It must be the terms that I did not accept initially. Thank you so much for taking the time and hacking it out!!!
