09_tabular Kaggle API not working with Paperspace

So I’ve read through many of the threads about configuring the Kaggle API, but I’m still not able to get 09_tabular to read the data.

Here is what I have done:
- kaggle.json file is uploaded
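For reference, the Kaggle client looks for the key at ~/.kaggle/kaggle.json and warns if it is readable by other users. Here is a minimal sketch of installing an uploaded key file; the source path and helper name are assumptions, so adjust to wherever your file actually landed:

```python
import os
import shutil
from pathlib import Path

def install_kaggle_key(src: Path, kaggle_dir: Path) -> Path:
    """Copy an uploaded kaggle.json into the config dir and lock down its permissions."""
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    dest = kaggle_dir / "kaggle.json"
    shutil.copy(src, dest)
    os.chmod(dest, 0o600)  # the Kaggle client complains if the key is world-readable
    return dest

# Typical usage in a notebook (paths are assumptions):
# install_kaggle_key(Path("kaggle.json"), Path.home() / ".kaggle")
```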

When I tried to run the following code:

df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)

I get the following error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_581/390870293.py in <module>
----> 1 df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)

/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    584     kwds.update(kwds_defaults)
    585 
--> 586     return _read(filepath_or_buffer, kwds)
    587 
    588 

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    480 
    481     # Create the parser.
--> 482     parser = TextFileReader(filepath_or_buffer, **kwds)
    483 
    484     if chunksize or iterator:

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
    809             self.options["has_index_names"] = kwds["has_index_names"]
    810 
--> 811         self._engine = self._make_engine(self.engine)
    812 
    813     def close(self):

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers/readers.py in _make_engine(self, engine)
   1038             )
   1039         # error: Too many arguments for "ParserBase"
-> 1040         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1041 
   1042     def _failover_to_python(self):

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers/c_parser_wrapper.py in __init__(self, src, **kwds)
     49 
     50         # open handles
---> 51         self._open_handles(src, kwds)
     52         assert self.handles is not None
     53 

/opt/conda/lib/python3.7/site-packages/pandas/io/parsers/base_parser.py in _open_handles(self, src, kwds)
    227             memory_map=kwds.get("memory_map", False),
    228             storage_options=kwds.get("storage_options", None),
--> 229             errors=kwds.get("encoding_errors", "strict"),
    230         )
    231 

/opt/conda/lib/python3.7/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    705                 encoding=ioargs.encoding,
    706                 errors=errors,
--> 707                 newline="",
    708             )
    709         else:

FileNotFoundError: [Errno 2] No such file or directory: '/storage/archive/bluebook/TrainAndValid.csv'




Banging my head trying to figure out how to fix this :(

What does the following show?

!ls /storage/archive/bluebook

I got the same error as before

Almost certainly, you haven’t correctly extracted the CSVs from the zip file. So when you try to read in TrainAndValid.csv, it’s not there – just the zip file you originally downloaded.

In your notebook (assuming your notebook is in the same directory as the downloaded zip file), try:

! unzip bluebook-for-bulldozers.zip
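If unzip turns out not to be available, Python’s standard-library zipfile module can do the same job from a notebook cell. A minimal sketch (the filename is the one from this thread; the destination directory and helper name are my own assumptions):

```python
import zipfile
from pathlib import Path

def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir (created if needed)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return sorted(p.name for p in dest.iterdir())

# e.g. in a notebook cell:
# extract_zip("bluebook-for-bulldozers.zip", ".")
```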


When I ran the above code, this happened.

That’s weird, because in my Kaggle kernel both zip and unzip exist at /usr/bin/.

So, if I do

!ls -l /usr/bin/unzip

it lists the file which means it exists.

EDIT: I read “kaggle” in the original post and assumed it was being run in a Kaggle notebook. My bad! You should do as Bencoman and Nick suggested. It just means your Linux OS doesn’t have that command installed.


I think you’ll gain more if I teach you how to fish…

Googling your error message: “bin bash unzip command not found”
returns this in the top five… https://command-not-found.com/unzip

Which then raises the question: how do I tell what version of Linux I’m on?
which finds this… https://www.cyberciti.biz/faq/find-linux-distribution-name-version-number/

noting that where you see a dollar-sign (“$”) shell prompt, from within a notebook shell commands are preceded with an exclamation mark (“!”).
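You can also ask Python directly which distribution you’re on: most modern distributions ship /etc/os-release, a simple KEY=value file. A small sketch of parsing it (the helper name is my own):

```python
from pathlib import Path

def parse_os_release(text):
    """Parse /etc/os-release-style KEY=value lines into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info

# On a Linux box:
# info = parse_os_release(Path("/etc/os-release").read_text())
# print(info.get("PRETTY_NAME"))
```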


Which platform are you on? If Linux, on the command line you can install the unzip utility (or via the ! prefix in the notebook):

sudo apt-get install unzip

And then you should be able to use the unzip command in the notebook


That presumes $PATH is good (which it should be). More definitive is…

!which unzip

which will report /usr/bin/unzip if it’s found there on $PATH, and nothing if it isn’t found.
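The same check can be done from Python with the standard library’s shutil.which, which searches $PATH much like the shell’s which; a small sketch (the helper name is mine):

```python
import shutil

def has_command(name):
    """True if `name` resolves to an executable somewhere on $PATH."""
    return shutil.which(name) is not None

# e.g. has_command("unzip") -- if False, install it first
# (sudo apt-get install unzip) before calling !unzip in the notebook
```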


OK, so I uploaded the zip file and ensured that my code can see it.

I’m still having a hard time figuring out exactly the correct Python code to extract it.
The concept of directory paths with Paperspace is still a somewhat hard concept for me to understand. I understand paths on my CPU but not really on my GPU via Paperspace.

@bbrown, I haven’t used Paperspace, but I need to soon, so this looks like a good opportunity for some goal-directed learning. Can you link the whole notebook you are using so that I can clone it?

I’m using the code from the fastai github

is this what you need?

There’s nothing GPU specific on Paperspace when it comes to file handling. They make a virtual machine available which runs linux. That VM just has access to a GPU for when your pytorch code needs it. All the path management happens at Linux level so the commands you’re familiar with on your local machine should work the same (assuming you’re also using a version of Linux.)

btw os.listdir() and !ls do essentially the same thing.
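A quick illustration of that equivalence (helper names are mine; fastai’s path.ls() is built on the same pathlib-style iteration):

```python
import os
from pathlib import Path

def listing(path="."):
    """Roughly what `!ls` shows, as a sorted list of names."""
    return sorted(os.listdir(path))

def listing_pathlib(path="."):
    """Same names via pathlib iteration."""
    return sorted(p.name for p in Path(path).iterdir())
```

(One small difference from plain !ls: os.listdir includes dotfiles, which ls hides by default.)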


@bencoman interested to see if you can configure the Paperspace environment to pull in Kaggle datasets via the API. Many of the solutions provided in this thread are grounded in some kind of “work-around” of the template provided in the 09_tabular fastai notebook.

I hope to do more experimenting with the Kaggle competitions and would like to develop a solution that leverages Kaggle’s API to fetch and open training and testing datasets.

Any help in developing this would be greatly appreciated!!!
~Bryon

@bbrown, I think I got it sorted. This is my first time using Paperspace, so I ran the notebook in parallel in Colab. I present a detailed action log for other newcomers following along…

  1. Signed up and created a notebook…


  2. The first two cells…
    in Colab worked okay, but
    in Paperspace failed as shown here, until green text cell-1-line-2 was deleted.

  3. Followed instructions to get kaggle api key…


  4. Key installation worked fine on both Colab and Paperspace…

  5. Now trying the API download, first on Colab. Here I diverged from you; I had this additional hurdle… (I’m yet to do my first Kaggle competition, so I didn’t know about accepting competition conditions.) [addendum: whoops! I just noticed my failure to read the instructions to do so, since I skimmed to try quickly replicating where you got up to.]


    Fixed by clicking the manual download button on the Kaggle site,
    and then accepting the conditions…

  6. Trying again, on Paperspace this time, the API download worked…


    But Colab didn’t…!!! It took me a while to work out,
    since it’s subtle, and really only because I ran Paperspace in parallel
    did I notice the difference in the output of path.ls():
    it’s an empty list [] compared to the one further up,
    meaning there are no files in the directory, hence the following failure.

Analysis

Bryon, Not sure if your situation is identical, but the symptoms are similar. Here is a breakdown of my situation…

The first call of the API failed after the path folder had already been created.
Thus subsequent runs of the cell never retry the API download.

Remediation

This would seem to be a fairly common occurrence for newbies (like myself):
they miss accepting the competition terms (in spite of the notebook instructing them to do so),
or they mess up their API key.

A more robust test would be against the actual file being loaded, which can only exist after a successful download and unpack.

[Edit:]
Jeremy, here is a PR for that change…
https://github.com/fastai/fastbook/pull/514/files
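Without restating the PR’s exact code, the idea can be sketched like this (the function and parameter names are hypothetical, and the download step is passed in as a stub rather than the real Kaggle API call):

```python
from pathlib import Path

def ensure_dataset(path, marker="TrainAndValid.csv", download=None):
    """Run `download` unless the file we actually need already exists.

    Testing `path.exists()` alone is brittle: a failed first attempt can
    leave an empty folder behind, so later runs skip the download forever.
    """
    target = Path(path) / marker
    if not target.exists():
        Path(path).mkdir(parents=True, exist_ok=True)
        if download is not None:
            download()  # e.g. the Kaggle API call plus unzip from the notebook
    return target
```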


@bencoman this was the exact issue I was running up against. It must be the terms that I did not accept initially. Thank you so much for taking the time and hacking it out!!!
