Anyone explored Rapids.ai yet?


(Even Oldridge) #1

NVIDIA just released a new library that provides GPU support for data processing using pandas syntax. It seems really powerful and could be a good addition to the library / notebooks.

I haven’t evaluated it yet, and I’m curious if anyone has explored it?


(Vladislav) #2

While I am very enthusiastic about GPU-enabled computations, I am skeptical about relying on proprietary tools when this can be avoided. This tool (which, I suppose, is very good in terms of performance and quality) means you will probably need a powerful enough GPU with lots of memory for routine data analysis.
If you do need some speedup, you can utilize multiple cores of your CPU using Dask.
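As a rough illustration of that idea (not Dask's actual API — just the split-apply-combine pattern Dask automates, sketched with only the standard library; all names here are made up for the example):

```python
from multiprocessing import Pool

def partition_stats(values):
    """Aggregate one partition (runs in a worker process)."""
    return sum(values), len(values)

def parallel_mean(data, n_parts=4):
    """Split the data into partitions, aggregate each on a separate
    CPU core, then combine the partial results -- the same
    split-apply-combine scheme Dask automates (plus out-of-core data)."""
    size = max(1, len(data) // n_parts)
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(processes=min(len(parts), n_parts)) as pool:
        partials = pool.map(partition_stats, parts)
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

if __name__ == '__main__':
    print(parallel_mean(list(range(100))))  # -> 49.5
```

Dask's dask.dataframe applies this scheme to pandas partitions, so you keep (most of) the pandas API while using all your cores.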


(Nikolay Tolstokulakov) #3

It is open source, and currently you cannot eliminate proprietary CUDA or Google TPUs from high-performance machine learning pipelines.

BTW it is pandas-like, but not 100% compatible.
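One pragmatic way to live with that compatibility gap (a sketch, not an official pattern; the `xdf` alias is just for illustration) is a drop-in import fallback, which only helps for code that sticks to the API subset cuDF actually implements:

```python
# Use cuDF when it (and a GPU) is available, otherwise fall back to pandas.
try:
    import cudf as xdf       # GPU-backed, pandas-like
except ImportError:
    import pandas as xdf     # plain pandas on CPU

df = xdf.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
total = int(df['x'].sum() + df['y'].sum())  # 66 on either backend
```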


(Marc Rostock) #4

I wanted to explore it, but until 6 hours ago it was all just marketing. I was at the NVIDIA conference where they announced it, and the webpage was up and running, but there was actually nothing to test yet. The instructions said to download the Docker containers, but those were only pushed today and the official repo was not up to date. So nobody can really have tested it yet :wink:

Having said that, I intend to test it a little and see what speedups are actually possible and how it feels compared to pandas.


(Marc Rostock) #5

a whopping 11 people have pulled the docker image so far (probably including their internal tests…)

https://hub.docker.com/u/rapidsai/

and now it’s 12 pulls, that was me :wink:
And this seems to be very much still under development. 5GB docker container??? Not quite as lightweight as pip install pandas :rofl:

EDIT: to be fair, 2.3GB of the 5GB mentioned are tar.gz files with the demo data…


#6

Wes McKinney is tied into the project through the Arrow columnar format, so hopefully the Pandas compatibility improves over time.


(Marc Rostock) #7

So, I played around with it for an hour with the official Docker image. In that hour I only managed to test the read_csv function, unsuccessfully.

My impression is that this is still very much alpha, and I will waste much more time on setup and bug tracking than I will currently save from the possible GPU acceleration. The error messages are quite unhelpful, as you can see below; I cannot even get simple CSVs to load. Of course that might be me, but with pandas all of this is a one-liner that simply works. Documentation is also lacking; I can't even find an official list of the available datatypes. Another little thing: I am currently not able to use pathlib.Path for file access (it errors out, apparently because the file path is expected to be a plain string).
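Until Path objects are accepted, a tiny shim at the call boundary works around it (stdlib-only sketch; the gdf.read_csv call in the comment is just where you would use it):

```python
import os
from pathlib import Path

def as_str_path(path):
    """Work around libraries that insist on plain-string file paths:
    convert pathlib.Path (or any os.PathLike) to str, pass str through."""
    return os.fspath(path) if isinstance(path, (Path, os.PathLike)) else path

# e.g. gdf.read_csv(as_str_path(p / 'ts_testdata.csv'), ...)
```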

One observation: currently the read_csv function alone lacks many of the features that pandas has, which is to be expected. But one of the most annoying missing features is automatic datatype inference. With gdf you have to specify the column names and column datatypes by hand, which means you first have to load a sample with pandas to make this a semi-automated process (unless you have only a small number of columns and want to do it by hand). While I might specify dtypes for pandas sometimes as a speed optimization, having to do it in general is very annoying. I have read, though, that this feature is in the pipeline.

So, I am stopping my experiments for now. Happy to see if any of you get some more interesting results or hints on what my problems could be.

If someone wants to try, this is how I semi-automatically created the column-name and dtype lists (maybe there is a bug here?)

import pandas as pd

# p is a pathlib.Path pointing at the data directory
df_raw = pd.read_csv(p/'ts_testdata.csv', nrows=10)  # sample a few rows
typedf = pd.DataFrame(index=df_raw.columns)
typedf['type'] = None
for col in df_raw.columns:
    typedf.loc[col, 'type'] = str(df_raw[col].dtype)  # .loc[row, col] avoids chained assignment

typedf.loc[typedf['type'] == 'object', 'type'] = 'category'
typedf.loc['time', 'type'] = 'date'
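If it helps anyone, the same column/dtype mapping can be built more compactly from DataFrame.dtypes (a sketch assuming plain pandas; the 'date' type string mirrors the snippet above, and the helper name is made up):

```python
import io
import pandas as pd

def infer_cudf_dtypes(csv_source, nrows=10, date_cols=('time',)):
    """Sample a few rows with pandas, then derive the column-name and
    dtype-string lists that a names=/dtype= style reader expects.
    Object columns become 'category'; listed date columns become 'date'."""
    sample = pd.read_csv(csv_source, nrows=nrows)
    dtypes = {col: str(dt) for col, dt in sample.dtypes.items()}
    dtypes = {col: ('category' if dt == 'object' else dt)
              for col, dt in dtypes.items()}
    for col in date_cols:
        if col in dtypes:
            dtypes[col] = 'date'
    return list(dtypes), list(dtypes.values())

names, types = infer_cudf_dtypes(
    io.StringIO('time,price,label\n2018-01-01,1.5,a\n2018-01-02,2.5,b\n'))
# names -> ['time', 'price', 'label']; types -> ['date', 'float64', 'category']
```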


(Even Oldridge) #8

Thanks for sharing your impressions. It sounds like it's a little too alpha right now to really be useful. I'll be curious to see if they can make it easier to use. Until they do, I don't think I'll be exploring it.


#9

Getting closer - conda installation of cuDF v0.2.0 released yesterday:

conda install -c numba -c conda-forge -c rapidsai -c defaults cudf=0.2.0

https://t.co/FVlFOldKJa


(Christian Werner) #10

Just attended a local meetup where NVIDIA also presented rapids.ai stuff (cuML, cuDF). Super intrigued… Their Docker architecture also seems really nice.

One thing I was wondering: how big is the difference from pandas and sklearn? Are we talking 80%, 90% of the important stuff (minus some edge cases)…?

Reading about read_csv crapping out is kind of a bummer though :frowning:

Also, would there be a seamless solution for working on non-GPU machines with small data locally and then simply pushing to AWS or GCP with loads of GPUs for the big stuff?!?


(Joshua Patterson) #11

Hi all, I’m new here so I’m not sure if this thread is still active, but I’m the director of engineering at NVIDIA responsible for RAPIDS. We’re actively working on fixing the issues discussed here, and the project is fairly new.
We are not at feature parity with pandas or sklearn, but our goal is to continue to add features and improve usability. Reading CSV on the GPU is not trivial, and I think you will be happy with our most recent versions, which also have better string support.

We have new containers and conda packaging, will add pip soon, and did a massive refactor to make our repos cleaner and easier to fix and contribute to. I would love it if you all took a moment to try it out and report any issues you find. We will continue to push ahead, but it’s not possible without support from end users like you.

Thanks!


(Even Oldridge) #12

Hey Joshua,

Great to have you here! I’ve been following the progress on Twitter, and I’ve been impressed with the pace at which the team has been pushing these updates. The conda installs are really helpful, and I’ve got it running on my machine but have yet to work it into an application. I definitely see the need; I currently do most of my preprocessing in my ETL, and this would be a big step up. One thing that might help push adoption is the development of some notebooks that use the library and demonstrate its impact directly. Jeremy uses notebooks quite effectively to spread fastai usage, and it’s been really successful.

I’ll download 0.4 and try to give it a spin sometime later this week. I’m currently on paternity leave with a newborn, so it might take a while to dive into it fully, but thanks for the amazing tool and for connecting with the community.

regards
Even


(Even Oldridge) #13

Really enjoyed your NeurIPS presentation as well. I’ll link here for those looking for more on the topic of rapids: