Modin to speed up CSV reading

toddy86 · October 28, 2018, 4:54am

Hi all,

A little off topic, but hopefully something of potential use.

Reading CSVs with pandas can take a long time once the datasets get unwieldy in size.

The guys and gals at Berkeley have developed Modin, which looks promising for significantly speeding up the pandas CSV read function (read_csv).

The GitHub repo can be found here: https://github.com/modin-project/modin

An article on the project is below too for your reading pleasure. A quick Google search will yield more.

Todd

KevinB · October 28, 2018, 5:44am

That’s a really cool idea. Thanks for sharing. It’s cool that they only make you change your import. I guess that’s another reason to stick with the very common default import aliases.
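
For anyone who wants to try it, here is a minimal sketch of the import swap. It assumes Modin is installed (and falls back to plain pandas if it is not); the file name and column names are made up purely for illustration, and I have not benchmarked anything here.

```python
import csv
import os
import tempfile

try:
    # The only line that changes compared to a normal pandas script.
    import modin.pandas as pd
    backend = "modin"
except ImportError:
    # Modin not installed: fall back to plain pandas so the sketch still runs.
    import pandas as pd
    backend = "pandas"

# Write a small, hypothetical CSV so the example is self-contained.
csv_path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    for i in range(1000):
        writer.writerow([i, i * 0.5])

# Same call as in pandas; the rest of the script stays unchanged.
df = pd.read_csv(csv_path)
n_rows = len(df)
columns = list(df.columns)
print(backend, n_rows, columns)
```

Because the alias is still pd, none of the code after the import has to change. Any speedup will only show on much larger files than this toy one.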

noskill · October 28, 2018, 10:28am

For faster reading and saving than CSV, there’s also the Feather format, which is supported in pandas.
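
A rough sketch of a Feather round trip, assuming pyarrow is installed (pandas relies on it for Feather support); the DataFrame contents and file name are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# Small, hypothetical frame; Feather needs the default integer index.
df = pd.DataFrame({"id": range(5), "value": [0.0, 0.5, 1.0, 1.5, 2.0]})

feather_path = os.path.join(tempfile.mkdtemp(), "example.feather")

try:
    df.to_feather(feather_path)               # write the binary Feather file
    restored = pd.read_feather(feather_path)  # read it back
except ImportError:
    # pandas raises ImportError when pyarrow is missing.
    restored = None
    print("pyarrow is not installed, skipping the Feather round trip")

if restored is not None:
    # Check that the data survived the round trip unchanged.
    print(restored.equals(df))
```

Since Feather is a binary columnar format, it only helps once the data has been read from CSV and re-saved; the first CSV read is not made any faster.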
As for Modin, to quote from a discussion thread in the KaggleNoobs Slack channel:
"it’s too slow when modifying data inside. A single row modification leads to a full construction and destruction of data.
Sometimes, aggregates are giving different results (in my case usually, to the 6th decimal on lot of data). Not good when it comes to very high precision timers for instance.
Not good for RAM management also when it comes to modifying data. If you do operations globally, doesn’t really matter, but if you target specific rows you will explode RAM.

My main use cases are simulations, it’s way too slow and too much RAM hungry compared to standard pandas (both correlates very well if you queue a nearly infinite amount of work).
Also you need to check what is supported and not supported, and the limitations of each function… I know json fallbacks to pandas for instance even though it is “available” in Pandas on Ray… >_>" by laurae.

Taka · October 28, 2018, 10:33am

Agreed. But at least in our current use case, we mostly do global operations, so I guess this isn’t an issue?