Most effective ways to merge "big data" on a single machine

Probably the easiest way (as mentioned earlier in the thread) is to just use SQL.
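To make that concrete, here's a minimal sketch of a single-machine SQL merge. I'm assuming DuckDB as the engine and making up the file names and the `id` join key - the point is just that a plain SQL join handles larger-than-memory inputs without any custom code:

```python
# Minimal sketch of a single-machine SQL merge, assuming DuckDB
# (pip install duckdb). The CSV paths and "id" key are placeholders.
import duckdb

con = duckdb.connect()

# DuckDB reads the CSVs directly and writes the joined result to
# Parquet, streaming rather than loading everything into RAM first.
con.execute("""
    COPY (
        SELECT a.*, b.value
        FROM read_csv_auto('left.csv')  AS a
        JOIN read_csv_auto('right.csv') AS b
        USING (id)
    ) TO 'merged.parquet' (FORMAT PARQUET)
""")
```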


That’s also a good solution, yes :slight_smile:

I’d also echo the SQL or Spark-based approaches others mentioned - they’re great general solutions.
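For reference, the Spark route works fine on a single machine in local mode too. This is only a sketch - the Parquet paths and the `id` join key are invented for illustration:

```python
# Rough sketch of a single-machine merge with Spark in local mode.
# File paths and the "id" join key are placeholders, not from the thread.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("single-machine-merge")
    .master("local[*]")  # use all local cores; no cluster required
    .getOrCreate()
)

left = spark.read.parquet("left.parquet")
right = spark.read.parquet("right.parquet")

# Inner join on the shared key; Spark spills to local disk
# when the shuffle exceeds available memory.
merged = left.join(right, on="id", how="inner")
merged.write.mode("overwrite").parquet("merged.parquet")
```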

Apologies that I didn’t tie the file-formats discussion back to the main topic. One reason we cared about it at my shop is that some of our pipelines generate data very quickly and have hard daily deadlines (you need to finish your data work on yesterday’s market data before the market opens today). There were a couple of merge, sort, and search steps in the pipeline, and all the canned solutions (SQL and the like) struggled - I/O can become a bottleneck in some of these. In the end, the team wrote its own merging (and related) routines over carefully structured binary data formats, which gave big gains (at least an order of magnitude in both speed and data size).
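I can’t share the actual code, but to give a flavor of the general idea, here’s a toy k-way merge over pre-sorted binary files of fixed-width records. The 16-byte record layout (8-byte key, 8-byte payload) is invented for the example and is not their real format:

```python
# Toy k-way merge over pre-sorted binary files of fixed-width records.
# The 16-byte layout (8-byte key, 8-byte payload) is made up for
# illustration; it is not the real proprietary format.
import heapq
import struct
from typing import BinaryIO, Iterator

RECORD = struct.Struct("<qq")  # little-endian: 8-byte key, 8-byte payload

def records(f: BinaryIO) -> Iterator[tuple[int, int]]:
    """Stream (key, payload) tuples from one sorted binary file."""
    while chunk := f.read(RECORD.size):
        yield RECORD.unpack(chunk)

def merge_files(in_paths: list[str], out_path: str) -> None:
    files = [open(p, "rb") for p in in_paths]
    try:
        with open(out_path, "wb") as out:
            # heapq.merge lazily merges the already-sorted streams,
            # so memory use stays constant regardless of file size.
            for key, payload in heapq.merge(*(records(f) for f in files)):
                out.write(RECORD.pack(key, payload))
    finally:
        for f in files:
            f.close()
```

Presumably much of the order-of-magnitude win they saw comes from fixed-width binary records keeping the I/O sequential and skipping text parsing entirely, though the real format was surely more elaborate than this.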

I definitely don’t recommend doing this most of the time - but I thought it was worth mentioning as an odd edge case.

Don’t worry about that - that was really great info! :slight_smile:

Interesting, what kind of custom “tool” did they make?

Couldn’t spawning a large Spark cluster have sped the whole thing up?

Their solution (as much as I can really talk about it) had some analogies to batching, and it became fairly specific to the workflow (i.e. not easily generalizable) - it leveraged knowledge of the entire flow to ensure relative data locality and to optimize for bandwidth at various levels (network, disk, even cache).

I believe the cost of getting equivalent performance with Spark in the cloud was the deal breaker there. High-performance general solutions are amazing for 99+% of use cases, but in specific cases you do have to give up generality to get the best solution (and in some of those cases the solution is massively valuable IP - high-frequency trading is a canonical example).
