A tool for easy parallelization of text processing

rob · February 27, 2018, 3:27am

Does anyone know of a good tool for parallelizing text processing, or scripts in general?

I’d like a tool that, when given a folder full of files and a script that runs on one file at a time:

1. Partitions the files into N roughly equal-sized folders (by word count or file size), where N is the number of physical or virtual cores available
2. Runs the script in N parallel processes, one process for each folder
3. Can restart the script in case data causes it to bomb
4. Can set aside any error-causing data so new data can be processed
5. Can run on my local machine or on a set of nodes in the cloud
6. (optional, great-to-have) Allows you to build the pipeline starting from a visual dataflow diagram layer
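
To make the partitioning step concrete, here is a minimal Python sketch of one way to split a folder into N size-balanced groups. It is not taken from any particular tool: the function names `partition_by_size` and `partition_folder` are hypothetical, it balances by file size in bytes rather than word count, and it returns lists of paths instead of physically moving files into folders.

```python
import heapq
import os
from pathlib import Path


def partition_by_size(paths, n_bins):
    """Greedy split: largest file first, always into the currently lightest bin."""
    sized = sorted(((os.path.getsize(p), str(p)) for p in paths), reverse=True)
    # Heap entries are (total_bytes, bin_index), so the lightest bin pops first.
    heap = [(0, i) for i in range(n_bins)]
    heapq.heapify(heap)
    bins = [[] for _ in range(n_bins)]
    for size, path in sized:
        total, idx = heapq.heappop(heap)
        bins[idx].append(path)
        heapq.heappush(heap, (total + size, idx))
    return bins


def partition_folder(folder, n_bins=None):
    """Partition the regular files in `folder`; defaults to one bin per logical core."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    return partition_by_size(files, n_bins or os.cpu_count() or 1)
```

The greedy largest-first rule is a heuristic, so the groups are only approximately balanced, which should be enough when the goal is just to keep all cores busy.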

I thought Apache Beam might be such a tool, but it looks like that only works on Google’s Cloud.

I wrote a bash script that can do #1-4, but having gone through that, I’d prefer a full-featured tool.

I found this overview of ETL tools, and maybe Dremio can work, though I thought I’d check around before diving into all these tools.

Data parallelism is much easier than parallelizing the algorithms themselves or writing custom parallel code for every script. It is especially useful for text processing, particularly with heavyweight tools like Stanford CoreNLP. It is a lazy approach, since the processes share no memory, but it is much easier to implement and maintain.
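
As a rough illustration of this kind of shared-nothing data parallelism, the sketch below runs an arbitrary command on each file in its own OS process, retries failures, and moves files that keep failing into a quarantine folder. It is an assumption-laden sketch, not an existing tool: `run_one`, `run_all`, and the quarantine layout are hypothetical names, and it schedules work per file rather than per pre-built folder.

```python
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_one(cmd, path, quarantine_dir, retries=1):
    """Run cmd on one file; if every attempt fails, move the file aside."""
    for _ in range(retries + 1):
        if subprocess.run([*cmd, str(path)]).returncode == 0:
            return True
    Path(quarantine_dir).mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(Path(quarantine_dir) / Path(path).name))
    return False


def run_all(cmd, paths, quarantine_dir, workers=4, retries=1):
    """Return {path: succeeded} after processing every file."""
    paths = list(paths)
    # The threads only wait on child processes, so the real work runs in
    # up to `workers` separate OS processes that share no memory.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: run_one(cmd, p, quarantine_dir, retries), paths)
        return dict(zip(map(str, paths), results))
```

Here `cmd` is the script invocation as a list, for example `["python", "process_one.py"]` with a hypothetical script name, and the file path is appended as the last argument. A non-zero exit code is treated as failure.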