Simple bash-parallel commands in python

One of the benefits of using a primitive system like collections of flat files for data storage is the ability to trivially do work in parallel on them through the shell. This seems to be a relatively common workflow in both computational and data science. A quick Google search on the topic reveals a number of people asking about this on StackOverflow and an assortment of tools including GNU Parallel at a base level and more sophisticated workflow tools like Fireworks to address similar issues. These tools are great and much more full featured than what I’m about to offer, however sometimes you want to stay within the Python ecosystem while not introducing too many dependencies.

For those reasons, the following short code snippet (and variants thereof) has become surprisingly common in some of my prototyping.

#Process level parallelism for shell commands
import glob
import subprocess as sp
import multiprocessing as mp

def work(in_file):
    """Defines the work unit on an input file"""
    sp.call(['program', '{}'.format(in_file), 'other', 'arguments'])
    return 0

if __name__ == '__main__':
    #Specify files to be worked with typical shell syntax and glob module
    file_path = './*.data'
    tasks = glob.glob(file_path)
    
    #Set up the parallel task pool to use all available processors
    count = mp.cpu_count()
    pool = mp.Pool(processes=count)

    #Run the jobs
    pool.map(work, tasks)

This simple example runs a fictional program called “program” on all the files in the specified path, with any ‘other’ ‘arguments’ you might want as separate items in a list. I like it because it’s trivial to modify and include in a more complex python workflow. Moreover it automatically manages the number of tasks/files assigned to the processors so you don’t have to worry too much about different files taking different amounts of time or weird bash hacks to take advantage of parallelism. It is, of course, up to the user to make sure this is used in cases where correctness won’t be affected by running in parallel, but exploiting simple parallelism is an trivial way to turn a week of computation into a day on a modern desktop with 12 cores. Hopefully someone else finds this useful, and if you have your own solution to this problem, feel free to share it here!