Multi-Threaded Processing Using xargs and Parallel

Summary:
Speed up scripts with parallel processing.


When working with command-line scripts, tasks like file processing, data manipulation, and network operations often run slowly when executed sequentially. Fortunately, simple but powerful tools like xargs and GNU parallel can help you fully utilize your CPU by running commands as multiple parallel processes. In this post, you'll learn how to harness parallel processing in your scripts using these handy utilities.


Why Parallel Processing?

Most computers today have multi-core CPUs. If your script processes files or data one-by-one, it’ll only use a single core, leaving the rest idle. Parallel processing splits work across multiple cores, dramatically reducing runtime for many workloads.

Example: Suppose each file takes 1 second to process. A single-threaded script working through 100 files takes 100 seconds; with 4 parallel workers on a quad-core CPU, the same batch could finish in about 25 seconds (assuming the work is CPU-bound and there are no I/O bottlenecks).
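
Not sure how many cores your machine has? The nproc utility from GNU coreutils prints the count, which is a handy value to plug into the -P and -j options shown later:

nproc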


Introduction to xargs

xargs builds and executes command lines from standard input. On its own it speeds things up by packing many arguments into fewer command invocations; combined with its -P option, it can also run those invocations in parallel.

Basic Usage

find . -type f -name '*.txt' | xargs grep "search_term"

This finds all .txt files and searches for "search_term" within them.
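
One caveat: filenames containing spaces or newlines will break this pipeline, because xargs splits its input on whitespace by default. A safer variant passes null-delimited names:

find . -type f -name '*.txt' -print0 | xargs -0 grep "search_term"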

Enabling Parallelism: -P Option

xargs can run multiple commands in parallel using -P:

cat files.txt | xargs -n 1 -P 4 my_script.sh
  • -n 1 tells xargs to pass one argument per command invocation (here, one entry from files.txt, assuming the entries contain no whitespace).
  • -P 4 tells xargs to run up to 4 processes at once, utilizing four CPU cores.
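
A common pattern is to match the process count to the machine instead of hard-coding it: $(nproc) substitutes the number of CPU cores, and GNU xargs also treats -P 0 as "run as many processes as possible".

cat files.txt | xargs -n 1 -P "$(nproc)" my_script.sh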

Example: Compressing multiple files:

ls *.log | xargs -n 1 -P 8 gzip

Up to eight gzip processes run at the same time; since compression is CPU-bound, this pays off most if you have at least eight cores.


Introducing GNU parallel

GNU parallel is a powerful alternative to xargs with a more intuitive syntax and advanced features. It can replace complicated loops and manage jobs efficiently.

Basic Usage

cat jobs.txt | parallel echo

This runs the echo command once for each line in jobs.txt; by default, parallel runs as many jobs concurrently as the machine has CPU cores.
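
If you want to see what parallel would run before letting it loose, the --dry-run option prints each generated command line instead of executing it:

cat jobs.txt | parallel --dry-run echo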

Limiting Number of Jobs

cat files.txt | parallel -j 4 my_script.sh

Run my_script.sh on each line from files.txt using 4 parallel jobs.
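
parallel also accepts relative job counts, which adapt to whatever machine the script runs on: -j 50% uses half the available cores, and -j +0 runs exactly one job per core.

cat files.txt | parallel -j 50% my_script.sh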

Example: Download multiple files in parallel

cat urls.txt | parallel -j 8 wget

This runs up to 8 downloads at a time, invoking wget once for each URL in urls.txt.
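
For long download batches it helps to see progress: GNU parallel's --bar option draws a progress bar (and --eta prints a time estimate), while wget -q keeps the per-file output quiet.

cat urls.txt | parallel -j 8 --bar wget -q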

Passing Multiple Arguments

Suppose your script accepts two arguments:

parallel my_script.sh {1} {2} ::: arg1a arg1b ::: arg2a arg2b

This will execute:

  • my_script.sh arg1a arg2a
  • my_script.sh arg1a arg2b
  • my_script.sh arg1b arg2a
  • my_script.sh arg1b arg2b
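
The ::: sources above are combined as a cross product (every value from the first source with every value from the second). If you instead want to pair arguments positionally, so that arg1a goes with arg2a and arg1b with arg2b, GNU parallel's --link option does that:

parallel --link my_script.sh {1} {2} ::: arg1a arg1b ::: arg2a arg2b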

xargs vs. parallel: Which to Use?

Feature                      xargs        GNU parallel
Installed by default         Often        Rarely
Parallel processing          Yes (-P)     Yes (default)
Advanced argument handling   Basic        Advanced
Output order control         No           Yes (--keep-order)
Job log/resume               No           Yes
  • xargs is great for simple, fast parallelism with default utilities.
  • parallel is better for complex scenarios and advanced output handling.
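
Two of the parallel-only features from the table in action: --keep-order prints results in input order even when jobs finish out of order, and --joblog together with --resume lets an interrupted batch pick up where it stopped (jobs.log is just an example log filename).

cat files.txt | parallel --keep-order -j 4 my_script.sh
cat files.txt | parallel --joblog jobs.log --resume -j 4 my_script.sh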

Real-World Examples

Example 1: Parallel Image Conversion

Convert all .png images to .jpg using 8 parallel jobs with xargs (xargs has no {.} placeholder for stripping the extension, so a small shell wrapper handles the rename):

ls *.png | xargs -P 8 -I {} sh -c 'convert "$1" "${1%.png}.jpg"' sh {}

With parallel:

ls *.png | parallel convert {} {.}.jpg
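
Note that parsing ls output is fragile when filenames contain spaces or newlines. A more robust variant (shown with parallel, which accepts null-delimited input via -0 just like xargs) feeds the names from find:

find . -maxdepth 1 -name '*.png' -print0 | parallel -0 convert {} {.}.jpg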

Example 2: Fast Log File Analysis

Process many log files in parallel, counting lines containing "ERROR":

ls *.log | xargs -n 1 -P 4 grep -c ERROR

Or with parallel:

ls *.log | parallel grep -c ERROR {}
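
In both versions each grep call sees a single file, so the output is a bare count with no indication of which log it describes. grep's -H flag adds the filename, and GNU parallel's --tag prefixes each output line with its input argument:

ls *.log | xargs -n 1 -P 4 grep -Hc ERROR
ls *.log | parallel --tag grep -c ERROR {}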

Example 3: Clearing Cache on Multiple Servers

cat servers.txt | parallel -j 10 ssh {} 'sudo systemctl restart memcached'

Restarts the memcached service on up to 10 servers at once.
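
If some hosts are slow or unreachable, it helps to bound the connection attempt and keep a record of which jobs succeeded: ConnectTimeout is a standard ssh option, and --joblog (restart.log is just an example filename) records each job's runtime and exit code.

cat servers.txt | parallel -j 10 --joblog restart.log ssh -o ConnectTimeout=5 {} 'sudo systemctl restart memcached'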


Tips for Safe Parallel Processing

  • Check for race conditions and resource conflicts if jobs might write to the same file.
  • Limit the number of jobs with -P (xargs) or -j (parallel) to avoid overloading your system.
  • Monitor CPU/Memory usage using htop or top during execution.
  • Ensure your script or command supports concurrent execution safely.
  • Test with small batches first to verify correctness before scaling up (see the example below).
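
For example, to follow the last tip, run the pipeline against a handful of inputs before committing to the full list (my_script.sh and files.txt are the placeholder names used earlier):

head -n 5 files.txt | parallel -j 2 my_script.sh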

Conclusion

Parallel processing with xargs or GNU parallel can give huge speed boosts to your scripts and workflows by utilizing all available CPU cores. For basic tasks, xargs -P is usually sufficient, while GNU parallel provides richer features for more complex workloads.

Start using parallel processing today to save time and get more done, faster, right from your command line!

