GNU Parallel (Chris)

GNU Parallel makes it easy to take advantage of multiple processors at once. I really enjoy using it to parallelize Bash scripts. For example, here’s a pipeline that generates a reference-based transcriptome from Illumina-sequenced ESTs:

#!/bin/bash
# ESTpipeline.sh #
prefix=$1

# align each mate, trimming low quality bases
bwa aln -t 2 -I -q 20 TranscriptomeRef_mar30.fa ${prefix}_1.fq >${prefix}_1.sai
bwa aln -t 2 -I -q 20 TranscriptomeRef_mar30.fa ${prefix}_2.fq >${prefix}_2.sai
bwa sampe TranscriptomeRef_mar30.fa ${prefix}_1.sai ${prefix}_2.sai ${prefix}_1.fq ${prefix}_2.fq >$prefix.sam

# sam to bam conversion
samtools view -u -t TranscriptomeRef_mar30.fa.fai -S $prefix.sam -o $prefix.bam
samtools sort $prefix.bam $prefix.sort
samtools index $prefix.sort.bam

# realign around indels
java -Xmx1g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R TranscriptomeRef_mar30.fa -o $prefix.realign.intervals -I $prefix.sort.bam > $prefix.realign1.log
java -Xmx1g -jar GenomeAnalysisTK.jar -T IndelRealigner -R TranscriptomeRef_mar30.fa -targetIntervals $prefix.realign.intervals -I $prefix.sort.bam -o $prefix.realign.bam > $prefix.realign2.log
# add read groups
java -jar AddOrReplaceReadGroups.jar I=$prefix.realign.bam O=$prefix.rehead.bam SORT_ORDER=coordinate RGID=$prefix RGLB=$prefix RGPL=illumina RGSM=4BASE_NAME RGPU=$prefix CREATE_INDEX=True VALIDATION_STRINGENCY=SILENT
# call genotypes at every site
java -jar GenomeAnalysisTK.jar -l INFO -R TranscriptomeRef_mar30.fa -T UnifiedGenotyper -I $prefix.rehead.bam -o $prefix.vcf --output_mode EMIT_ALL_SITES >$prefix.genotype.log

# vcf to diploid vertical alignment conversion
echo $prefix >$prefix.Valign
cat $prefix.vcf | ./vcf2vertical.pl >>$prefix.Valign

# remove intermediate files
rm ${prefix}_1.sai ${prefix}_2.sai $prefix.sam $prefix.bam $prefix.sort.bam $prefix.rehead.bam $prefix.realign.intervals $prefix.vcf

Notice the third line of the script, prefix=$1. The script takes a single parameter as input (a file name prefix) and reuses it again and again. Now, say we have a directory of Illumina ESTs represented as FastQ files:

[clhead] illumina $ ls *.fq
Ames449_1.fq ARG1834_1.fq btm17-4_1.fq btm25-2_1.fq btm34-6_1.fq
Ames449_2.fq ARG1834_2.fq btm17-4_2.fq btm25-2_2.fq btm34-6_2.fq
Ames695_1.fq arg2B-4_1.fq btm19-1_1.fq btm26-4_1.fq btm5-1_1.fq
Ames695_2.fq arg2B-4_2.fq btm19-1_2.fq btm26-4_2.fq btm5-1_2.fq
arg11B-11_1.fq arg4B-8_1.fq btm19-2_1.fq btm27-3_1.fq btm7B-14_1.fq
arg11B-11_2.fq arg4B-8_2.fq btm19-2_2.fq btm27-3_2.fq btm7B-14_2.fq
arg14B-7_1.fq arg6B-1_1.fq btm20-8_1.fq btm30-6_1.fq btm9-4_1.fq
arg14B-7_2.fq arg6B-1_2.fq btm20-8_2.fq btm30-6_2.fq btm9-4_2.fq
ARG1805_1.fq btm10-5_1.fq btm21-4_1.fq btm31-6_1.fq
ARG1805_2.fq btm10-5_2.fq btm21-4_2.fq btm31-6_2.fq
ARG1820_1.fq btm13-4_1.fq btm22-8_1.fq btm32-3_1.fq
ARG1820_2.fq btm13-4_2.fq btm22-8_2.fq btm32-3_2.fq
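
To process a single sample, the script is called with one prefix from this listing; for example, the first pair of files would be handled by:

./ESTpipeline.sh Ames449    # reads Ames449_1.fq and Ames449_2.fq and leaves Ames449.Valign behind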

If we make a list of the file name prefixes, we can easily process them all in parallel, taking advantage of all the cores available on the computer:

[crunch09] illumina $ ls *_1.fq | sed -e "s/_1.fq//" >prefix.list ; cat prefix.list
Ames449
Ames695
arg11B-11
arg14B-7
ARG1805
ARG1820
ARG1834
arg2B-4
arg4B-8
arg6B-1
btm10-5
btm13-4
btm17-4
btm19-1
btm19-2
btm20-8
btm21-4
btm22-8
btm25-2
btm26-4
btm27-3
btm30-6
btm31-6
btm32-3
btm34-6
btm5-1
btm7B-14
btm9-4

How many jobs should we launch? We can count the computer’s processors like this:

[crunch09] ~ $ cat /proc/cpuinfo | grep -c processor
16
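
If coreutils’ nproc or a reasonably recent GNU Parallel is installed, either one should report the same count directly:

nproc                        # processing units available to this shell
parallel --number-of-cores   # GNU Parallel’s own view of the CPU cores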

There are 16 processors on this node. We can check their load using top:

[crunch09] ~ $ top
top - 18:28:37 up 160 days, 12:01, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 276 total, 1 running, 274 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49433544k total, 34412240k used, 15021304k free, 178668k buffers
Swap: 4194296k total, 0k used, 4194296k free, 33565872k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22493 fitzjohn 16 0 30960 2448 1568 S 0.7 0.0 30:24.87 top
17259 grassa 15 0 30868 2308 1560 R 0.3 0.0 0:00.08 top
1 root 15 0 10348 704 592 S 0.0 0.0 0:02.21 init
2 root RT -5 0 0 0 S 0.0 0.0 0:03.35 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.07 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
5 root RT -5 0 0 0 S 0.0 0.0 0:00.78 migration/1
6 root 34 19 0 0 0 S 0.0 0.0 0:00.08 ksoftirqd/1
7 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
8 root RT -5 0 0 0 S 0.0 0.0 0:00.60 migration/2
9 root 34 19 0 0 0 S 0.0 0.0 0:00.11 ksoftirqd/2

They’re all free! Those processors should be generating information. We can now use GNU Parallel to run the Bash script on fourteen of the processors at a time (leaving two free):

cat prefix.list | parallel -j 14 ./ESTpipeline.sh

In this example, GNU Parallel feeds each line of standard input to the Bash script as its single parameter and keeps up to fourteen copies of the script running at once, one per processor.
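
If you want to check what will be launched before committing the cores, recent versions of Parallel have a --dry-run flag that prints each constructed command instead of running it (and, if -j is left out, Parallel defaults to one job per CPU core):

cat prefix.list | parallel --dry-run -j 14 ./ESTpipeline.sh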

3 thoughts on “GNU Parallel (Chris)”

  1. Awesome tool! So much better than my old standby of perl scripts with exec calls.

    One note of caution to those new to parallel processing (aka multithreading) – if your programs access the same resources, you might end up with unexpected results. This is called a race condition. Examples are if you run multiple programs that write to the same file or run programs that depend on each other’s output. Unless you do extra steps to ensure that the processes execute in the order you want, you might end up with bad output. If your programs don’t update common resources or depend on each other, then definitely use parallel processing to speed things up.
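
    For example, if every job appended to one shared file, output from different runs could interleave (the log file names here are just made up for illustration):

    cat prefix.list | parallel -j 14 './ESTpipeline.sh {} >> combined.log'   # risky: 14 jobs share one file
    cat prefix.list | parallel -j 14 './ESTpipeline.sh {} > {}.log'          # safer: one log per prefix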

    • This is a very important point! It’s the reason for wrapping the pipeline in a self-contained Bash script, each copy of which runs independently on its own core. One cool thing to note about Parallel is the -k flag, which will keep the output order the same as the input order for simple commands that take STDIN and write to STDOUT.

      The man page includes a nice collection of examples:
      http://www.gnu.org/software/parallel/man.html
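
      A quick way to see -k at work (the sleeps just force the jobs to finish in reverse order):

      seq 5 | parallel -j5 'sleep $(( 6 - {} )); echo {}'      # finishes and prints 5 4 3 2 1
      seq 5 | parallel -j5 -k 'sleep $(( 6 - {} )); echo {}'   # -k restores the input order: 1 2 3 4 5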
