Picard MarkDuplicates Troubleshooting

Picard is a java-based bioinformatics tools suite that handles SAM/BAM files.  Chris introduced me to it, and it’s pretty handy.  However, it is very particular about the SAM format, which can leave you searching the FAQs and forums for quite a while.

The MarkDuplicates tool looks for duplicate reads in your library and either flags them in your SAM/BAM file or deletes them, depending on your settings.  It can also output a metrics on duplication.

If you are doing any analysis which takes coverage into account, you will want to remove PCR duplicates from your libraries so that they do not artificially inflate coverage numbers.  The folks from Celera are adamant about this step, since it can help reduce complexity of genome assembly.

Reads are considered duplicates if they have the same orientation and their 5′ ends map to the same reference coordinate.  Picard does not care if reads are part of a pair and either pair end maps to separate sequence entries (e.g separate scaffolds or contigs or chromosomes) in your reference fastas file.  This makes it better than samtools rmdup, which is unable to handle read pairs that hit separate reference sequence entries.

Reads marked as duplicate will have the 0x400 bit set in SAM flag (column 2).  To check if your read has that bit set, use the Picard Flag Decoder  website.  When duplicate reads are deleted, the highest quality read is kept intact in the SAM/BAM.

Troubleshooting

Which reads are duplicated?

If you told MarkDuplicates to remove the duplicates by specifying REMOVE_DUPLICATES=true on the commandline, you will not see them in your processed BAM.  Otherwise, you can view your BAM file with -f on the commandline to filter for only the reads with 0x400 flag set, indicating that it is duplicated.

samtools view -f 0x400 <bam file>

Ignoring SAM validation error: ERROR: Record 1114712, Read name SOLEXA11:2:119:5967:4160#0, MAPQ should be 0 for unmapped read

Fix:  Picard is very picky about the SAM headers.  Samtools doesn’t really update all of the columns correctly for read-pair alignments where one end aligns and the other doesn’t.  Even running “samtools fixmate” doesn’t help.  You can tell Picard to either shut up using the commandline flag VALIDATION_STRINGENCY=SILENT or to log the errors and continue using the commandline flag VALIDATION_STRINGENCY=LENIENT.  I prefer the latter (lenient), since there might be real issues with your data that might come up later.  Simply check your logfiles that all the warning messages are expected.

Your BAM is sorted but Picard does not think so

If you used samtools sort to sort your BAM, it does not update the sort order (SO) header to indicate that bam file is sorted.  Specify the commandline flag ASSUME_SORTED = true to get rid of the errors.

How does Picard Count its Duplication Metrics?

Picard MarkDuplicate specifies the duplicates under 3 columns:  UNPAIRED_READ_DUPLICATES, READ_PAIR_DUPLICATES, READ_
PAIR_OPTICAL_DUPLICATES.  When it reports READ_PAIR_DUPLICATES, it only reports each set of read pairs.  Also, PAIR_OPTICAL_DUPLICATES is a subset of READ_PAIR_DUPLICATES.  Read Pair Duplicates consist of PCR duplicates, which can have different sequence and Optical duplicates, which have all have the same sequence.

The total number of duplicated reads, where reads refers to a single end read, or either end of a paired read, use this calculation:

Total Duplicated Reads = UNPAIRED_READ_DUPLICATES + 2 x READ_PAIR_DUPLICATE

Exception in thread “main” net.sf.picard.PicardException: Value was put into PairInfoMap more than once

Picard does not like it when the same exact alignment for one specific read is reported multiple times. Here, I am not referring to multiple reads aligning to the same locations, indicating that they are duplicates of each other, I am referring to the same exact alignment for one read reported multiple times. Did you merge multiple BAM files into one before feeding it into picard? You might have alignments for the exact same read reported twice in the merged alignment. Ensure that each alignment for a specific read end to a scaffold/contig/chromosome only occurs once.

Picard Can Not Find the Sequence Dictionary

Your SAM file is missing header information.  Create header information using the reference fasta. Simply passing the reference fasta file in the picard commandline won’t work.

 java -jar CreateSequenceDictionary.jar OUTPUT=header.sam REFERENCE=ref.fa

Then concatenate the header with your SAM file.  You might want to use samtools merge or picard merge if you have bam files.

cat header.sam file.sam >> file.header.sam