SNP calling III – The indel problem

SNPs can often be falsely called around indels. In particular, if indels are near the start or end of a read the read is often incorrectly aligned. Here is an example of how things can go wrong from a spruce alignment. Below, at site 109 there is a TT insertion in the reads relative to the reference and there is no true SNP at this site. However, this individual was called as a het (C/T) because of the reads that end in the middle of the insertion. For these reads the penalty for a single mismatch to the reference is less than introducing a gap.

Screen Shot 2013-08-15 at 10.13.34 AM

To illustrate this more clearly:

Reference sequence:  …GCATCA — CTAT…

Actual sequence:         …GCATCATTCTAT…

Reference sequence:   …GCATCAC

Poorly aligned reads:   …GCATCAT

GATK deals with this problem by realigning around indels. Briefly, the program identifies the best alternative consensus sequence for each region covering an indel. In my case when I implemented the indel realignment, only a small proportion of sites were changed (394,308 filtered SNPs overlapped; 406 unique to indel realignment and 384 were unique when there was no indel realignment). Although the number of SNPs affected is small, the false SNP call illustrated above was fixed.

Screen shot 2013-07-02 at 11.12.05 AM

All SNPs Filtered SNPs
bwa mem, mark duplicates, no indel realignment, BQ off (-BQ0) 2,988,217 394,692
bwa mem, mark duplicates, indel realignment, BQ off (-BQ0) 2,986,963 394,714

These are the command lines I used after I changed the read group headers and sorted and indexed the bam files:

#ID indels

java -Xmx15g -jar <GATK_PATH>/GenomeAnalysisTK.jar -T RealignerTargetCreator -R <REF> -nt 10 -o <your.bam.list> -I <input bam file>

#realign around indels

java -Xmx10g -jar <GATK_PATH>/GenomeAnalysisTK.jar -I <input bam file> -R <REF> -T IndelRealigner -targetIntervals <your.bam.list> -o <realigned bam file>

Note that you can use a database of known indels to speed the TargetCreator (add –known INDELS.vcf ). If you are using a non-model organism you can use a vcf file from a preliminary run. Also, if you are using the GATK HaplotypeCaller you do not need to realign around indels.

Mpileup documentation states that it deals with the indel problem by implementing a BAQ filter. The program assigns each base a BAQ score (Phred-scaled probability of the base being misaligned). A BAQ score will be low if the base is aligned to a different reference base in a suboptimal alignment and will contribute little to SNP calling even if the base quality is high. With the BAQ filter on, almost all of these potentially miscalled SNPs are removed (311 out of 384 were removed including the example illustrated above), along with many more (compare the table below to the table above). It seems that BAQ filters many more SNPs that are not falsely called due to indels, but they have alternative placement in the suboptimal alignments due to some other reason. Some people feel that BAQ is conservative and misses known SNPs (e.g. http://sourceforge.net/p/varscan/discussion/1073559/thread/6a7f53c2). BAQ can also be implemented in GATK for the UnifiedGenotyper, although this is not the default setting.

All SNPs Filtered SNPs
bwa mem, mark duplicates, no indel realignment, BQ on (default) 1,916,742 290,898
bwa mem, mark duplicates, indel realignment, BQ on (default) 1,916,727 290,894

Picard fixmate:

This apparently was once a part of the workflow after indel realignment in GATK. However, it does not appear to be necessary anymore (http://gatkforums.broadinstitute.org/discussion/1645/is-picard-fixmate-necessary). You can see the number of SNPs with and without the fixmate is the same.

All SNPs Filtered SNPs
bwa mem, mark duplicates, indel realignment, picard fixmate 2,987,117 394,687
bwa mem, mark duplicates, indel realignment, no picard fixmate 2,987,117 394,687