SNPs can often be falsely called around indels. In particular, if indels are near the start or end of a read the read is often incorrectly aligned. Here is an example of how things can go wrong from a spruce alignment. Below, at site 109 there is a TT insertion in the reads relative to the reference and there is no true SNP at this site. However, this individual was called as a het (C/T) because of the reads that end in the middle of the insertion. For these reads the penalty for a single mismatch to the reference is less than introducing a gap.
To illustrate this more clearly:
Reference sequence: …GCATCA — CTAT…
Actual sequence: …GCATCATTCTAT…
Reference sequence: …GCATCAC
Poorly aligned reads: …GCATCAT
GATK deals with this problem by realigning around indels. Briefly, the program identifies the best alternative consensus sequence for each region covering an indel. In my case when I implemented the indel realignment, only a small proportion of sites were changed (394,308 filtered SNPs overlapped; 406 unique to indel realignment and 384 were unique when there was no indel realignment). Although the number of SNPs affected is small, the false SNP call illustrated above was fixed.
All SNPs | Filtered SNPs | |
bwa mem, mark duplicates, no indel realignment, BQ off (-BQ0) | 2,988,217 | 394,692 |
bwa mem, mark duplicates, indel realignment, BQ off (-BQ0) | 2,986,963 | 394,714 |
These are the command lines I used after I changed the read group headers and sorted and indexed the bam files:
#ID indels
java -Xmx15g -jar <GATK_PATH>/GenomeAnalysisTK.jar -T RealignerTargetCreator -R <REF> -nt 10 -o <your.bam.list> -I <input bam file>
#realign around indels
java -Xmx10g -jar <GATK_PATH>/GenomeAnalysisTK.jar -I <input bam file> -R <REF> -T IndelRealigner -targetIntervals <your.bam.list> -o <realigned bam file>
Note that you can use a database of known indels to speed the TargetCreator (add –known INDELS.vcf ). If you are using a non-model organism you can use a vcf file from a preliminary run. Also, if you are using the GATK HaplotypeCaller you do not need to realign around indels.
Mpileup documentation states that it deals with the indel problem by implementing a BAQ filter. The program assigns each base a BAQ score (Phred-scaled probability of the base being misaligned). A BAQ score will be low if the base is aligned to a different reference base in a suboptimal alignment and will contribute little to SNP calling even if the base quality is high. With the BAQ filter on, almost all of these potentially miscalled SNPs are removed (311 out of 384 were removed including the example illustrated above), along with many more (compare the table below to the table above). It seems that BAQ filters many more SNPs that are not falsely called due to indels, but they have alternative placement in the suboptimal alignments due to some other reason. Some people feel that BAQ is conservative and misses known SNPs (e.g. http://sourceforge.net/p/varscan/discussion/1073559/thread/6a7f53c2). BAQ can also be implemented in GATK for the UnifiedGenotyper, although this is not the default setting.
All SNPs | Filtered SNPs | |
bwa mem, mark duplicates, no indel realignment, BQ on (default) | 1,916,742 | 290,898 |
bwa mem, mark duplicates, indel realignment, BQ on (default) | 1,916,727 | 290,894 |
Picard fixmate:
This apparently was once a part of the workflow after indel realignment in GATK. However, it does not appear to be necessary anymore (http://gatkforums.broadinstitute.org/discussion/1645/is-picard-fixmate-necessary). You can see the number of SNPs with and without the fixmate is the same.
All SNPs | Filtered SNPs | |
bwa mem, mark duplicates, indel realignment, picard fixmate | 2,987,117 | 394,687 |
bwa mem, mark duplicates, indel realignment, no picard fixmate | 2,987,117 | 394,687 |