SNP calling IV – Base quality score recalibration

The GATK Best Practices from the Broad Institute strongly recommend base quality score recalibration (BQSR; http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr). The base quality scores reported by the sequencer are often inaccurate, which can give false confidence in base calls and, in turn, bias SNP calling. The tool corrects the overall average quality and also adjusts scores by machine cycle (i.e. position in the read, start versus end) and by sequence context (the preceding dinucleotide). After recalibration, each quality score more closely matches the observed probability that the base mismatches the reference genome.
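To make the link between a quality score and a probability concrete, here is a minimal sketch of the standard Phred relationship, P = 10^(-Q/10). The function name phred_to_prob is my own and is not part of GATK; it only illustrates what a score claims, not how the recalibration itself works.

```shell
#!/bin/sh
# Convert a Phred quality score Q into the error probability it implies,
# using the standard definition P = 10^(-Q/10).
# phred_to_prob is a hypothetical helper name, not a GATK command.
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Show the implied error rate for a few common quality scores.
for q in 10 20 30 40; do
    printf 'Q%s -> %s\n' "$q" "$(phred_to_prob "$q")"
done
```

A base reported as Q30 claims an error rate of 1 in 1,000. If the machine's scores are inflated, the real mismatch rate at such bases will be higher, and that gap is what recalibration tries to close.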

To run it you need a set of known SNPs: the tool skips these sites and treats mismatches at all remaining sites as sequencing errors, which lets it estimate the error rate from the actual sequence data. Unlike humans, for which this tool was designed, most species do not have comprehensive SNP databases. However, such a set can be created by calling SNPs in a preliminary run without base quality score recalibration. I included every SNP with a QUAL score of at least 20 in my known-sites VCF file.
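One way to build such a known-sites file is to filter the preliminary VCF on its QUAL column. The sketch below assumes a standard tab-delimited VCF with QUAL in column 6; the function name and the file names in the usage comment are hypothetical, and this is not necessarily how I produced my own file.

```shell
#!/bin/sh
# Keep VCF header lines plus records whose QUAL (column 6) is at least
# the given threshold. Records with a missing QUAL (".") are dropped.
# filter_vcf_by_qual is a hypothetical helper name.
filter_vcf_by_qual() {
    min_qual="$1"
    awk -F '\t' -v min="$min_qual" '
        /^#/ { print; next }                       # pass header lines through
        $6 != "." && $6 + 0 >= min + 0 { print }   # numeric QUAL comparison
    '
}

# Usage (hypothetical file names):
# filter_vcf_by_qual 20 < preliminary_snps.vcf > known_snps.vcf
```

The function reads from standard input and writes to standard output, so it can sit in a pipeline after whatever SNP caller produced the preliminary calls.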

As the table below shows, recalibration had a large effect on the number of SNPs I called. With the BAQ filter off, 263,360 SNPs were shared between the two methods, 14,591 were unique to the recalibrated alignments, and 131,354 were unique to the alignments without base quality recalibration.

Pipeline                                                                              All SNPs     Filtered SNPs
bwa mem, mark duplicates, indel realignment, base recalibration, BAQ off (-BQ0)       1,619,567    277,951
bwa mem, mark duplicates, indel realignment, no base recalibration, BAQ off (-BQ0)    2,987,117    394,687
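Shared and unique counts like the ones above can be obtained by comparing the site lists of two VCF files. The following is a rough sketch, assuming sites are matched on chromosome and position only (alleles are ignored); the function name is hypothetical, and the matching rule may differ from the one behind the numbers in this post.

```shell
#!/bin/sh
# Print "shared<TAB>only_in_first<TAB>only_in_second" site counts for two VCFs.
# Sites are keyed on CHROM and POS only; this is an assumption of the sketch.
# vcf_site_overlap is a hypothetical helper name.
vcf_site_overlap() {
    a_sites="$(mktemp)"
    b_sites="$(mktemp)"
    # Drop header lines, keep CHROM and POS, and sort for comm.
    grep -v '^#' "$1" | cut -f 1,2 | LC_ALL=C sort -u > "$a_sites"
    grep -v '^#' "$2" | cut -f 1,2 | LC_ALL=C sort -u > "$b_sites"
    shared="$(LC_ALL=C comm -12 "$a_sites" "$b_sites" | wc -l)"
    only_a="$(LC_ALL=C comm -23 "$a_sites" "$b_sites" | wc -l)"
    only_b="$(LC_ALL=C comm -13 "$a_sites" "$b_sites" | wc -l)"
    rm -f "$a_sites" "$b_sites"
    printf '%d\t%d\t%d\n' "$shared" "$only_a" "$only_b"
}

# Usage (hypothetical file names):
# vcf_site_overlap recalibrated.filtered.vcf not_recalibrated.filtered.vcf
```

Both site lists are sorted under the same locale (LC_ALL=C) because comm requires its inputs to be sorted consistently.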

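Here are the command lines I used to run the base quality score recalibration. It is a two-step process: BaseRecalibrator first builds a recalibration table, and PrintReads then applies that table to write a recalibrated BAM file: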

java -Xmx15g -Djava.io.tmpdir=<tmpdir> -jar <GATK_PATH>/GenomeAnalysisTK.jar -T BaseRecalibrator -I <your bam file> -R <REF> -knownSites <your SNPs.vcf> -o <your.table>

java -Djava.io.tmpdir=<tmpdir> -jar <GATK_PATH>/GenomeAnalysisTK.jar -T PrintReads -R <REF> -I <your bam file> -BQSR <your.table> -o <your recalibrated bam file>
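Because the second command depends on the table written by the first, it can help to wrap both in one function that stops if the first step fails. This is only a sketch built from the two commands above: the function name, the argument order, the naming of the table file, and the JAVA override variable are my own assumptions.

```shell
#!/bin/sh
# Run the two BQSR steps in order, stopping if the first one fails.
# run_bqsr and its argument order are hypothetical; the GATK flags are the
# ones used in the two commands above.
# Set JAVA=echo to print the commands instead of running them.
run_bqsr() {
    gatk_jar="$1"; ref="$2"; in_bam="$3"; known_vcf="$4"; out_bam="$5"
    # Assumed naming scheme: put the table next to the output BAM.
    table="${out_bam%.bam}.recal.table"

    # Step 1: build the recalibration table from the original alignments.
    "${JAVA:-java}" -Xmx15g -Djava.io.tmpdir="${TMPDIR:-/tmp}" -jar "$gatk_jar" \
        -T BaseRecalibrator -I "$in_bam" -R "$ref" \
        -knownSites "$known_vcf" -o "$table" || return 1

    # Step 2: apply the table and write the recalibrated BAM.
    "${JAVA:-java}" -Djava.io.tmpdir="${TMPDIR:-/tmp}" -jar "$gatk_jar" \
        -T PrintReads -R "$ref" -I "$in_bam" -BQSR "$table" -o "$out_bam"
}

# Usage (hypothetical paths):
# run_bqsr /path/to/GenomeAnalysisTK.jar ref.fa sample.bam known_snps.vcf sample.recal.bam
```

Running it once with JAVA=echo is a cheap way to check the assembled command lines before committing to a long job.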

BAQ filtering seems to remove many of the problems associated with indels, but it does not address inaccurate base quality scores. If you want more conservative SNP calls with fewer false positives, you may want to apply both BAQ and base recalibration, with the understanding that you will likely miss some true SNPs.