SNP calling I – alignment programs and PCR duplicates

I have tested various SNP calling methods using exome re-sequencing data from 12 interior spruce samples. I tried Bowtie2, BWA (mem), Picard (mark duplicates) and GATK for indel realignment and base quality recalibration. For SNP calling I used mpileup with and without BAQ as well as the Unified Genotyper from GATK. For an interesting and informative workshop outlining the Broad best practices SNP calling pipeline check out these youtube videos (http://www.youtube.com/watch?v=1m0ZiEvzDKI&list=PLlMMtlgw6qNgNKNv5V9qmjAxbkHAZS1Mf). My results are in a series of blog posts and I hope you find them useful. Please let me know if you have any suggestions for SNP calling. We only want to do the alignments and SNP calling once for the entire set of samples, because it is going to take a long time!

Bowtie2 versus BWA mem:

BWA mem (http://bio-bwa.sourceforge.net/bwa.shtml) and Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) produced many different SNPs. In part this is likely due to differences in the default settings (there were 227,918 SNPs that overlapped; 262,690 unique in BWA and 10,692 unique in Bowtie2). I noticed that some of the SNPs that were found in BWA and absent in Bowtie2 were the result of soft clipping in BWA, where the program clips off the ends of the reads if they do not match the reference. Therefore I reran the program to allow for local alignment in Bowtie2. This increased the SNP number for the Bowtie2 alignments, but did not account for all of the difference (there were 268,620 SNPs that overlapped; 221,988 unique in BWA and 18,338 unique in Bowtie2). One major difference between the programs was the alignment time, as Bowtie2 took about two times longer than BWA mem.

All SNPs Filtered SNPs Command line
bwa mem basic 3,530,404 490,608 bwa mem -t 2 -M <REF>  R1.fastq  R2.fastq   >  out.sam
bowtie2 basic sensitive 1,249,719 238,610 bowtie2 –sensitive -p 2 -x <REF> -1  R1.fastq -2  R2.fastq >  out.sam
bowtie2 local sensitive 1,551,362 286,958 bowtie2 –local –sensitive -p 2 -x <REF> -1  R1.fastq -2  R2.fastq >  out.sam

Note: if you want to use Picard downstream you must use the –M option for BWA mem.

We will likely use BWA mem for aligning our data for three reasons: 1) time savings; 2) the larger number of SNPs identified with BWA; and 3) BWA mem can use a reference the size of our full genome (~20Gb) while Bowtie2 limits the size of the reference to about 3Gb.

The PCR duplicate problem:

PCR duplicates are problematic for calling SNPs and genotypes. This is because a replication error created during PCR can be amplified and be called as a SNP even thought it is just a PCR artifact. PCR amplification can also bias the number of copies for heterozygous individuals so they may appear homozygous when they are not. To avoid this problem you should mark your duplicates with Picard, unless your sequencing method does not allow for it. With Picard duplicates are identified by the start and stop position in the reference as well as the CIGAR string. Genotype by sequencing (GBS) based on restriction enzymes will often have the same start and stop for each locus regardless of whether they are PCR duplicates or unique copies, so do not mark duplicates if you have GBS data. Once you mark the duplicates, the genotyping programs (mpileup and UnifiedGenotyper) will ignore duplicated reads. Alternatively, you could remove them yourself from the bam files (e.g. samtools view -F 0x400 <your.bam>).

In the table below, the number of SNPs that were called after removing PCR duplicates was much smaller because of the high level of PCR duplication inherent in this sequence capture protocol as two rounds of PCR are required. The spruce compared to pine (not shown) seemed to be particularly prone to this problem for some reason, despite the use of a relatively low number of PCR cycles during library prep.

Below are the numbers of SNPs recovered for Bowtie2 and BWA mem before and after marking the duplicate reads.

All SNPs Filtered SNPs
bwa mem basic 3,530,404 490,608
bwa mem marked dup 1,916,742 290,898
bowtie2 basic 1,249,719 238,610
bowtie2 marked dup 812,857 157,955
bowtie2 local 1,551,362 286,958
bowtie2 local marked dup 1,133,245 195,121

This is the command line I used after indexing and sorting the bam files (note I used a different tmp directory than the default to get it to work). I had to play around with the options below to get it to run because of memory requirements:

java -Xmx10g -XX:MaxPermSize=1g -Djava.io.tmpdir=<tmpdir> -jar <PICARD_PATH>/MarkDuplicates.jar INPUT=<your alignment>.sort.bam OUTPUT=<your marked alignment>.bam METRICS_FILE=<out>.metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT AS=true MAX_RECORDS_IN_RAM=500000000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000