Using GATK and Picard was difficult because we had to keep modifying our reference (and therefore redo the alignments each time) to use these programs efficiently. Below I list the things we had to do to the reference to get them to work:
1. Picard was unable to create a dictionary for our large conifer genome (~20Gb). Consequently, we needed to reduce the size of our reference to about 3Gb. We did this by using genome contigs that had blast hits and GMAP alignments against our target sequences for sequence capture. We also included genome contigs that had SNP calls using a preliminary run of 30 individuals against the entire genome.
2. GATK also does not like characters that are not A, T C or G. We replaced all N’s etc. with As.
3. Finally, when running the TargetCreator for the indel realignment (see SNP calling III – The indel problem) it will go much faster if you have a small number of reference contigs (http://gatkforums.broadinstitute.org/discussion/comment/7466#Comment_746). The indel realignment was taking 5 days per individual. When I created pseudo scaffolds by combining the ~800,000 contigs of my reduced reference into 1000 pseudo scaffolds (each genomic contig separated by 30 As, I have perl scripts for this if you are interested) each individual only took about 25 min.
I compared the pseudo scaffold SNP calls to those made against the original contigs. Almost all were the same (overlap 391,856). However, 2833 were only found in the scaffold and 2836 were only found in the contigs. I am looking into the cause of these differences, but it is likely mainly due to slight differences in some of the alignments that are a result of creating the pseudo scaffolds.
|Reference type||All SNPs||Filtered SNPs|
|bwa mem, mark duplicates, indel realignment, BQ0||reduced reference – genomic contigs||2,988,217||394,692|
|bwa mem, mark duplicates, indel realignment, BQ0||reduced reference – pseudo scaffolds||2,988,421||394,690|