SNP calling V – SNP and genotype calling

The next step in the pipeline after the alignments are conducted and adjusted is to call the SNPs and genotypes. There are many programs that will do this such as the ML script that many people in the Rieseberg lab use, freeBayes, mpileup and bcftools, as well as the GATK UnifiedGenotyper. To bench mark the alignment steps in these SNP calling blog posts I used mpileup and bcftools. For example:

samtools mpileup  -u -D -f  <your reference>  *.bam | bcftools view -bvcg – > var.raw.bcf

bcftools view var.raw.bcf > var.flt.vcf

When genotyping 300 individuals, mpileup was taking a very very long time. As there is no option to use multiple processors you can run mpileup on each contig separately and then combine the vcf files at the end. That way you can run several at once depending on the number of processors you have using parallel and the short shell script below.

cat <list of your contig names> | parallel –j<processor number> mpileup.sh

mpileup.sh:

samtools mpileup  -u -D –r $1 -f  <your reference>  *.bam | bcftools view -bvcg – > var.raw.bcf

The UnifiedGenotyper does have an option for using multiple processors:

java -jar <GATK_PATH>/GenomeAnalysisTK.jar -R <REF>   -T UnifiedGenotyper  -I <your file1>.bam  -I <your file2>.bam  (keep adding all your sorted and indexed bam files in this way)  -o <out>.vcf  -ntc 10

In a comparison of the UnifiedGenotyper and mpileup, I found more SNPs were called with mpileup and 289,236 of the filtered SNPs overlapped:

All SNP Filtered SNPs
UnifiedGenotyper (bwa mem, mark duplicates, indel realignment, BQ off (default)) 2,983,495 331,320
mpileup/bcftools (bwa mem, mark duplicates, indel realignment, BQ off (-BQ0)) 2,987,196 394,724