Previously I posted a pipeline for processing GBS data. At the end of the pipeline, there was a step where the loci were filtered for coverage and heterozygosity. How strict that filtering was could be changed, but it wasn’t obvious in the script and it was slow to rerun. I wrote a new script that takes a raw SNP table (with all sites, including ones with only one individual called) and calculates the stats for each loci. It is then simple and fast to use awk to filter this table for your individual needs. The only filtering the base script does is not print out sites that have four different bases called at the site.
A SNP table with a header and the first two columns are contig and position. No spaces between both base calls. No data can be NN or -:
contig \t pos \t name1 \t name2 \t…
sc121 \t 232 \t TA \t TT \t…
Usage: perl snp_label_rows.pl SNP_table > Labelled_SNP_table
The labelled table is tab separated. It starts with CHROM, then POS (same as contig and pos), then coverage (percent samples with a base call), Heterozygosity (observed heterozygosity), Major (major allele frequency), Minor1 (frequency of the second most frequent base), Minor2 (frequency of the third most frequent base), Triallelic? (Yes or No if there is a third base), then each of your samples with their base calls.
CHROM \t POS \t Coverage \t Heterozygosity \t Major \t Minor1 \t Minor2 \t TriAllelic? \t Name1 \t Name2 \t…
You can use awk to filter this table. For example, if I wanted only biallelic snps with >95% coverage, <50% Heterozygosity I would use:
awk ‘($3 > .95) && ($4 < .5) && ($6 > .05) && ($7 == 0) || ($1 == “CHROM”)’ $full_name/$full_name.allsites > $full_name/$full_name.biallele
$3 is coverage, $4 is heterozygosity, $6 is minor allele1, $7 is minor allele2. $1 == “CHROM” ensures the header row is kept.
After filtering with awk, if you want to use the snp table you should probably remove all the stats, which can be done using cut:
cut -f 1,2,9- $full_name/$full_name.biallele > $full_name/$full_name.biallele.snp