AWK unlinked SNP subsampling

For some analyses, such as STRUCTURE, you often want unlinked SNPs. My GBS data ended up with anywhere from 1 to 10 loci on each contig, and I wanted to subsample that down to one randomly chosen locus per contig. It took me a while to figure out how to do this, so here is the one-liner for everyone to use:

cat YOUR_SNP_TABLE | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | awk '! ( $1 in a ) {a[$1] ; print}' | sort > SUBSAMPLED_SNP_TABLE

It takes your SNP table, shuffles the rows using perl, keeps only the first row awk sees for each value in the first column (so one random row per contig), then sorts the result back into order. For my data, the first column is CHROM for the header row and scaffold###### for the subsequent rows, so the sort places the CHROM row back on top. It might not for yours if you have different labels.

One thought on “AWK unlinked SNP subsampling”

  1. A note for people trying to use this: I copied and pasted the command into the terminal and it gave a syntax error. After some fiddling, I realized that the apostrophes had been converted to curly quotes somewhere in the copying process. They need to be straight single quotes (the plain ASCII apostrophe, '), not curly ones (‘ and ’), because the shell only treats the straight kind as quoting characters.
