AWK unlinked SNP subsampling

For some analyses, like STRUCTURE, you often want unlinked SNPs. For my GBS data I ended up with anywhere from 1 to 10 loci on each contig, which I wanted to subsample down to just one random locus per contig. It took me a while to figure out how to do this, so here is the script for everyone to use:

cat YOUR_SNP_TABLE | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | awk '! ( $1 in a ) {a[$1] ; print}' | sort > SUBSAMPLED_SNP_TABLE

It takes your SNP table, shuffles the rows using perl, keeps only the first row it meets for each contig using awk (after the shuffle, that row is a random one), then sorts the result back into order. For my data, the first column is CHROM for the header row and then scaffold###### for the subsequent rows, so the sort places the CHROM row back on top. It might not for yours if you have different labels.
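If your header label would not sort to the top, one option is to set the header aside before shuffling. This is only a minimal sketch: it assumes exactly one header line and the contig name in the first column. The function name `subsample_one_per_contig` and the toy table are made up for illustration. The shuffle here is done with awk's `rand()` and a numeric sort rather than the perl call above, purely to keep the sketch to standard tools.

```shell
# Hypothetical helper: print the header, then one random row per contig.
# Assumes exactly one header line and the contig name in column 1.
subsample_one_per_contig() {
    # Emit the header line untouched.
    head -n 1 "$1"
    # Prefix each data row with a random key, sort on that key (a shuffle),
    # strip the key, keep the first row seen per contig, then re-sort.
    tail -n +2 "$1" \
        | awk 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' \
        | sort -k1,1n \
        | cut -f2- \
        | awk '!($1 in seen) { seen[$1]; print }' \
        | sort -k1,1
}

# Tiny made-up table to try it on.
demo=$(mktemp)
printf '%s\n' \
    'CHROM POS S1' \
    'scaffold000002 15 A' \
    'scaffold000001 10 C' \
    'scaffold000001 42 G' \
    'scaffold000002 77 T' \
    'scaffold000003 5 A' > "$demo"

subsample_one_per_contig "$demo" > "$demo.sub"
cat "$demo.sub"
```

Note that `srand()` with no argument seeds from the clock, so two runs within the same second will pick the same rows.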

AWK (Seb)

What is awk?

“AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.” – Alfred V. Aho
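To make the pattern-action idea concrete, here is a small sketch. The three-line input is made up for illustration. Each rule is a pattern followed by an action in braces, and `$1` and `$2` are the first and second fields of the current line.

```shell
# Made-up input: contig name, position, allele.
input='scaffold000001 10 C
scaffold000002 15 A
scaffold000001 42 G'

# Rule 1: the pattern $2 > 12 selects lines whose second field exceeds 12;
#         the action prints the first and second fields of those lines.
# Rule 2: END matches once after the last line; the action prints NR,
#         the number of records read.
result=$(printf '%s\n' "$input" | awk '
    $2 > 12 { print $1, $2 }
    END     { print NR " records read" }
')
printf '%s\n' "$result"
```

This prints the two rows with a position above 12, followed by "3 records read".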

Why awk?

1. AWK is simpler to use than most conventional programming languages.
2. It is fast.
3. It has string manipulation functions, so it can search for particular strings and modify the output.
4. A version of the AWK language is a standard feature of nearly every modern Unix-like operating system available today.
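As a small illustration of point 3, the sketch below uses three built-in string functions, `sub`, `toupper` and `length`, on a made-up line. The label rewrite from "scaffold" to "contig" is an arbitrary example.

```shell
# Made-up input line: a contig label and a short sequence.
line='scaffold000001 acgt'

out=$(printf '%s\n' "$line" | awk '{
    sub(/^scaffold0*/, "contig", $1)   # rewrite the label in field 1
    print $1, toupper($2), length($2)  # uppercase field 2 and report its length
}')
printf '%s\n' "$out"   # contig1 ACGT 4
```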

Simple examples on how to use AWK:

Filtering unmapped/unaligned reads from SAM files (Rose)

This is a post about some time-saving help Chris Grassa gave me.

STACKS (post coming soon) doesn’t deal well with all of the unaligned reads in SAM files, so I tried using PICARD to remove them. PICARD doesn’t like the SAM output of BWA, however, so Chris G showed me how to use the Unix command awk to do it much more easily. This is his command for my file 1076.sam:
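Chris's exact command is not reproduced in this excerpt. As a rough sketch of the general approach only, the following keeps SAM header lines (those starting with `@`) and drops any read whose FLAG field (column 2) has the 0x4 "segment unmapped" bit set. The function name `filter_mapped` and the toy records are made up for illustration.

```shell
# Hypothetical helper, not the command from the original post.
filter_mapped() {
    # Header lines start with "@" and are always kept.
    # For alignment lines, int($2 / 4) % 2 is 1 exactly when bit 0x4
    # of the FLAG is set, i.e. the read is unmapped; keep the rest.
    awk '/^@/ || int($2 / 4) % 2 == 0' "$1"
}

# Tiny made-up SAM file: two header lines, two mapped and two unmapped reads.
sam=$(mktemp)
{
    printf '@HD\tVN:1.0\tSO:unsorted\n'
    printf '@SQ\tSN:scaffold000001\tLN:1000\n'
    printf 'read1\t0\tscaffold000001\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
    printf 'read3\t16\tscaffold000001\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'read4\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
} > "$sam"

filter_mapped "$sam" > "$sam.mapped"
cat "$sam.mapped"
```

On a real file this would be called as `filter_mapped 1076.sam > 1076.mapped.sam`, where the output name is just a placeholder.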