Previously I posted a pipeline for processing GBS data. At the end of the pipeline, there was a step where the loci were filtered for coverage and heterozygosity. The strictness of that filtering could be changed, but how to do so wasn’t obvious in the script, and it was slow to rerun. I wrote a new script that takes a raw SNP table (with all sites, including ones with only one individual called) and calculates the stats for each locus. It is then simple and fast to use awk to filter this table for your individual needs. The only filtering the base script does is to skip sites that have four different bases called.
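To give a feel for what that awk filtering step might look like, here is a minimal sketch. The column layout (scaffold, position, number of individuals called, observed heterozygosity) and the cutoffs are assumptions made up for illustration; check the header of the real stats table and adjust the field numbers to match.

```shell
# Hypothetical stats table: the column names and values are assumptions
# for illustration, not the actual output format of the script.
stats_table=$(mktemp)
printf 'scaffold\tpos\tn_called\tobs_het\n' >  "$stats_table"
printf 'scaf1\t101\t45\t0.20\n'             >> "$stats_table"
printf 'scaf1\t250\t3\t0.00\n'              >> "$stats_table"
printf 'scaf2\t77\t60\t0.85\n'              >> "$stats_table"
printf 'scaf2\t912\t52\t0.35\n'             >> "$stats_table"

# Keep the header line (NR == 1) plus every site that was called in at
# least 40 individuals and has observed heterozygosity of at most 0.6.
filtered=$(awk -F'\t' 'NR == 1 || ($3 >= 40 && $4 <= 0.6)' "$stats_table")
printf '%s\n' "$filtered"

rm -f "$stats_table"
```

Changing how strict the filter is then only means editing the two numbers in the awk condition and rerunning the one-liner, which takes seconds even on a large table.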
What is awk?
“AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.” – Alfred V. Aho
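The description above can be made concrete with a tiny sketch. The input words and the two patterns are invented for illustration; the point is only that every line is tested against every pattern, so one line can trigger several actions or none.

```shell
# Three records (lines), each split into whitespace-separated fields.
demo_input='alpha 1
beta 20
gamma 3'

# An awk program with two pattern-action statements. Each input line is
# checked against both patterns, in the order they are written.
pattern_demo=$(printf '%s\n' "$demo_input" | awk '
    $2 > 2 { print "big", $1 }            # second field is greater than 2
    /^b/   { print "starts-with-b", $1 }  # line begins with the letter b
')
printf '%s\n' "$pattern_demo"
```

Here "alpha 1" matches neither pattern and prints nothing, "beta 20" matches both and prints two lines, and "gamma 3" matches only the first.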
1. AWK is simpler to use than most conventional programming languages.
2. It is fast.
3. It has string manipulation functions, so it can search for particular strings and modify the output.
4. A version of the AWK language is a standard feature of nearly every modern Unix-like operating system available today.
Simple examples on how to use AWK:
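As a starting point, here are three common one-liners, run on a small made-up comma-separated file. The file contents and variable names are placeholders for illustration only.

```shell
# A small made-up CSV file: name, count, status.
sample=$(mktemp)
cat > "$sample" <<'EOF'
sampleA,12,pass
sampleB,7,fail
sampleC,30,pass
EOF

# 1. Print the first and third fields of every line (-F sets the separator).
cols=$(awk -F',' '{ print $1, $3 }' "$sample")

# 2. Add up the second field over all lines and print the total at the end.
total=$(awk -F',' '{ sum += $2 } END { print sum }' "$sample")

# 3. Use a string function: print the name in upper case, but only for
#    lines whose third field is exactly "pass".
upper=$(awk -F',' '$3 == "pass" { print toupper($1) }' "$sample")

printf '%s\n' "$cols" "$total" "$upper"
rm -f "$sample"
```

The same three shapes (pick columns, accumulate a value, select lines on a condition) cover a large share of everyday table filtering.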
I hope we can start a conversation about this because a good text editor can make a big difference to a newbie, so PLEASE REPLY!!! I wanted to proselytise about Notepad++ (Npp), but it only runs on Windows. So if you use a different OS, please make that BLEEDINGLY OBVIOUS.
Notepad++ (WINDOWS)
I’ve tried numerous text editors over the years (like Context), but Notepad++ (Npp) is easily my favourite. It only runs on Windows, but I routinely use it to save files with Unix line endings. You can set shortcut keys to change formats very easily. Npp can highlight lots of languages, including R, Perl and Unix shell scripts. You can also define your own languages for highlighting – I did that to make my Migrate parameter files easier to read.
This is a useful tool for anyone who uses ssh. It allows you to close your terminal and still have your processes keep running, which is very useful if, for example, you are working from a laptop. You can check in on the process from anywhere afterwards.