AWK (Seb)

What is awk?

“AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.” – Alfred V. Aho
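
As a minimal sketch of the record/field/pattern-action model described in the quote (the three-line sample input is made up for illustration):

```shell
# Made-up sample input: three records (lines), each with three whitespace-separated fields.
input='chr1 100 geneA
chr2 250 geneB
chr1 300 geneC'

# Pattern: the first field equals "chr1".
# Action: print the record number (NR) and the third field.
result=$(printf '%s\n' "$input" | awk '$1 == "chr1" { print NR, $3 }')
echo "$result"
```

Only the first and third records match the pattern, so the action runs for those two lines and the second line is skipped.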

Why awk?

1. AWK is simpler to use than most conventional programming languages.
2. It is fast.
3. It has string manipulation functions, so it can search for particular strings and modify the output.
4. A version of AWK is a standard feature of nearly every modern Unix-like operating system.
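
To illustrate the string manipulation point, here is a small sketch using three standard awk functions (length, sub and toupper). The sample record is made up:

```shell
# Made-up sample record: an identifier and a short sequence.
result=$(echo 'contig_12 acgtacgt' | awk '{
    n = length($2)            # number of characters in the second field
    sub(/^contig_/, "", $1)   # strip the "contig_" prefix from the first field
    print $1, toupper($2), n  # print the edited id, the upper-cased sequence, its length
}')
echo "$result"
```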

Simple examples of how to use AWK:

Seb
http://www.pement.org/awk/awk1line.txt

Greg_B

# - SAM files - #
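# Count the number of reads aligning to each contig/chromosome and print the count and the percentage of all reads
# (header lines, which start with @, are skipped so they do not inflate the counts)
awk '!/^@/{c[$3]++; n++}END{for(j in c) print j,c[j],(c[j]/n*100),"%"}' Aligned.sam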

# - Blast files - #

# Not the prettiest (piping awk into awk) but gets the job done quickly, improvements welcome!
# count hits
wc -l blast_all_vs_all/trin_vs_trin.tab
3028796
# no self hits
awk '$1!=$2' blast_all_vs_all/trin_vs_trin.tab > blast_all_vs_all/no_self
wc -l blast_all_vs_all/no_self
2958010
# how many matches are longer than 200 bp?
awk '$4>200' blast_all_vs_all/no_self | wc -l
17867
# of those, how many have more than 80% identity?
awk '$4>200' blast_all_vs_all/no_self | awk '$3>80' | wc -l
14151
# over 90%
awk '$4>200' blast_all_vs_all/no_self | awk '$3>90' | wc -l
143
143
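
Since improvements were invited: the filters above can probably be combined into a single awk call, which avoids the intermediate file and the awk-into-awk pipes. This is only a sketch, assuming the same column layout as the commands above ($1 query, $2 subject, $3 percent identity, $4 alignment length). The four sample rows and the counter names are made up for illustration:

```shell
# Made-up rows: query, subject, percent identity, alignment length.
blast='a a 100.0 500
a b 85.5 250
a c 95.0 300
b c 99.0 150'

result=$(printf '%s\n' "$blast" | awk '
    $1 != $2 {                    # skip self hits
        n++                       # all non-self hits
        if ($4 > 200) {           # alignment longer than 200 bp
            len++
            if ($3 > 80) id80++   # of those, more than 80% identity
            if ($3 > 90) id90++   # of those, more than 90% identity
        }
    }
    END { print n+0, len+0, id80+0, id90+0 }')   # +0 prints 0 for unset counters
echo "$result"
```

On real data the printf pipe would be replaced by the file name, e.g. blast_all_vs_all/trin_vs_trin.tab as the last argument to awk, so the file is read only once.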

2 thoughts on “AWK (Seb)”

  1. Thanks Seb.
    I think that as you and others come up with one-liners to post they should probably be added to the body of this post rather than in the comments here or in a separate post.
    Dan.

  2. “I want to get the number of lines and file name from a directory of sam files
    but I only want to count lines in which column six does not equal ‘*’ and do not match ‘@SQ’.”

    awk '$6!="*" && $1!~/@SQ/ {sum[FILENAME]++} END {for (f in sum) print sum[f], f}' sam/* > sam_counts

    There are a few cool things to note here:
    sum[FILENAME]++ #we increment a separate count for each file as we go through it
    END #runs once, after the last file has been read, which is why the counts are stored per file
    FILENAME #this is a super handy built-in awk variable holding the name of the current input file
