SNP calling with ML (Greg B.)

Email for v7. Bug found that printed G’s as C’s and vise versa.

Call SNPs from sam files in a method similar to Hohenlohe et al 2010. Updated to v4 Feb 9. Previous version had a bug.

It is now fixed up for all of BWAs cigars flavors.

This only deals with reads that fit one of the following:
Full alignment (55M)
Soft clip at the start (10S45M)
Soft clip at end (45M10S)
One deletion (25M10D25M)
One insertion (20M10I20M)

This means it ignores reads with a cigar fields that have N, H, P, = or X and it ignores reads with a cigar more complicate then a single soft clip or a single indel. It also does not penalize reads adjacent to indels.

It ignores bases in soft clipped parts of reads

Continue reading

RAM cache problems on Linux (Nolan)

I just wanted to tell you about a common, and as far as I know unavoidable, problem that occurs in all Linux distributions I have worked with on machines that have large memory.  The RAM cache fills up and is not emptied properly.
The RAM cache is used extensively when large files are read and written, and for some reason the RAM cache gradually fills up due to file transfers and is not cleared very quickly.  In some cases when another program requires memory the memory just isn’t there – you can see now on hulk no memory is being used by programs, yet over 100 Gb is being used by the cache, so only about half of the RAM is listed as available.  In many cases this can use up so much RAM that the computer has to switch to using SWAP space, which happens regularly on the zoology cluster and all of our high-memory machines.  I have seen this on RedHat and related, Ubuntu, and OpenSuse, among others. The only way I have found to get rid of this problem is to manually clear the cache, which only works on our machines as you need root access.  Although this is not an ideal solution it hasn’t caused me problems so far.  When using high-RAM programs I frequently clear the RAM cache to make sure the Free Memory stays at the max. To do that:
sudo su
sync; echo 3 > /proc/sys/vm/drop_caches
exit

Don’t worry if a file transfer is occurring, the sync command protects it. But, don’t alter these commands unless you are sure you know what you are doing. I have talked to many very smart people about this and nobody has come up with a better solution. Let me know, though, if any of you have better ideas.

Population Genomics! (Greg B.)

Ask me for the most current version of these if you want to use them.

Several people in the lab are now working with very similar data (in structure) and have similar questions (in technique). As I have discussed with several people it would be very useful if we could all build up and share the tools needed to do these analysis. Understandably, people may want to do things on their own or in a specific manner, but I think there are several advantages to building this up together. The main thing is that it will be more efficient in terms of personnel time, having each person re-invent the wheel does not make sense. I think the blog is a great place to set do this. Below we can make our wish list and link to posts for solutions as they become available. As always, the principles (1,2 and 3) covered here apply.

Continue reading

AWK (Seb)

What is awk?

AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is of a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.” – Alfred V. Aho

Why awk?

1.AWK is simpler to use than most conventional programming languages.
2. It is fast.
3. It has string manipulation functions, so it can search for particular strings and modify the output.
4. A version of the AWK language is a standard feature of nearly every modern Unix-like operating system available today.

Simple examples on how to use AWK:
Continue reading

STACKS installation (Rose)

Installing stacks on Ubuntu Natty Narwhal or Oneiric Ocelot

STACKS is a piece of software produced by Julian Catchen in the Cresko lab. It’s designed to identify loci and alleles from RAD (or GBS) reads either de novo or after alignment to a reference. It consists of several modules that can be run separately, but to completely install it as a pipeline, it relies on a web server, unfortunately. Many of the required instructions are given in the README file, but because nobody in our lab is an expert on this, we had to fiddle around to get the program running on our Ubuntu machines.

Continue reading

How to post – code (Dan E.)

We have a problem sharing code via RLR.

The Problem

Unfortunately WordPress has a list of acceptable file types that it allows to be uploaded to our media library and none of the useful coding file types are on that list. The list is simply a list of acceptable file extensions. This means if you write a useful R script (or perl or python) script and save it with a standard file extension, like .R or .pl, WordPress will not allow you to upload it to the RLR media library so that you can share it via a post.

The Solution

The list of acceptable file extensions can be hacked and I might give it a try but, until I do, you will have to do one of these things:

  • Change the file extension. If you save your script as a .txt file it will upload fine. You should make it clear in your post what kind of script it is and then people who download it can change the .txt extension to whatever they want.
  • Put the code in your post. If your script is not too long you can simply copy and paste the code from your text editor into the post editor. The formatting of the code will remain true to the original so users can simply copy and paste it back out into a text editor or R-Studio or wherever. See Rose’s post about plotting STRUCTURE results for an example of this.
  • Compress your script file. If your script is big you can try zipping it and then uploading the compressed file. Users can then just download and unzip it. [As of November 2011 this hasn’t been tested.]

Dan E.

R script for plotting STRUCTURE results (Q values) (Rose)

This is an R Script that plots individual Q values and labels populations. It can be modified to take average group membership from CLUMPP output and/or to import different population names and higher level groupings from elsewhere.

N.B. I haven’t run this on very many data sets, so it will probably need to be tweaked for your results. But please leave a comment if you run into any problems.

Continue reading

Our favourite text editors (Rose)

I hope we can start a conversation about this because a good text editor can make a big difference to a newbie, so PLEASE REPLY!!! I wanted to proselytise about Npp, but it only runs on Windows. So if you use a different OS, please make that BLEEDINGLY OBVIOUS.

Notepad++ (WINDOWS )

I’ve tried a numerous text editors over the years (like Context), but Notepad++ (Npp) is easily my favourite. It only runs on Windows, but I use it to export Unix formatted files routinely. You can set shortcut keys to change formats very easily. Npp can highlight lots of languages, including R, perl and unix. You can also define your own languages for highlighting – I did that to make my Migrate parameter files easier to read.
Continue reading

Filtering unmapped/unaligned reads from SAM files (Rose)

This is a post about some time-saving help Chris Grassa gave me.

STACKS (post coming soon) doesn’t deal well with all of the unaligned reads in SAM files, so I tried using PICARD to remove them. However, PICARD doesn’t like the SAM output of BWA, but Chris G showed me how to use the Unix command awk to do it much more easily. This is his command for my file 1076.sam:
Continue reading