Merge SNP calls (Greg B.)

Posted on January 11, 2012 by Greg B.

Rose coded this up as a faster and efficient way to combine all the snp calls into one table. I’ve made a few modifications, hopefully its not broken. Updates are likely in the future.

Continue reading →

SNP calling with ML (Greg B.)

Posted on January 11, 2012 by Greg B.

Email for v7. Bug found that printed G’s as C’s and vise versa.

Call SNPs from sam files in a method similar to Hohenlohe et al 2010. Updated to v4 Feb 9. Previous version had a bug.

It is now fixed up for all of BWAs cigars flavors.

This only deals with reads that fit one of the following:
Full alignment (55M)
Soft clip at the start (10S45M)
Soft clip at end (45M10S)
One deletion (25M10D25M)
One insertion (20M10I20M)

This means it ignores reads with a cigar fields that have N, H, P, = or X and it ignores reads with a cigar more complicate then a single soft clip or a single indel. It also does not penalize reads adjacent to indels.

It ignores bases in soft clipped parts of reads

Continue reading →

RAM cache problems on Linux (Nolan)

Posted on January 11, 2012 by Nolan

I just wanted to tell you about a common, and as far as I know unavoidable, problem that occurs in all Linux distributions I have worked with on machines that have large memory. The RAM cache fills up and is not emptied properly.
The RAM cache is used extensively when large files are read and written, and for some reason the RAM cache gradually fills up due to file transfers and is not cleared very quickly. In some cases when another program requires memory the memory just isn’t there – you can see now on hulk no memory is being used by programs, yet over 100 Gb is being used by the cache, so only about half of the RAM is listed as available. In many cases this can use up so much RAM that the computer has to switch to using SWAP space, which happens regularly on the zoology cluster and all of our high-memory machines. I have seen this on RedHat and related, Ubuntu, and OpenSuse, among others. The only way I have found to get rid of this problem is to manually clear the cache, which only works on our machines as you need root access. Although this is not an ideal solution it hasn’t caused me problems so far. When using high-RAM programs I frequently clear the RAM cache to make sure the Free Memory stays at the max. To do that:
sudo su
sync; echo 3 > /proc/sys/vm/drop_caches
exit

Don’t worry if a file transfer is occurring, the sync command protects it. But, don’t alter these commands unless you are sure you know what you are doing. I have talked to many very smart people about this and nobody has come up with a better solution. Let me know, though, if any of you have better ideas.

SNP table parsing (Greg B.)

Posted on January 10, 2012 by Greg B.

Ask me for the most current version if you want to use any of these!

A few perl scripts that take a SNP table and do the following:
1. Remove unwanted samples and rename the samples
2. Remove sites that do not have enough data
3. Order the sites based on a map

Continue reading →

Population Genomics! (Greg B.)

Posted on January 10, 2012 by Greg B.

Ask me for the most current version of these if you want to use them.

Several people in the lab are now working with very similar data (in structure) and have similar questions (in technique). As I have discussed with several people it would be very useful if we could all build up and share the tools needed to do these analysis. Understandably, people may want to do things on their own or in a specific manner, but I think there are several advantages to building this up together. The main thing is that it will be more efficient in terms of personnel time, having each person re-invent the wheel does not make sense. I think the blog is a great place to set do this. Below we can make our wish list and link to posts for solutions as they become available. As always, the principles (1,2 and 3) covered here apply.

Continue reading →

AWK (Seb)

Posted on January 6, 2012 by Seb

What is awk?

“AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is of a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.” – Alfred V. Aho

Why awk?

1.AWK is simpler to use than most conventional programming languages.
2. It is fast.
3. It has string manipulation functions, so it can search for particular strings and modify the output.
4. A version of the AWK language is a standard feature of nearly every modern Unix-like operating system available today.

Simple examples on how to use AWK:
Continue reading →

SNP summary statistics in R: ‘hierfstat’ is back and better than before! (Rose)

Posted on January 2, 2012 by Rose

After being disabled and not supported for several months, ‘hierfstat’ (by Jerome Goudet) now has lots of useful (and fast) calculations of summary statistics, including expected and observed heterozygosity, Fst and Jost’s Dest.

Continue reading →

STACKS installation (Rose)

Posted on December 12, 2011 by Rose

Installing stacks on Ubuntu Natty Narwhal or Oneiric Ocelot

STACKS is a piece of software produced by Julian Catchen in the Cresko lab. It’s designed to identify loci and alleles from RAD (or GBS) reads either de novo or after alignment to a reference. It consists of several modules that can be run separately, but to completely install it as a pipeline, it relies on a web server, unfortunately. Many of the required instructions are given in the README file, but because nobody in our lab is an expert on this, we had to fiddle around to get the program running on our Ubuntu machines.

Continue reading →

BLAST databases (Seb)

Posted on December 9, 2011 by Seb

I’ve uploaded two BLAST protein databases one nucleotide database here: /Linux/Loren/blast_database

Please look at the README file for more info and append to it if your modify things or add databases. Continue reading →

Ordered Transcriptomes (Greg B.)

Posted on December 8, 2011 by Greg B.

Here are the transcriptome assemblies ordered based on the map Chris made. The columns are contig name, Linkage group, and CentiMorgan.

The Trinity Assembly ordered
The 16000 gene assembly

Greg B.

Removing adaptors, low-complexity and low-quality from fastq (Nolan)

Posted on December 7, 2011 by Nolan

Thuy and I have made several perl scripts to trim Illumina reads, available on the zoology cluster at:

/Linux/Loren/Seq/trimIlluminaFqQual20.pl

/Linux/Loren/Seq/trimIlluminaFqQual20Phred.pl

Continue reading →

Multiple software installs or updates simultaneously (Kathryn)

Posted on November 17, 2011 by Kathryn

This is a great option for setting up new computers with multiple free software packages at the same time: Ninite.

Continue reading →

How to post – code (Dan E.)

Posted on November 16, 2011 by Dan E.

We have a problem sharing code via RLR.

The Problem

Unfortunately WordPress has a list of acceptable file types that it allows to be uploaded to our media library and none of the useful coding file types are on that list. The list is simply a list of acceptable file extensions. This means if you write a useful R script (or perl or python) script and save it with a standard file extension, like .R or .pl, WordPress will not allow you to upload it to the RLR media library so that you can share it via a post.

The Solution

The list of acceptable file extensions can be hacked and I might give it a try but, until I do, you will have to do one of these things:

Change the file extension. If you save your script as a .txt file it will upload fine. You should make it clear in your post what kind of script it is and then people who download it can change the .txt extension to whatever they want.
Put the code in your post. If your script is not too long you can simply copy and paste the code from your text editor into the post editor. The formatting of the code will remain true to the original so users can simply copy and paste it back out into a text editor or R-Studio or wherever. See Rose’s post about plotting STRUCTURE results for an example of this.
Compress your script file. If your script is big you can try zipping it and then uploading the compressed file. Users can then just download and unzip it. [As of November 2011 this hasn’t been tested.]

Dan E.

R script for plotting STRUCTURE results (Q values) (Rose)

Posted on November 16, 2011 by Rose

This is an R Script that plots individual Q values and labels populations. It can be modified to take average group membership from CLUMPP output and/or to import different population names and higher level groupings from elsewhere.

N.B. I haven’t run this on very many data sets, so it will probably need to be tweaked for your results. But please leave a comment if you run into any problems.

Continue reading →

Our favourite text editors (Rose)

Posted on November 9, 2011 by Rose

I hope we can start a conversation about this because a good text editor can make a big difference to a newbie, so PLEASE REPLY!!! I wanted to proselytise about Npp, but it only runs on Windows. So if you use a different OS, please make that BLEEDINGLY OBVIOUS.

Notepad++ (WINDOWS )

I’ve tried a numerous text editors over the years (like Context), but Notepad++ (Npp) is easily my favourite. It only runs on Windows, but I use it to export Unix formatted files routinely. You can set shortcut keys to change formats very easily. Npp can highlight lots of languages, including R, perl and unix. You can also define your own languages for highlighting – I did that to make my Migrate parameter files easier to read.
Continue reading →

screen (Greg B.)

Posted on November 8, 2011 by Greg B.

This is a useful tool for anyone who uses ssh. It allows you to close your terminal and still have processes running. This is very useful if you are using a laptop for example. You can check in on the process from anywhere afterwards.

Continue reading →

bam file to fastq conversion (Chris)

Posted on November 6, 2011 by Chris

The GSC supplies our raw sequence reads as bam files. Some programs will take unaligned bam files as input (the bwa is one), but many still do not. A much more flexible format is FASTQ. Here is a link to bam2fastq, a simple little conversion program:
Continue reading →

password-less ssh login (Nolan)

Posted on November 2, 2011 by Nolan

This is largely copied from http://rcsg-gsir.imsb-dsgi.nrc-cnrc.gc.ca/documents/internet/node31.html

It’s quite easy, too, at least if you are running Linux (not sure how well it works on other systems, but I believe Macs are easy too), and its much more secure than a password. You can actually disable passwords entirely for your account, making it impossible for anyone to log in as you unless doing so from one of your authorized computers.
I would strongly encourage you all to make very long, cumbersome passwords, and then use this approach as you will never have to use your password again as long as you are logging in from one of the machines that already has it’s key copied to your .ssh/authorized_keys
Continue reading →

Filtering unmapped/unaligned reads from SAM files (Rose)

Posted on November 2, 2011 by Rose

This is a post about some time-saving help Chris Grassa gave me.

STACKS (post coming soon) doesn’t deal well with all of the unaligned reads in SAM files, so I tried using PICARD to remove them. However, PICARD doesn’t like the SAM output of BWA, but Chris G showed me how to use the Unix command awk to do it much more easily. This is his command for my file 1076.sam:
Continue reading →

Installing R packages and running scripts from the command line (Seb)

Posted on October 28, 2011 by Seb

Installing R packages from the command line (on the cluster and/or on your own computer)

Note that in this post, lines preceded by the dollar sign ($) mean commands typed directly from your shell session. Lines preceded by the greaterthan sign (>), mean commands typed from an R session.
Continue reading →

Rieseberg Lab Resources

RLR: Technical resources for Rieseberglers

Category Archives: Bioinformatics

Merge SNP calls (Greg B.)

SNP calling with ML (Greg B.)

RAM cache problems on Linux (Nolan)

SNP table parsing (Greg B.)

Population Genomics! (Greg B.)

AWK (Seb)

SNP summary statistics in R: ‘hierfstat’ is back and better than before! (Rose)

STACKS installation (Rose)

BLAST databases (Seb)

Ordered Transcriptomes (Greg B.)

Removing adaptors, low-complexity and low-quality from fastq (Nolan)

Multiple software installs or updates simultaneously (Kathryn)

How to post – code (Dan E.)

The Problem

The Solution

R script for plotting STRUCTURE results (Q values) (Rose)

Our favourite text editors (Rose)

screen (Greg B.)

bam file to fastq conversion (Chris)

Filtering unmapped/unaligned reads from SAM files (Rose)

Installing R packages and running scripts from the command line (Seb)