Rose coded this up as a faster and efficient way to combine all the snp calls into one table. I’ve made a few modifications, hopefully its not broken. Updates are likely in the future.
Category Archives: Bioinformatics
SNP calling with ML (Greg B.)
Email for v7. Bug found that printed G’s as C’s and vise versa.
Call SNPs from sam files in a method similar to Hohenlohe et al 2010. Updated to v4 Feb 9. Previous version had a bug.
It is now fixed up for all of BWAs cigars flavors.
This only deals with reads that fit one of the following:
Full alignment (55M)
Soft clip at the start (10S45M)
Soft clip at end (45M10S)
One deletion (25M10D25M)
One insertion (20M10I20M)
This means it ignores reads with a cigar fields that have N, H, P, = or X and it ignores reads with a cigar more complicate then a single soft clip or a single indel. It also does not penalize reads adjacent to indels.
It ignores bases in soft clipped parts of reads
RAM cache problems on Linux (Nolan)
I just wanted to tell you about a common, and as far as I know unavoidable, problem that occurs in all Linux distributions I have worked with on machines that have large memory. The RAM cache fills up and is not emptied properly.
The RAM cache is used extensively when large files are read and written, and for some reason the RAM cache gradually fills up due to file transfers and is not cleared very quickly. In some cases when another program requires memory the memory just isn’t there – you can see now on hulk no memory is being used by programs, yet over 100 Gb is being used by the cache, so only about half of the RAM is listed as available. In many cases this can use up so much RAM that the computer has to switch to using SWAP space, which happens regularly on the zoology cluster and all of our high-memory machines. I have seen this on RedHat and related, Ubuntu, and OpenSuse, among others. The only way I have found to get rid of this problem is to manually clear the cache, which only works on our machines as you need root access. Although this is not an ideal solution it hasn’t caused me problems so far. When using high-RAM programs I frequently clear the RAM cache to make sure the Free Memory stays at the max. To do that:
sudo su
sync; echo 3 > /proc/sys/vm/drop_caches
exit
Don’t worry if a file transfer is occurring, the sync command protects it. But, don’t alter these commands unless you are sure you know what you are doing. I have talked to many very smart people about this and nobody has come up with a better solution. Let me know, though, if any of you have better ideas.
SNP table parsing (Greg B.)
Ask me for the most current version if you want to use any of these!
A few perl scripts that take a SNP table and do the following:
1. Remove unwanted samples and rename the samples
2. Remove sites that do not have enough data
3. Order the sites based on a map
Population Genomics! (Greg B.)
Ask me for the most current version of these if you want to use them.
Several people in the lab are now working with very similar data (in structure) and have similar questions (in technique). As I have discussed with several people it would be very useful if we could all build up and share the tools needed to do these analysis. Understandably, people may want to do things on their own or in a specific manner, but I think there are several advantages to building this up together. The main thing is that it will be more efficient in terms of personnel time, having each person re-invent the wheel does not make sense. I think the blog is a great place to set do this. Below we can make our wish list and link to posts for solutions as they become available. As always, the principles (1,2 and 3) covered here apply.
AWK (Seb)
What is awk?
“AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is of a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.” – Alfred V. Aho
Why awk?
1.AWK is simpler to use than most conventional programming languages.
2. It is fast.
3. It has string manipulation functions, so it can search for particular strings and modify the output.
4. A version of the AWK language is a standard feature of nearly every modern Unix-like operating system available today.
Simple examples on how to use AWK:
Continue reading
SNP summary statistics in R: ‘hierfstat’ is back and better than before! (Rose)
After being disabled and not supported for several months, ‘hierfstat’ (by Jerome Goudet) now has lots of useful (and fast) calculations of summary statistics, including expected and observed heterozygosity, Fst and Jost’s Dest.
STACKS installation (Rose)
Installing stacks on Ubuntu Natty Narwhal or Oneiric Ocelot
STACKS is a piece of software produced by Julian Catchen in the Cresko lab. It’s designed to identify loci and alleles from RAD (or GBS) reads either de novo or after alignment to a reference. It consists of several modules that can be run separately, but to completely install it as a pipeline, it relies on a web server, unfortunately. Many of the required instructions are given in the README file, but because nobody in our lab is an expert on this, we had to fiddle around to get the program running on our Ubuntu machines.
BLAST databases (Seb)
I’ve uploaded two BLAST protein databases one nucleotide database here: /Linux/Loren/blast_database
Please look at the README file for more info and append to it if your modify things or add databases. Continue reading
Ordered Transcriptomes (Greg B.)
Here are the transcriptome assemblies ordered based on the map Chris made. The columns are contig name, Linkage group, and CentiMorgan.
The Trinity Assembly ordered
The 16000 gene assembly
Greg B.
Removing adaptors, low-complexity and low-quality from fastq (Nolan)
Thuy and I have made several perl scripts to trim Illumina reads, available on the zoology cluster at:
/Linux/Loren/Seq/trimIlluminaFqQual20.pl
/Linux/Loren/Seq/trimIlluminaFqQual20Phred.pl
Multiple software installs or updates simultaneously (Kathryn)
This is a great option for setting up new computers with multiple free software packages at the same time: Ninite.
How to post – code (Dan E.)
We have a problem sharing code via RLR.
The Problem
Unfortunately WordPress has a list of acceptable file types that it allows to be uploaded to our media library and none of the useful coding file types are on that list. The list is simply a list of acceptable file extensions. This means if you write a useful R script (or perl or python) script and save it with a standard file extension, like .R or .pl, WordPress will not allow you to upload it to the RLR media library so that you can share it via a post.
The Solution
The list of acceptable file extensions can be hacked and I might give it a try but, until I do, you will have to do one of these things:
- Change the file extension. If you save your script as a .txt file it will upload fine. You should make it clear in your post what kind of script it is and then people who download it can change the .txt extension to whatever they want.
- Put the code in your post. If your script is not too long you can simply copy and paste the code from your text editor into the post editor. The formatting of the code will remain true to the original so users can simply copy and paste it back out into a text editor or R-Studio or wherever. See Rose’s post about plotting STRUCTURE results for an example of this.
- Compress your script file. If your script is big you can try zipping it and then uploading the compressed file. Users can then just download and unzip it. [As of November 2011 this hasn’t been tested.]
Dan E.
R script for plotting STRUCTURE results (Q values) (Rose)
This is an R Script that plots individual Q values and labels populations. It can be modified to take average group membership from CLUMPP output and/or to import different population names and higher level groupings from elsewhere.
N.B. I haven’t run this on very many data sets, so it will probably need to be tweaked for your results. But please leave a comment if you run into any problems.
Our favourite text editors (Rose)
I hope we can start a conversation about this because a good text editor can make a big difference to a newbie, so PLEASE REPLY!!! I wanted to proselytise about Npp, but it only runs on Windows. So if you use a different OS, please make that BLEEDINGLY OBVIOUS.
Notepad++ (WINDOWS )
I’ve tried a numerous text editors over the years (like Context), but Notepad++ (Npp) is easily my favourite. It only runs on Windows, but I use it to export Unix formatted files routinely. You can set shortcut keys to change formats very easily. Npp can highlight lots of languages, including R, perl and unix. You can also define your own languages for highlighting – I did that to make my Migrate parameter files easier to read.
Continue reading
screen (Greg B.)
This is a useful tool for anyone who uses ssh. It allows you to close your terminal and still have processes running. This is very useful if you are using a laptop for example. You can check in on the process from anywhere afterwards.
bam file to fastq conversion (Chris)
The GSC supplies our raw sequence reads as bam files. Some programs will take unaligned bam files as input (the bwa is one), but many still do not. A much more flexible format is FASTQ. Here is a link to bam2fastq, a simple little conversion program:
Continue reading
password-less ssh login (Nolan)
This is largely copied from http://rcsg-gsir.imsb-dsgi.nrc-cnrc.gc.ca/documents/internet/node31.html
It’s quite easy, too, at least if you are running Linux (not sure how well it works on other systems, but I believe Macs are easy too), and its much more secure than a password. You can actually disable passwords entirely for your account, making it impossible for anyone to log in as you unless doing so from one of your authorized computers.
I would strongly encourage you all to make very long, cumbersome passwords, and then use this approach as you will never have to use your password again as long as you are logging in from one of the machines that already has it’s key copied to your .ssh/authorized_keys
Continue reading
Filtering unmapped/unaligned reads from SAM files (Rose)
This is a post about some time-saving help Chris Grassa gave me.
STACKS (post coming soon) doesn’t deal well with all of the unaligned reads in SAM files, so I tried using PICARD to remove them. However, PICARD doesn’t like the SAM output of BWA, but Chris G showed me how to use the Unix command awk to do it much more easily. This is his command for my file 1076.sam:
Continue reading
Installing R packages and running scripts from the command line (Seb)
Installing R packages from the command line (on the cluster and/or on your own computer)
Note that in this post, lines preceded by the dollar sign ($) mean commands typed directly from your shell session. Lines preceded by the greaterthan sign (>), mean commands typed from an R session.
Continue reading