Blast2GO

This post describes how to run Blast2GO on a server using b2gpipe and a local database. This makes Blast2GO a viable option for annotating large fasta files; otherwise it is much too slow. The database is currently set up on an AdapTree server. It took me a while to troubleshoot, and you could run into different problems, but you will hopefully avoid some of the issues I ran into. The b2g Google group is good for troubleshooting. You can find many of these instructions at http://www.blast2go.com/b2glaunch/resources/35-localb2gdb

Allowed File Types At RLR – you can now upload scripts with their usual file extensions

Hello All,

Many of us have been annoyed by the restricted file types that WordPress allows to be uploaded to RLR. It’s especially annoying because all WordPress is doing when it permits or denies an upload is checking the file extension against a list of allowable extensions. {Even the most malicious code could be uploaded to our blog as long as it had a .txt file extension. Whether that code could then be made to execute, however, is far beyond my web-programming grasp – WordPress would treat it as plain text so it may be impossible.}

We’ve been sharing code via RLR by sidestepping the file extension rules and uploading scripts as .txt text files or by compressing files into zip archives or just putting the code itself into posts. Admittedly these were simple solutions, but now it’s even simpler – I just added some of the relevant file extensions to the list that RLR will allow for upload.

I added: “.pl”, “.py”, “.sh”, “.R”, “.r” and “.kml”.

Any file with one of those extensions will upload as plain text, i.e. WordPress will treat it as a text file.

If I’ve omitted something useful let me know.

Please remember that code can simply be copied into the body of a post and that will often be the best way to share it. But, in addition to that presentation, and especially for long scripts, you can now upload the script with its file extension to the RLR media library and put a link to it in your helpful post explaining what it does.

Dan.

CheapEasy DIY Barcodes in R

I couldn’t believe how expensive the software was for writing barcodes, so I wrote a short program in R to do it for FREE. And frankly, it should be faster and easier if you already have your labels in an Excel file. You don’t really need to understand the program or even R functions to use it, as long as you know how to run an R program.

Setup and Overview:

[UPDATED (see notes below)] – R-code. Start with this (Note I could not upload a .R file, so this is .txt but still an R program).

Input – barcodes128.csv – You need this file to run the program. Save it in your working directory (see the comments in the R code for how to set this). You also need labels.csv – this is a sample file showing the format for your labels. Even though it’s a .csv, it is a single column with each label on a separate row, so there are no actual commas.

Output – BarcodesOut.pdf – A sample output: a pdf file for the 0.5″x1.75″ Worth Poly Label WP0517 (Polyester Label Stock), currently in the lab.

That’s really all you need to know; everything that follows is extraneous info. If you have any problems, check out the Detailed Instructions, Troubleshooting Tips, or add a comment below.

Old lab PC – new Ubuntu computer

I’ve installed the latest version of Ubuntu (12.04) on the old PC lab computer:

-Username, computer name and password are written on the computer itself, if needed.
-I’ve also installed on it a few of my favorite programs (LibreOffice, Inkscape, Gimp, R, Chrome).
-It boots in about 35 seconds, not bad for an “old piece of junk”!

Feel free to use it!

seb

SnoWhite Tips and Troubleshooting (Thuy)

SnoWhite is a tool for cleaning 454 and Illumina reads. There are quite a few gotchas that can take you half a day to debug. This wiki has a lot of good tips.

SnoWhite invokes other bioinformatics programs, one of them being TagDust. If you get a segfault error from TagDust, it may be because you are searching for contaminant sequences larger than TagDust can handle. TagDust can handle at most 1000 characters per line in the contaminant fasta file and contaminant sequences of at most 1000 bases.

A segfault (or segmentation fault) happens when a program accesses the wrong piece of memory. Once a line or a sequence exceeds the 1000-character limit, TagDust keeps trying to access memory past the 1000 memory slots it has allocated. It may try to access non-existent or off-limits memory locations. You need to edit the TagDust source code so that it allocates enough memory for the sequences and does not wander into bad memory locations.

  • Go into your TagDust source code directory and edit file “input.c”.
  • Go to line 68:

char line[MAX_LINE];

  • Change MAX_LINE to a number larger than the number of characters in the longest line in your contaminant fasta file.  You probably can skip this step if you are using the NCBI UniVec.fasta files, since the default of 1000 is enough.
  • Go to line 69:

char tmp_seq[MAX_LINE];

  • Change MAX_LINE to a number larger than the number of bases in the longest contaminant sequence in your contaminant fasta file.  I tried 1000000 with a recent NCBI UniVec.fasta file and it worked for me.
  • Recompile your TagDust source code.
    • Delete all the existing executables by executing make clean in the same directory as the Makefile.
    • Compile all your files again by executing make in the same directory as the Makefile.
    • If you allocated a lot of memory to your arrays, so that the program’s statically allocated data exceeds 2GB, you may run into “relocation truncated to fit: R_X86_64_PC32 against symbol” errors at link time. This occurs because gcc’s default memory model assumes that the program’s code and statically allocated objects fit within 2GB. Edit the Makefile so that

CC = gcc
becomes
CC = gcc -mcmodel=medium
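
To pick sensible values for the two array sizes, it helps to measure the contaminant file first. Here is a minimal Python sketch that does this. It is not part of SnoWhite or TagDust, and the function name fasta_limits is made up for illustration. It assumes a plain-text fasta file in which header lines start with ">".

```python
def fasta_limits(path):
    """Return (longest_line, longest_sequence) for a fasta file.

    longest_line     -- characters in the longest line, headers included
    longest_sequence -- bases in the longest record, summed over its lines
    """
    longest_line = 0
    longest_seq = 0
    current_seq = 0
    with open(path) as handle:
        for raw in handle:
            line = raw.rstrip("\r\n")
            longest_line = max(longest_line, len(line))
            if line.startswith(">"):
                # A new record starts: close off the previous one.
                longest_seq = max(longest_seq, current_seq)
                current_seq = 0
            else:
                current_seq += len(line.strip())
    # The last record has no header after it, so close it here.
    longest_seq = max(longest_seq, current_seq)
    return longest_line, longest_seq
```

Calling fasta_limits("UniVec.fasta") (or whatever your contaminant file is called) returns the two numbers. Choose array sizes comfortably larger than both, since a C character array also needs room for the line ending and the string terminator.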

Compiled Sunflower QTLs (GregO)

Last year I worked on a project to see if any of the domestication outlier genes coincided with previously mapped QTLs. The project ultimately fell flat when new data showed that the outlier I was working on wasn’t an outlier, but I did compile a large table of sunflower QTLs which may be useful. The table has 369 mapped QTLs.

I’ve shared this with a couple of people, but I’m posting it here on a google doc for everyone to use. Here is the link: https://docs.google.com/spreadsheet/ccc?key=0AgfXIvTZMEqPdHdJWTk3UVlVa3dkdGFTak9ySlUtNkE

A couple notes:
-It was compiled about a year ago, so it may be out of date. Also, although I tried to include every applicable study, I may have missed some. If you do find a study that I missed, I encourage you to add it to the table.
-It is only from annuus crosses, and a majority are domestics
-The position values are in cM

Anyway, read and enjoy. Change it if you find errors or new papers!

Jaatha – training data sets (Rose)

I’ve generated three training data sets, which will save you around 5 days if you decide to run Jaatha, a molecular demography program. It uses the joint site frequency spectrum of two populations to model various aspects of population history (split time, population size and growth, migration). Here’s the paper: Naduvilezhath et al 2011.

1. Using the default model, with the following maxima: tmax=20, mmax=5, qmax=10.

2. Alternative maxima: tmax=5, mmax=20, qmax=20.

3. Alternative maxima: tmax=5, mmax=20, qmax=5.

They can’t be uploaded because they’re compressed R data structures, but let me know if you’d like to give them a whirl.

Illumina Sequencing Adapters and Barcodes (Dan E.)

As of March 2012 we are using the Bioo Scientific NEXTflex barcoded adapters for the WGS sequencing libraries we make ourselves (well, me so far). The set we are currently using comprises 48 barcodes, so we can multiplex up to 48 samples in one lane on the Illumina HiSeq sequencer.

Bioo Sci. 48 barcoded adapters

Below are the sequences of the Illumina adapters and the 48 barcodes we are currently using.

Using RSEM to estimate gene and isoform expression (Sam)

RSEM is a relatively new bioinformatics tool that has been developed in conjunction with Trinity for the analysis of RNAseq data. RSEM can be used to estimate expression levels for both genes and different isoforms of genes, and is quite quick and easy to use, with an excellent Google group for help (“RSEM users”). All it requires is an RNAseq dataset (in either fasta or fastq format) and a reference transcriptome that the reads can be aligned to.

Turning your SNP table into a STRUCTURE input file (Brook)

I suspect there are probably several homemade versions of this kind of script kicking around, but here is a perl script I’ve written for turning your SNP table into a STRUCTURE input file. To use it, you should change the .txt to a .pl after downloading the script. More on STRUCTURE input files (and so much more!) is in the documentation here.

If BWA wants *.nt.ann file… (Rose)

Recently BWA (an alignment program) suddenly started giving a strange error message, indicating that a reference file ending in *.nt.ann was missing. This file type was unfamiliar to me, with good reason: it’s a colourspace reference file, which shouldn’t be generated when we index the fasta-based references we’re using (at least, I don’t know of anyone in our lab using SOLiD data as a reference). DO NOT rebuild the reference with the -c (colourspace) flag, as you might see suggested on the web, because we don’t know what effect that might have on our alignments. DO rebuild it with the usual settings.
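
Before rebuilding, it can be worth checking which index files actually sit next to the reference. The Python sketch below is only an illustration. The function names are made up, and it assumes the index was built with the fasta path itself as the prefix. The five suffixes listed are the ones current versions of bwa index write for a nucleotide reference; older versions write some additional files, so treat the list as an assumption to adjust for your install.

```python
import os

# Assumed suffixes written by `bwa index` for a nucleotide reference.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_index_files(reference):
    """List the expected index files that do not exist next to `reference`."""
    return [
        reference + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not os.path.exists(reference + suffix)
    ]


def rebuild_command(reference):
    """Command for a default rebuild; deliberately has no -c flag."""
    return ["bwa", "index", reference]
```

If missing_index_files returns anything, the list from rebuild_command can be passed to subprocess.run (or typed by hand as bwa index followed by the reference path) to rebuild the index with the usual settings.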

Bioportal (Rose)

Bioportal is a free computing resource that provides several applications in our area. I’ve been running STRUCTURE on both the “low priority” and normal queues and it’s been fantastic (unlike Westgrid, who haven’t even responded to my application). For those of you who are struggling to find room on the cluster, it might be useful to you too. Much as I’d like to keep it to myself and exploit the hell out of it, here’s the address:

https://www.bioportal.uio.no/