Comparing aligners

When analyzing genomic data, we first need to align to the genome. There are a lot of possible choices in this, including BWA (medium choice), stampy (very accurate) and bowtie2 (very fast). Recently a new aligner came out, NextGenMap. It claims to be both faster and deal with divergent read data better than other methods. Continue reading

Phred score detector

When using old sequence data you have to know what phred score system it is using. Almost all new data is going to use Phred33, but the old stuff could also be Phred64. There are scripts on this website that have ways of automatically detecting it

I took the perl script from this website and parred down the output so that it works in pipelines. It outputs either “phred33” or “phred64”

An example pipeline in bash is:

coding=”$(perl $bin/ ${name}_1.fq)”

java -jar trimmmomatic-0.32.jar -$coding sample1.fq ….etc

NOTE: Phred64 should only go up to 104. There are some RNA seq samples (and probably others) that go up to 105. The original script output an error, while my version just says phred64. I hope that running it through phred conversion in trimmomatic fixes the issue but I am not sure.

Home-brew WGS library multiplexing

There are two main ways to barcode WGS libraries so that they can be run together on a same lane:

– In-line barcodes: unique sequences are located at the very end of one or both adapters. This sequence will be at the very beginning of each read from a given library. This is the barcode system that is normally used fro GBS libraries as well.

– Indices: barcodes are in the middle of one or both adapters. These barcodes are read through an independent round of sequencing. For a paired-end library you would have therefore two rounds of sequencing of your fragment and a third round of sequencing for the index (and I guess a fourth one as well, if you have double indices). This is the system used in most commercial kits.

Continue reading

The limits of GBS sample size


I’ve been doing work on a stickleback GBS dataset and we’re trying to figure out how many samples we can cram into a single lane of illumina. I did some analyses which people may find useful. It’s unclear how applicable the recommendations are for sunflower which seems to have more problems than fish.

Take home message, with 25% less data you lose around 15% of the genotype calls, but up to 50% of the SNPs if you use a stringent coverage filter, due to how the lost data is distributed among loci and individuals.

(Probably the closest you can get to) Home-brew Illumina WGS libraries

As some of you might know, I have been working for the last few months on optimizing a protocol for Illumina WGS libraries that will reduce our dependency on expensive kits without sacrificing quality. The ultimate goal would be to be able to use WGS libraries as a more expensive but hopefully more informative alternative genotyping tool to GBS. Getting to that point ideally requires to develop:

1) A cheaper alternative for library preparation (this post)

2) A reliable multiplexing system (this other post)

3) A way to shrink the sunflower genome before sequencing it (because, as you know, it’s rather huge) (yet another post)

The following protocol is for non-multiplexed libraries. The protocol for multiplexed ones is actually identical, you just need to change adapters and PCR primers – more about that in the multiplexing post.

If you are planning to pool libraries and deplete them of repetitive elements, read carefully all three posts before starting your libraries (mostly because you might need to use different adapters and PCR primers)

Continue reading

SnoWhite Tips and Troubleshooting (Thuy)

Snowhite is a tool for cleaning 454 and illumina reads.  There are quite a few gotchas that will take you half a day to debug.  This wiki has a lot of good tips.

Snowhite invokes other bioinformatics programs, one of them being TagDust.  If you get a segfault error from TagDust, it may be because you are searching for  contaminant sequences larger than TagDust can handle.  TagDust can only handle maximum 1000 characters per line in the contaminant fasta file and maximum 1000 base contaminant sequence lengths.

A segfault (or segmentation fault) happens when a  program accesses the wrong piece of memory.  After TagDust hits the 1000 line character/sequence base limit, TagDust keeps trying to access memory past the 1000 memory slots it has allocated.  It may try to access non-existent memory locations or off-limits memory locations.  You need to edit the TagDust source  code so it allocates enough memory for the sequences and does not wander into bad memory locations.

  • Go into your TagDust source code directory and edit file “input.c”.
  • Go to line 68:

char line[MAX_LINE];

  • Change MAX_LINE to a number larger than the number of characters in the longest line in your contaminant fasta file.  You probably can skip this step if you are using the NCBI UniVec.fasta files, since the default of 1000 is enough.
  • Go to line 69:

char tmp_seq[MAX_LINE];

  • Change MAX_LINE to a number larger than the number of bases in the longest contaminant sequence in your contaminant fasta file.  I tried 1000000 with a recent NCBI UniVec.fasta file and it worked for me.
  • Recompile your TagDust source code
    • Delete all the existing executables by executing  make clean in the same directory as the Makefile
    • Compile all your files again by executing make clean in the same directory as the Makefile
    • If you decided to allocate a lot of memory to your arrays, and your program requires > 2GB of memory at compile time, you may run into “relocation truncated to fit: R_X86_64_PC32 against symbol” errors during linkage.  This occurs when the compiler is unable to allocate enough space for the program’s statically allocated objects.  Edit the Makefile so that

CC = gcc
CC = gcc -mcmodel=medium

Making Illumina Whole Genome Shotgun Sequencing Libraries – (Dan E.)

I’ve been making whole genome shotgun sequencing libraries (for the purposes of this post: WGSS libraries) to sequence sunflower genomes on the Biodiversity Centre’s Illumina HiSeq. I haven’t been doing it for very long and its likely that my approach will change in the future as costs and products change but, as of early 2012, I’ve landed on a hybrid protocol based on kits from an outfit called Bioo Scientific. I use the Bioo Sci. adapter kit and their library prep kit up to the final PCR step at which point I switch to a PCR kit from another outfit called KAPA. I also use a KAPA kit to quant libraries with qPCR. In this post I give a little context then describe what I do to make WGSS libraries . . . Continue reading

Illumina Sequencing Adapters and Barcodes (Dan E.)

As of March 2012 we are using the Bioo Scientific NEXTflex barcoded adapters for WGS sequencing libraries made by ourselves, (well me so far). The set we are currently using comprises 48 barcodes, so we can multiplex up to a 48-plex in one lane on the Illumina HiSeq sequencer.

Bioo Sci. 48 barcoded adapters

Below are the sequences of the Illumina adapters and the 48 barcodes we are currently using. Continue reading