Better (but not perfect) HMW DNA extraction protocol

I wrote some time ago about the protocol I used to prepare HMW DNA for the new HA412 assembly. The advantage of that protocol is that it doesn’t need much starting tissue, it’s quick, and it can work quite well. However, it is also quite unreliable and will sometimes fail miserably.

To prepare HMW DNA for H. anomalus I tried a different protocol, suggested by Allen Van Deynze at UC Davis. They used it on pepper to prepare HMW DNA for 10X linked reads (the same application I had in mind), and obtained fragments with an average size of ~150–200 kb. The resulting 10X assembly was quite spectacular (N50 = 3.69 Mbp for a 3.21 Gbp genome) and was recently published.

Comparing aligners

When analyzing genomic data, we first need to align the reads to the genome. There are a lot of possible choices for this, including BWA (a middle ground), stampy (very accurate) and bowtie2 (very fast). Recently a new aligner came out, NextGenMap. It claims to be faster than other methods and to deal better with divergent read data.

The limits of GBS sample size

I’ve been doing work on a stickleback GBS dataset, and we’re trying to figure out how many samples we can cram into a single lane of Illumina. I did some analyses which people may find useful. It’s unclear how applicable the recommendations are for sunflower, which seems to have more problems than fish.

Take-home message: with 25% less data you lose around 15% of the genotype calls, but up to 50% of the SNPs if you use a stringent coverage filter, because of how the lost data is distributed among loci and individuals.
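To make that asymmetry concrete, here is a toy C simulation. Every parameter in it (individuals, loci, coverage distribution, filter thresholds) is invented for illustration and is not taken from the stickleback analysis; the point is only qualitative: when coverage is uneven among loci and individuals, losing 25% of the reads removes far more loci from a stringent coverage filter than it removes genotype calls.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Toy simulation only: all parameters below are assumptions, not estimates
 * from real data. A genotype is "called" with >= CALL_DP reads; a locus
 * survives the stringent filter if >= MIN_FRAC of individuals have
 * >= STRICT_DP reads. */
#define N_IND     48      /* assumed individuals per lane            */
#define N_LOCI    2000    /* assumed GBS loci                        */
#define BASE_DP   8.0     /* assumed mean coverage per genotype      */
#define CALL_DP   2       /* reads needed to call a genotype         */
#define STRICT_DP 6       /* reads needed under the stringent filter */
#define MIN_FRAC  0.75    /* fraction of individuals that must pass  */
#define KEEP      0.75    /* 25% of the data is lost                 */

static double runif(void) { return (rand() + 1.0) / ((double)RAND_MAX + 2.0); }
static double rexp1(void) { return -log(runif()); }            /* mean 1 */

static int rpois(double lambda)                                 /* Knuth's method */
{
    double L = exp(-lambda), p = 1.0;
    int k = 0;
    do { k++; p *= runif(); } while (p > L);
    return k - 1;
}

static int thin(int reads)            /* each read survives with probability KEEP */
{
    int kept = 0;
    for (int i = 0; i < reads; i++) kept += (runif() < KEEP);
    return kept;
}

int main(void)
{
    double ind_eff[N_IND];
    long calls_full = 0, calls_less = 0, loci_full = 0, loci_less = 0;

    srand(1);
    for (int i = 0; i < N_IND; i++) ind_eff[i] = 0.5 + runif();  /* uneven yield per barcode */

    for (int l = 0; l < N_LOCI; l++) {
        double locus_eff = rexp1();              /* uneven coverage among loci */
        int pass_full = 0, pass_less = 0;
        for (int i = 0; i < N_IND; i++) {
            int full = rpois(BASE_DP * locus_eff * ind_eff[i]);
            int less = thin(full);
            calls_full += (full >= CALL_DP);
            calls_less += (less >= CALL_DP);
            pass_full  += (full >= STRICT_DP);
            pass_less  += (less >= STRICT_DP);
        }
        loci_full += (pass_full >= MIN_FRAC * N_IND);
        loci_less += (pass_less >= MIN_FRAC * N_IND);
    }
    printf("genotype calls kept after losing 25%% of reads: %.1f%%\n",
           100.0 * calls_less / calls_full);
    printf("stringently filtered loci kept:                 %.1f%%\n",
           100.0 * loci_less / loci_full);
    return 0;
}

Compiling it with gcc and the math library (gcc sim.c -lm, where sim.c is just a placeholder name) and running it should show the stringent-filter loci dropping off much faster than the raw genotype calls.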

Sequencing Data Organization Update

I’ve created a skeleton directory structure and included a few example folders so that everyone can get a better idea of how our data will be organized on the new server. These are not set in stone. A few people have commented on the blog, or in lab meeting, or to me in person, and I’ve taken all of your suggestions into account.

If you feel like the setup here isn’t optimal, please give some feedback. The better we do this now, the more smoothly things will run in the future!

Sequencing Data Curation Part 1

With our new data server (Moonrise) up and ready to store our sequences, it is time to start being more careful about where and when we move our most important data, and how we keep track of it. I’ve devised a system for storing our data for those of you who will be accessing it via the servers. Only Chris, Sariel, Frances and I will have write access to the directories where data is stored. If you would like your data stored, you will have to fill out a form I’ve created, which gives us all the information we need to store it in the right place. Here is the form.

This inserts a little bureaucracy into our system, and it’s going to be a pain, but in the long run it will make things much easier. We currently have data that was very difficult to find because its owner is no longer in the lab. With a system like the new one, that will not happen.

We will store our WGS, RNASeq, and GBS data in separate folders. This will make finding your data easier in most cases.

Here are the directory structures for the three types of data:

WGS -> Species -> Cultivar -> Library -> Experiment -> file_name_including_library_size_and_depth_coverage.fq METADATA.txt

RNASeq -> Experiment (if unnecessary, the following directories can be omitted) -> Species -> Cultivar/Population -> Library -> file_name_including_library_size_and_depth_coverage.fq METADATA.txt

GBS is a little more complex, and we will separate things in two ways.
GBS -> Cut Site/Enzyme (data with different cut sites might be incompatible) -> Data type (population genetics, mapping data) -> From here on things diverge
Pop -> Group* -> files_with_descriptive_names.fq METADATA.txt
Map -> Experiment -> Species -> files_with_descriptive_names.fq METADATA.txt
*groups are based loosely on clades and on how much data we have for each species (annuus_wild, annuus_cult, argophyllus, bolanderi_exilus, petiolaris_clade, hybrids, perennials, tuberosus_cult)

Generally, file names should include data that is not encoded in the directory structure but is important enough to be seen when perusing the data. Things like depth of coverage, library size, etc. seem appropriate for all three data types, but for types with which I’m not as familiar (GBS), suggestions would be appreciated.
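As a purely hypothetical example of how the WGS convention above composes into a full path, here is a short C sketch; the species, cultivar, library and experiment names are invented placeholders, not real entries on Moonrise.

#include <stdio.h>

/* Hypothetical example only: compose a path following the WGS convention
 * (Species -> Cultivar -> Library -> Experiment -> descriptive file name).
 * The METADATA.txt file would sit next to the .fq file in the same
 * Experiment directory. */
int main(void)
{
    const char *species    = "annuus";        /* placeholder */
    const char *cultivar   = "HA412";         /* placeholder */
    const char *library    = "lib01";         /* placeholder */
    const char *experiment = "resequencing";  /* placeholder */
    const char *insert     = "500bp";         /* library insert size */
    const char *depth      = "30x";           /* depth of coverage   */

    char path[512];
    snprintf(path, sizeof(path), "WGS/%s/%s/%s/%s/%s_%s_%s_%s.fq",
             species, cultivar, library, experiment,
             species, cultivar, insert, depth);
    printf("%s\n", path);  /* WGS/annuus/HA412/lib01/resequencing/annuus_HA412_500bp_30x.fq */
    return 0;
}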

SnoWhite Tips and Troubleshooting (Thuy)

SnoWhite is a tool for cleaning 454 and Illumina reads. There are quite a few gotchas that can take you half a day to debug. This wiki has a lot of good tips.

SnoWhite invokes other bioinformatics programs, one of them being TagDust. If you get a segfault from TagDust, it may be because you are searching for contaminant sequences larger than TagDust can handle. By default, TagDust can handle at most 1000 characters per line in the contaminant fasta file and contaminant sequences of at most 1000 bases.
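If you want to check whether your contaminant file actually exceeds those limits (and what values to use when patching, below), here is a small standalone C sketch. It is not part of SnoWhite or TagDust, and UniVec.fasta is only an example default file name.

#include <stdio.h>
#include <string.h>

/* Standalone sketch: report the longest line and the longest sequence in a
 * fasta file, to compare against TagDust's 1000-character defaults. */
int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "UniVec.fasta";  /* example name */
    FILE *fp = fopen(path, "r");
    if (!fp) { perror(path); return 1; }

    static char buf[1 << 20];                  /* generous buffer for one line */
    size_t longest_line = 0, longest_seq = 0, current_seq = 0;

    while (fgets(buf, sizeof(buf), fp)) {
        size_t len = strcspn(buf, "\r\n");     /* line length without newline */
        if (len > longest_line) longest_line = len;
        if (buf[0] == '>') {                   /* header starts a new record */
            if (current_seq > longest_seq) longest_seq = current_seq;
            current_seq = 0;
        } else {
            current_seq += len;                /* sequences may span many lines */
        }
    }
    if (current_seq > longest_seq) longest_seq = current_seq;
    fclose(fp);

    printf("longest line: %zu characters\n", longest_line);
    printf("longest sequence: %zu bases\n", longest_seq);
    return 0;
}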

A segfault (or segmentation fault) happens when a program accesses memory it shouldn’t. Once TagDust hits the 1000-character line limit or the 1000-base sequence limit, it keeps writing past the 1000 slots it has allocated, which can mean touching non-existent or off-limits memory locations. You need to edit the TagDust source code so it allocates enough memory for your sequences and does not wander into bad memory.
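For anyone curious what that looks like in code, here is a minimal sketch of the same pattern. It is not TagDust’s actual code, just an illustration of how copying a record longer than a fixed-size array writes past the end of the allocation.

#include <stdio.h>
#include <string.h>

#define MAX_LINE 1000                  /* same default size TagDust uses */

/* Illustration only (not TagDust's code): writing a record longer than the
 * buffer runs past the end of the array, which is undefined behaviour and
 * typically ends in a segmentation fault. */
int main(void)
{
    char line[MAX_LINE];
    char long_record[5000];

    memset(long_record, 'A', sizeof(long_record) - 1);  /* a 4999-character "sequence" */
    long_record[sizeof(long_record) - 1] = '\0';

    /* Unsafe: strcpy() would keep writing far past line[MAX_LINE - 1].   */
    /* strcpy(line, long_record);                                         */

    /* Safe alternatives: make the array big enough for the data (which is
     * what the patch below does), or truncate the copy to the space
     * actually available. */
    snprintf(line, sizeof(line), "%s", long_record);
    printf("kept %zu of %zu characters\n", strlen(line), strlen(long_record));
    return 0;
}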

  • Go into your TagDust source code directory and edit file “input.c”.
  • Go to line 68:

char line[MAX_LINE];

  • Change MAX_LINE to a number larger than the number of characters in the longest line of your contaminant fasta file. You can probably skip this step if you are using the NCBI UniVec.fasta file, since the default of 1000 is enough.
  • Go to line 69:

char tmp_seq[MAX_LINE];

  • Change MAX_LINE to a number larger than the number of bases in the longest contaminant sequence in your contaminant fasta file. I tried 1000000 with a recent NCBI UniVec.fasta file and it worked for me.
  • Recompile your TagDust source code
    • Delete all the existing executables by executing make clean in the same directory as the Makefile
    • Compile all your files again by executing make in the same directory as the Makefile
    • If you decided to allocate a lot of memory to your arrays, and your program requires > 2GB of memory at compile time, you may run into “relocation truncated to fit: R_X86_64_PC32 against symbol” errors during linking. This occurs when the compiler is unable to allocate enough space for the program’s statically allocated objects. Edit the Makefile so that

CC = gcc
becomes
CC = gcc -mcmodel=medium