Where does all the GBS data go?

Why do we get seemingly few SNPs with GBS data?

Methods: I used bash to count the occurrences of each tag. Check your data and let me know what you find:

cat Demultiplexed.fastq | sed -n '2~4p' | cut -c4-70 | sort | uniq -c | grep -v N | sed 's/\s\s*/ /g' > TagCounts

Terminology: I have tried to use ‘tags’ to refer to particular sequences and ‘reads’ for occurrences of those sequences. So a tag may be found 20 times in a sample (20 reads of that sequence).

Findings: Tag repeat distribution

Probably the key thing, and a big issue with GBS, is the number of times each tag gets sequenced. Most tags get sequenced only once, but some tags get sequenced many, many times. It is hard to see, but here are histograms of the number of times each tag occurs for a random sample (my pet11): all tags histogram (note how long the x axis is), 50 tags or fewer histogram. Most tags occur once – that is the spike at 1. The tail is long and important. Although only one tag occurs 1,088 times, it ‘takes up’ 1,088 reads.
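
To reproduce the histograms for your own sample, here is a minimal R sketch. It assumes TagCounts is the two-column output of the one-liner above (a count followed by the tag sequence on each line).

# Read the "count tag" table produced by uniq -c
tag_counts <- read.table("TagCounts", col.names = c("count", "tag"),
                         stringsAsFactors = FALSE)

# How many times was each tag sequenced?
hist(tag_counts$count, breaks = 100, main = "All tags", xlab = "Reads per tag")

# Zoom in on tags with 50 or fewer reads
hist(tag_counts$count[tag_counts$count <= 50], breaks = 50,
     main = "Tags with 50 or fewer reads", xlab = "Reads per tag")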

How does this add up?

In this sample there are 3,453,575 reads. These reads correspond to 376,036 different tag sequences, which would (ideally) mean ~10x depth for each tag. This is not the case. A mere 718 tags occur 1,000 or more times, but they account for 1,394,905 reads. That is 40% of the reads going to just over 700 (potential) sites. I have not summarized more samples using the same method, but I am sure it would yield the same result.
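
The same table can be used to check how many reads the deep tags eat up. A minimal R sketch, again assuming the two-column TagCounts file:

tag_counts <- read.table("TagCounts", col.names = c("count", "tag"),
                         stringsAsFactors = FALSE)

deep <- tag_counts$count >= 1000
sum(deep)                                           # tags sequenced 1,000+ times
sum(tag_counts$count[deep])                         # reads consumed by those tags
sum(tag_counts$count[deep]) / sum(tag_counts$count) # fraction of all reads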

Here is an example: Random Deep Tag. Looking at this you can see that the problem is worse than just re-sequencing the same tag many times: you also introduce a large number of tags that are sequencing and/or PCR errors and occur only once or twice (I cut off many more occurrences here).

Conclusion: Poor/incomplete digestion -> few sites get adapters -> those sites get amplified like crazy and then sequenced.

Update 1:

Of the >3.4 million tags in the set of 18 samples I am playing with, only 8,123 are present in 10 or more of the samples.

For those tags scored in 10 or more samples, the number of times a tag is sequenced is correlated between samples. The same tags are being sequenced repeatedly in the different samples.

Update 2:

As requested, here is the ‘connectivity’ between tags: the number of 1 bp mismatches each tag has with other tags. To be included in this analysis a tag must occur at least 5 times in 3 individuals. Here is the figure. Most tags have zero matches, a smaller number have one, and so on. This actually looks pretty good – at least the way I am thinking about it now. It could mean that the filtering criteria work. If errors were still being included, I would expect tags with one match (the actual sequence) to occur more often than ones with zero.
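
For reference, a minimal R sketch of the connectivity count: for each tag, how many other tags differ from it at exactly one position. It assumes tags is a character vector of equal-length sequences that passed the occurrence filter (the naive pairwise comparison is fine for a few thousand tags, slow beyond that).

one_mismatch_neighbours <- function(tags) {
  mat <- do.call(rbind, strsplit(tags, ""))  # one row per tag, one column per base
  n <- nrow(mat)
  neighbours <- integer(n)
  for (i in seq_len(n)) {
    # Hamming distance of tag i to every tag, counted column by column
    diffs <- rowSums(mat != matrix(mat[i, ], n, ncol(mat), byrow = TRUE))
    neighbours[i] <- sum(diffs == 1)
  }
  neighbours
}

# table(one_mismatch_neighbours(tags)) gives the distribution plotted in the figure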

Quick phylogenetic trees colored by trait in R

Basically, here is some easy code to search publicly available databases to determine a trait value for a species (in this case whether or not the species was recorded as invasive in any of 5 global invasive species databases), then make a quick tree based on published phylogenies using Phylomatic (http://phylodiversity.net/phylomatic/), then color code the tree based on the trait value.

Subset of species from Asteraceae and whether they have been reported as invasive ('weedy') in any of 5 global invasive species databases.

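The actual code is behind the ‘Continue reading’ link; as a flavour of the last step, here is a minimal R sketch of colouring the Phylomatic tree by the trait. It assumes two hypothetical files: asteraceae_phylomatic.new, the Newick tree returned by Phylomatic, and asteraceae_weedy.csv, a table with columns species (matching the tip labels) and weedy (TRUE/FALSE).

library(ape)

tree   <- read.tree("asteraceae_phylomatic.new")
traits <- read.csv("asteraceae_weedy.csv", stringsAsFactors = FALSE)

# Put the trait values in the same order as the tip labels
weedy <- traits$weedy[match(tree$tip.label, traits$species)]

# Colour the tips: red = reported invasive, black = not reported
plot(tree, tip.color = ifelse(weedy, "red", "black"), cex = 0.6)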

Continue reading

96-well plates CTAB DNA extraction

When I was working with Arabidopsis, 96-well CTAB DNA extraction was my best friend, and I spent many days extracting away tens of thousands of samples. Good times.

DNA extraction is much less pleasant in sunflower, but since I was reasonably happy with the results of single-tube 3% CTAB DNA extractions, I thought I would try to scale it up to a 96-well plate format. Results of earlier attempts, with the participation of Brook and Cris, ranged from inconsistent to disastrous. Things hardly improved when I tried again after coming back from Utah with a few hundred dried samples. Since the prospect of extracting them all one by one didn’t sound very attractive, though, I put some more effort into improving the protocol, and now it works quite nicely.

Continue reading

Sunflower seed sterilization

Hi all,

Here is a seed sterilization protocol that surprised me – no fungus visible after 4-5 days of growth on nutritious media.

In 100 mL distilled water, mix the following:

1 g Sparkleen soap powder (dishwashing detergent) – this is the top end for Sparkleen; you may need to use less.
2 mL bleach (final concentration = 2%)
2 mL PPM (final concentration = 2%)

Sterilize in 15 mL tubes.

Rinse seeds with autoclaved water 3x.

Results: Normally the first image would contain many fungal blooms.  Not so in the images below.

Friday November 21st: [photo WP_20131122_15_51_06_Pro]

Monday November 26th: [photo WP_20131125_001]

Zoology email forwarding

Hi All,
Here is a link to get your @zoology email to forward to another email (gmail).
https://sun.zoology.ubc.ca:442/cgi-bin/admin/forward.cgi

Use settings (the gear thing in the top right corner) -> accounts -> make default to get gmail to send as your @zoology.

Also, you can use “biodiversity” as an alias for your zoology account. For example:

bob@zoology.ubc.ca is functionally the same as bob@biodiversity.ubc.ca

@botany accounts are completely different.


Plant DNAzol

Hi all,

Here’s another method for DNA extraction. The blog is stuffed to the gills with DNA extraction methods. The current standards are the ‘Qiagen-like’ column-less method – used to generate the DNA for the genome sequencing project – and CTAB. I add this protocol because it is easy and it is effective.

I extracted from Ha89 and harvested ~ 115 ng/uL from 20 cm tall plants.  I also purposely ‘took it slow,’ letting tissue thaw after freezing to see PlantDNAzol’s efficacy.  It’s efficient.

[Gel image: PlantDNAzol-uncut]

Quality is good

The gel to the left:

The rightmost lane contains Ha89 genomic DNA – 10 uL loaded out of 70 uL at 115 ng/uL. The 260/280 ratio was 1.78; the 260/230 ratio was ~1.00 (I suspect an additional ethanol wash would have removed residual guanidine from the DNA).

Is the DNA useful?  Can downstream reactions proceed?

Yes. I digested the DNA with a methylation-sensitive restriction enzyme, PstI, and a methylation-insensitive enzyme, EcoRV.

[Gel image: PlantDNAzolcut-22hrs]

The gel to the left:

Leftmost lane: ladder

A vector digested with PstI

Ha89 gDNA digested with PstI, 240 minutes

Ha89 gDNA digested with EcoRV, 240 minutes

The DNA digests.  But does it contain contaminants that upset the enzymes over long incubations?  Overnight?

[Gel image: PlantDNAzolcut-240minutes]

Gel above, from left: ladder, unrelated vector digest,

Ha89 gDNA digested with PstI, 22 hours

Ha89 gDNA digested with EcoRV, 22 hours

Protocol for DNAzol extraction (exactly as published by LifeTech but easier to follow – http://tools.lifetechnologies.com/content/sfs/manuals/10978.pdf):

Have these items on hand:

1. 0.6 mL DNAzol per 100 mg sample

2. 0.3 mL chloroform per 100 mg sample

3. Timer

4. 100% ethanol (0.225 mL per sample)

5. 75% ethanol (0.3 mL per sample)

Handle all inversions carefully. When a step says invert or shake, handle your samples gently.

1. Mix 100 mg ground tissue with 0.3 mL PlantDNAzol – 100 mg is the maximum. Overdoing it will hurt your yield.

2. Invert gently to aid in lysis and dispersion

3. Once completely dispersed incubate at RT, 5 min, shake periodically.

4. Add 0.3 mL chloroform and mix.

5. Once completely dispersed incubate at RT, 5 min, shake periodically.

6. Centrifuge at RT, 12 000 g (NOT rpm) 10 min

7. Harvest the supernatant – you’ll see a phenol/chloroform-style triple layer. The middle layer will be pulpy, containing your cellulosic debris and proteins. Don’t collect the middle layer. Less is more.

8. Mix supernatant from 7 with 225 uL 100% ETOH.

9. Incubate at RT, 5 min

10. Centrifuge the mixture at 5000 g for 4 min – start preparing for step 11 in the meantime.

11. Make a PlantDNAzol – Ethanol mixture:  For one sample mix 0.3 mL PlantDNAzol with 0.225 mL 100% ethanol.

12. Discard the supernatant from step 10 and mix the pellet with 0.3 mL of the mixture prepared in step 11.

13. Incubate as in step 9.

14. Spin as in step 10.

15. Pour off supernatant

16. Wash the pellet with 75% ethanol – 0.3 mL. This step can be repeated if your 260/230 isn’t adequate; guanidine absorbs strongly at 230 nm.

EDIT: repeat step 16 for a total of 2 washes.

17. Spin as in step 10.

18. Remove supernatant – if your samples are green repeat step 16.

19. Resolubilize your DNA in TE or NaOH.  Make sure to run your bead of TE over the wall of the tube to collect your DNA.


EDIT – Less is more, as is usually the case with DNA extraction. I harvested from 38, 60, and 100 mg of tissue. A sweet spot for tissue quantity is 45 to 60 mg for the given amount of Plant DNAzol.

Qubit quantification: 45 mg of tissue yielded 264 ng/uL; 60 mg, 305 ng/uL; 100 mg, 12.4 ng/uL.

Normalize/quantify your WGS libraries

Regardless of what method you use to make your Illumina libraries, if you added barcodes or indices you will need to normalize them before pooling (or you will probably end up with very uneven coverage). Or, you might want to know the exact molarity of your library before sending it for sequencing (although the need for that in our case is debatable, see later). The most accurate way to do both is probably by qPCR.

Continue reading
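
The qPCR details are behind the link, but for reference, here is the standard conversion from a measured concentration to library molarity, as a minimal R sketch. It assumes you know the average fragment length (e.g. from a Bioanalyzer trace); the numbers in the example call are made up.

library_molarity_nM <- function(conc_ng_per_ul, mean_fragment_bp) {
  # double-stranded DNA averages ~660 g/mol per base pair
  (conc_ng_per_ul * 1e6) / (660 * mean_fragment_bp)
}

library_molarity_nM(conc_ng_per_ul = 2.5, mean_fragment_bp = 400)  # ~9.5 nM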

Turning STACKS output into IMa2 input files

This script extracts sequence haplotypes from the “alleles.tsv” files generated by STACKS and does some light filtering (you may want to add more). It’s very similar to the one I used for our 2013 Molecular Ecology paper, and still has some Great Sand Dunes-specific parameter names, but should work OK for other data sets. Oh, and I was using the “pstacks” reference-guided workflow in a slightly older version of STACKS, in case that matters.

extract_haplotype_sequences_v4_annotated.r

example_alleles.tsv

Please let me know if you use this script and whether it needs tweaking.
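
This is not the attached script, just a minimal R sketch of the kind of light filtering it performs. The column names and thresholds are assumptions – check the alleles.tsv produced by your STACKS version and adjust.

alleles <- read.table("example_alleles.tsv", sep = "\t", stringsAsFactors = FALSE,
                      col.names = c("sql_id", "sample_id", "locus_id",
                                    "haplotype", "percent", "count"))

# Hypothetical filter: drop haplotypes supported by very few reads
min_reads <- 5
alleles_kept <- alleles[alleles$count >= min_reads, ]

# Another made-up criterion: keep only loci seen in at least two samples
loci_per_sample <- unique(alleles_kept[, c("locus_id", "sample_id")])
shared_loci <- names(which(table(loci_per_sample$locus_id) >= 2))
alleles_kept <- alleles_kept[as.character(alleles_kept$locus_id) %in% shared_loci, ]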

(Probably the closest you can get to) Home-brew Illumina WGS libraries

As some of you might know, I have been working for the last few months on optimizing a protocol for Illumina WGS libraries that will reduce our dependency on expensive kits without sacrificing quality. The ultimate goal would be to use WGS libraries as a more expensive but hopefully more informative alternative to GBS for genotyping. Getting to that point ideally requires developing:

1) A cheaper alternative for library preparation (this post)

2) A reliable multiplexing system (this other post)

3) A way to shrink the sunflower genome before sequencing it (because, as you know, it’s rather huge) (yet another post)

The following protocol is for non-multiplexed libraries. The protocol for multiplexed ones is actually identical, you just need to change adapters and PCR primers – more about that in the multiplexing post.

If you are planning to pool libraries and deplete them of repetitive elements, read carefully all three posts before starting your libraries (mostly because you might need to use different adapters and PCR primers)

Continue reading

Introducing Phoebanthus, Sibling to Sunflower

Before I left UBC to head to California, Rose and I got interested in looking closely at the nearest relatives of Sunflower. In particular, Rose was looking to obtain an “outgroup” for her analyses of cpDNA phylogeny in the sunflowers. We found out that the sister genus to sunflower is a little plant called Phoebanthus, which consists of just two species, one diploid and one tetraploid. Both of them are perennials that are endemic to Florida.

[Photo: phoebanthus-ripest-014]

Phoebanthus grandiflorus

After learning about the plant, I started looking for a way to get some samples. I learned very quickly that it’s basically not cultivated at all. So I contacted several naturalists from Florida who live near Phoebanthus country, and one of them (a gentleman named Wayne Matchett) volunteered to get us some tissue and seeds for the more common tetraploid species, Phoebanthus grandiflorus. It took a while for Wayne to locate a flowering population, and then to wait for the seed heads to mature (at my recommendation, he “bagged” the heads), but he finally managed to secure about 100 mature seeds, along with a sample of leaf tissue, both of which I am now in possession of.

[Photo: seeds-005]

Some of the seeds that Wayne got

I guess that I became obsessed with this plant because it is such an underdog compared to Helianthus. While Helianthus is a weedy, widespread, diverse, and dominant genus that has more or less conquered North America as well as the human race, the sibling genus to sunflower amounts to just two species, both of them found in what is probably the cushiest, least stringent environment in all of North America: Florida (sorry, Chris).

In any case, I’m going to apply for a permit to bring the seeds and tissue to UBC when I visit in a month or so. I might also keep some here in California and try growing it to see how it performs. In addition to providing a nice outgroup for phylogenetic analyses, it might be cool to do other comparisons between the vivacious head-turner that is Helianthus and its runty little sister genus, Phoebanthus.

In the meantime, here are some of Wayne’s photos:

[Photos: phoebanthus-closeup, phoebanthus-close-010, phoebanthus-seedhead-030, phoebanthus-wide-037]

Tissue desiccation with table salt

There is a new paper by Elena Carrió and Josep A. Roselló, published online early in Molecular Ecology Resources, suggesting that salt desiccation of leaves dehydrates tissue and prevents decay at levels similar to silica gel, with similar PCR results.

Large-grain silica is probably still the best option, but this would come in really handy if you come across something interesting that you want to collect but don’t happen to have silica gel with you.

Here is the main figure (link to the paper below):

Thanks to Maggie Wagner from the TMO lab who found this paper and sent it around. Here is a link to the full article:

http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12170/full

DOI: 10.1111/1755-0998.12170

Sequencing Data Organization Update


I’ve created a skeleton directory structure and included a few example folders so that everyone can get a better idea of how our data will be organized on the new server. These are not set in stone. A few people have commented on the blog, in lab meeting, or to me in person, and I’ve taken all of your suggestions into account.

If you feel like the setup here isn’t optimal, please give some feedback. The better we do this now, the more smoothly things will run in the future!

Sequencing Data Curation Part 1

With our new data server (Moonrise) up and ready to store our sequences, it is time to start being more careful about where and when we move our most important data, and how we keep track of it. I’ve devised a system for storing our data for those of you who will be accessing it via the servers. Only Chris, Sariel, Frances, and I will have write access to the directories where data is stored. If you would like your data stored, you will have to fill out a form I’ve created, which will give us all the information we need to store it in the right place. Here is the form.

This is inserting a little bureaucracy into our system, and it’s going to be a pain, but in the long run it will make things much easier. We currently have data that was very difficult to find because the owner is no longer in the lab. With a system like the new one, that will not happen.

We will store our WGS, RNASeq, and GBS data in separate folders. This will make finding your data easier in most cases.

Here are the directory structures for the three types of data:

WGS -> Species -> Cultivar -> Library -> Experiment -> file_name_including_library_size_and_depth_coverage.fq METADATA.txt

RNASeq -> Experiment  (if unnecessary, the following directories can be omitted) -> Species -> Cultivar/Population -> Library -> file_name_including_library_size_and_depth_coverage.fq METADATA.txt

GBS is a little more complex, and we will separate things in two ways.
GBS -> Cut site/Enzyme (data with different cut sites might be incompatible) -> Data type (population genetics or mapping data) -> from here on things diverge:
Pop -> Group* -> files_with_descriptive_names.fq METADATA.txt
Map -> Experiment -> Species -> files_with_descriptive_names.fq METADATA.txt
*Groups are based loosely on clades and on how much data we have for each species (annuus_wild, annuus_cult, argophyllus, bolanderi_exilus, petiolaris_clade, hybrids, perennials, tuberosus_cult)

Generally, file names should include information not encoded in the directory structure but important enough to be seen when perusing the data. Things like depth of coverage, library size, etc. seem appropriate for all three data types, but for types with which I’m not as familiar (GBS), suggestions would be appreciated.

PstI digest tests

Edit: I forgot to thank DanB and Kate for the generous donation of DNA.  Sorry guys 🙁

Edit 2: After the lab meeting some questions have been asked.

To summarize :

1. All the enzymes work comparably well. There is a slight performance decrease if one uses PstI-HF.

2. I’d recommend using the PstI from Invitrogen – it works extremely well and is the cheapest of the PstI enzymes tested ($22 versus $75). EDIT: Brook and Kate let me know that I failed to factor in the units provided for a given dollar value. Invitrogen still provides the cheapest enzyme.

For NEB – With our volume purchases we get 10000U for $71.40.

For Invitrogen – With our volume purchases we get 10000U for $63.90 {represents price negotiated with Helen, accounts manager at Invitrogen Stores}

Thanks guys for pointing that out!

This adds up considering how much we’re going through.

3. Sunflower gDNA does not digest completely in 3 hours.  It is recommended to go overnight or 18 hours with your digestion.

4. After 18 hours do not be alarmed by incomplete digestion.  This is OK according to RFLP work performed by Loren.  There should be a sufficient number of fragments for GBS libraries.

Hi all,

Here are the results from the PstI digest tests:

All enzymes tested fail to fully digest sunflower gDNA.

Enzymes from Invitrogen, Thermo Fisher/Fermentas, NEB (non-HF), and NEB HF were tested. All performed equivalently. I would say the NEB HF was slightly less processive after 70 min. See the attached PDF for gels and full documentation of reaction conditions.

Attachment: 2013-Oct-08-debono-digesttest (PDF)

Thanks

Allan

GBS, coverage and heterozygosity. Part 3

I’ve recalled the SNPs for my Helianthus exilis population genetics for a third time. This time I’m using GATK, with reads aligned to the Nov22k22 reference using BWA.

This is two plates of GBS (96-plex each) plus 34 H. annuus samples from G. Baute (also 96-plex GBS). Three exilis samples were removed because they had few or no reads. Reads were trimmed for adapters and quality using Trimmomatic (and the numbers of reads kept after trimming are used here).

[Figure: Exil.GATK.file.het]

So, there is a relationship between the number of reads and the resulting heterozygosity. It makes some sense, because you need more reads to call a heterozygote than a homozygote. It’s not as bad as what was happening when we were calling SNPs using the maximum-likelihood script. In a linear model, the number of reads accounts for 16% of the variation in heterozygosity.
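
A minimal R sketch of that model, assuming a data frame samples with hypothetical columns reads (post-trimming read count) and het (proportion of heterozygous calls per sample):

fit <- lm(het ~ reads, data = samples)
summary(fit)$r.squared  # proportion of variance in heterozygosity explained (~0.16 here)

plot(samples$reads, samples$het, xlab = "Reads per sample", ylab = "Heterozygosity")
abline(fit)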

If you compare this to my previous posts, the heterozygosity is vastly lower for all samples. That is because I previously looked only at variable sites with a certain amount of coverage, whereas here all sites are included, and the majority of those sites are invariant.


NEB HF enzymes – regarding library preps

Hi guys,

Just a note about PstI-HF:

The PstI-HF sold by NEB is less processive than the old NEB PstI/red-stripe version (see it here: https://www.neb.com/products/r0140-psti). When digesting ultrapure, high-concentration, homogeneous DNA (plasmids), digestions fail to go to completion even when left overnight (>16 hours). What does this mean for you? It means that your genomic DNA, which is a more difficult substrate, will not digest fully unless left to go overnight. If you want to ensure a good digest, use a positive control and switch to the linked product.


How my PhD was saved by DTT, or: how to get moderate quantities of clean, unsheared DNA from tricky tissues

I have been trying for months to find a high-throughput solution to DNA extraction from lyophilized (freeze-dried) leaf discs of Helianthus argophyllus, a species known for its intransigent polyphenolics and low DNA yields. And I appear to have found one, thanks to Horne et al. 2004 (here) and Dow Chemical!

Continue reading

Photographs of Helianthus annuus

Over the weekend, I took some photos of the H. annuus that are growing out at Totem Park, including some of Emily, Brook, and Greg’s plants. They are just such photogenic plants, I could not resist. I uploaded a complete set of hi-res .jpg files to the blog server, and they are presented above in the gallery. These are massive, but they are .jpg, so that means they are lossy. If you want to use these for printing, let me know and I can send you the raw files, from which you can make .tif files that are as good as film negatives.

If anyone else has plants flowering that they would like pictures of, let me know! I’d love to get a nice set together for the lab.