Where does all the GBS data go?

Why do we get seemingly few SNPs with GBS data?

Methods: Used bash to count the occurrences of tags. Check your data and let me know what you find:

cat Demutliplexed.fastq | sed -n '2~4p' | cut -c4-70 | sort | uniq -c | grep -v N | sed 's/\s\s*/ /g' > TagCounts

Terminology: I have tried to use tags to refer to a particular sequence and reads as occurrences of those sequences. So a tag may be found 20 times in a sample (20 reads of that sequence).

Findings: Tag repeat distribution

Probably the key thing and a big issue with GBS is the number of times each tag gets sequence. Most sites get sequenced once but most tags get sequence many many times. It is hard to see but here are histograms of the number of times each tags occur for a random sample (my pet11): All tags histogram (note how long the x axis is), 50 tags or less Histogram. Most tags occur once – that is the spike at 1. The tail is long and important. Although only one tag occurs 1088 times it ‘takes up’ 1088 reads.

How does this add up?

In this sample there are 3,453,575 reads. These reads correspond to 376,036 different tag sequences. This would mean (ideally) ~10x depth of each of the tags. This is not the case. There are a mere 718 tags which occur 1000 or more times but they account for 1394905 reads. That is 40% of the reads going to just over 700 (potential) sites. I’ve not looked summarized more samples using the same method but I am sure it would to yield the same result.

Here is an example: Random Deep Tag. Looking at this you can see that the problem is worse than just re-sequencing the same tag many times but you introduce a large number of tags that are sequencing and/or PCR errors that occur once or twice (I cut off many more occurances here).

Conclusion: Poor/incomplete digestion -> few sites get adapters -> they get amplified like crazy and then sequenced.

Update 1:

Of the > 3.4Million tags that are in the set of 18 samples I am playing with only 8123 are in 10 or more of the samples.

For those sites with >10 scored the number of times a tag is sequenced is correlated between samples. The same tags are being sequenced repeatedly in the different samples.

Update 2:

As requested the ‘connectivity’ between tags. This is the number of 1bp miss matches each tag has. To be included in this analysis a tag must occur at least 5 times in 3 individuals. Here is the figure. So most tags have zero matches a smaller number have one and so on. This actually looks pretty good – at least how I am thinking about it now. This could mean that the filtering criteria works. If errors were still being included I would expect tags with one match (the actual sequence) to occur more than ones with zero.

11 thoughts on “Where does all the GBS data go?

  1. Hi Greg, thanks for the post. I think this would be a good place to start a lab-wide (or GBS-users-wide) discussion on what might be the problem with the current GBS protocol, and on how to improve it (if possible) .

    I agree that poor digestion would result in few fragments with high coverage. I am just thinking aloud here, but would you expect in that case to have all those one-read-tags? They must have both adapters to get sequenced, so why aren’t they amplified? Do you know if there is something special about the many-reads-tags? Are they more or less common between different libraries or are they just a random subset of the reads?

    If the problem is specific to PstI changing the enzyme might solve it, and that’s certainly worth trying. I just have a hard time figuring out a reason why PstI cutting would be specifically affected in sunflower DNA. According to my personal experience, PstI is in fact as reliable as it gets for a restriction enzyme (it is actually one of my two fave restriction enzymes ♡), and if I understood correctly it has been historically used on sunflower genomic DNA for other applications with success. Non enzyme-specific factors (contaminants in the DNA, poor ligation), would results in a similarly reduced pool of fragments. Just saying that we should keep in mind other possible explanations. Do you know how clean was the initial DNA?

    If the problem is PstI-specific poor digestion, a double digestion with MspI might help a bit, but probably not solve the problem.

  2. My thinking is the tags that appear once are sequencing errors of reads that occur 1000+ times. For the example tags I posted this could be the case.
    I am looking into the “many read tags” now.
    I am sure Pst is cutting it is just a question of how much. My thinking has been less is more – fewer sites to sample means less missing data. But maybe we are sampling too few sites. Based on the ‘in silico digestion’ I did Pst has fewer sites in sunflower than many other enzymes. Maybe a poor digestion with infrequent RE sites is the problem. They might not be the right distance apart very often. Some expected size distributions could be pulled out of the genome sequence pretty quick and might be informative. Yes we should discuss this more. Here is the quick and dirty RE site count I did:
    PstI CTGCAG 255507
    HpaII/MspI CCGG 3134192
    EcoRI GAATTC 650509
    EcoT22I ATGCAT 742575
    SacI GAGCTC 398494
    Clai ATCGAT 514253
    Sal l GTGAC 1524407

  3. Hey guys,
    Take a look at this paper.

    PstI digests of Wheat gDNA resemble those of sunflower. Fellers shows PstI is not appreciably different from buffer only control. I think we’re getting all the cutting PstI can do. The limited numbers of sites and spacing within the genome makes for, mostly large pieces of DNA (>20 kb) after digestion (see previous blog posts).
    DISCLAIMER: PstI is one of my go to enzymes for cloning (see HaFTx-YFP)

  4. This is super helpful!

    You can calculate the probability of sequencing error as the per base pair sequencing error rate (e.g. 10e-3 for quality score of 30) multiplied by the read length. So if you had an average quality score of 30 and 100bp read lengths, then a whopping 10% of your total reads would be off by 1bp due to sequencing error alone. So for 3,453,575 reads you would expect about 345,000 1x coverage due to sequencing error – I think this is close to what you see (1st bar at far left of histogram). Of course this is rough since the quality decreases with read length, but the later bp also seems to be where most of your single bp differences are.

    What about the second bar, where coverage is 2x. Could this be due to sequencing error alone? The probability of getting the same sequencing error twice is much smaller – it’s the probability of having two sequencing errors in two reads from the same location AND both having the same base pair substitution. Using the same example above, that’s (10e-3) x (10e-3) x P(1/3). The 1/3 I am thinking because: given that there is a mutation at the same site in two reads, there is a 1 in 3 chance that the altered base pair is the same one. I think you would multiply that by the coverage but since most sites have coverage < 10,000x it is going to be <1. So the bottom line is that most of the 1x coverage can probably be chalked up to sequencing error and eliminated off the bat. BUT the high number of low coverage (2x, 3x, etc.) must be from something else. PCR error is a good candidate: calculating PCR error rates are more complicated, but would be fun to work out in full detail. Here is a teaser: let's say you 'only' had a paltry 1,000 amplifiable fragments (i.e. locations across the genome), of 100bp in length each. That's 100,000 total base pairs. Using standard taq (~10e-5 error rate) you would expect just ~1 bp substitutions each PCR cycle. But in an 18 cycle PCR reaction the first 1 of 1000 fragments that get amplified 2e17 times! And this is the PER CYCLE error rate, An additional mutation in the 2nd cycle get amplified 2e16 times, etc. So you could end up with a lot of low (say 2x-20x) coverage due to PCR errors as well. So one question might be, what does the same distribution look like if you exclude <10x, <50x and <100x?

    • That is very interesting that the numbers come out so close. I have not filtered this for quality. I do think PCR errors are probably a big part of this. I can get those distributions for lab meeting tomorrow.

      • Hi Rob and Greg, sequencing errors and PCR artifacts seem indeed a very plausible explanation for the low coverage tags. I had assumed the reads had been “aligned” allowing some mismatches.

  5. One thing we should remember, is that if the singlets are largely sequencing errors, they aren’t all useless. If we use an alignment based SNP caller then the rest of the read is going to be providing us information and we’ll just discard the wrongly called base during SNP calling.

    What would be interesting would be the percentage of singlets with no neighbours (i.e. tags that are one bp off). Baute, do you have any insight into that from your SoUneak program?

    Additionally, I wonder how it would look with stringent read quality filtering. Presumably much of the singlets would be gone.

    Lastly, it would be good to see how our data compares to other species that use our protocol (warblers and stickleback). The Irwin lab uses a variation on our protocol with individual PCRs for each sample and the Schluter lab has done our exact protocol. If its sunflower DNA messing up our work then we would expect our distribution to be much worse. I think Kieran using stickleback DNA got more SNPs than we do but I haven’t done any real concrete comparisons.

    • Yes, a good aligner/snp caller combo should address this.
      Adding a figure about number of miss matches to the post now…
      Yes, filtering would be good. It would at least tell us about PCR vs seq errors. For this method it mean throwing away any reads with any base of a poor enough score.
      We should invite those people for tomorrows meeting.

  6. Do I understand correctly from update 1 that the more abundant tags are largely common between samples? If that is the case, than maybe it’s not so bad.

    In case of poor digestion you would expect the enzyme to cut at low efficiency at any of the possible restriction sites. These sites wouldn’t always be the same ones. So you shouldn’t necessarily see a reduction in complexity (unless the number of genome molecules in your sample is limiting), but just in abundance of each fragments. Since you have very little of each fragment, only some of the fragments will manage to get adapters and therefore be amplified. If that is the case, you should have a different subset of all the possible fragments being amplified for each sample, with little overlap.
    If that is not the case, and the abundant fragments are largely shared between samples, then my suspicion is that PstI cuts well the sites where it cuts. The reason why you don’t get many fragments is either because simply there aren’t many instances when two restriction sites are close enough to give a fragment of the right size (as also Allan pointed out), or because some of the sites are not available for some reason (this seems more far-fetched). Either way, adding MspI/HpaII to the mix should do the trick by considerably increasing the number of possible fragments.

  7. I tried the basic analysis found in this post on 2 of my samples. The results are very similar to Greg’s and are as follows:

    Sample 1:
    5,375,995 reads corresponding to 454,852 different tag sequences. There are 1013 tags that occur 1000+ times and they account for 1,909,650 (36%) of the reads.

    Sample 2:
    1,285,553 reads corresponding to 130,538 different tag sequences. There are 35 tags that occur 1000+ times and they account for 111,369 (9%) of the reads.

Comments are closed.