The good news: it looks like PstI-MspI GBS offers substantial benefits over PstI alone, even without depleting high copy fragments. Marco’s DSN protocol may improve things further, and we have a trial sequencing library running now.
I sequenced two libraries of an intraspecific F2 cross within H. argophyllus: 95 individuals (and one blank) each on one lane (85 and 79G of raw sequencing data). I compared these data to two interspecific F1 libraries (GregO’s H. bolanderi cross, which includes 132G or 2 lanes of sequencing for 96 individuals, although one lane had very poor results, and Kate’s H. neglectus cross, with 93 individuals in 1 lane or 75G). I expected the F1 interspecific populations to contain more SNPs (from fixed differences in the parent species) than my intraspecific cross, making this potentially a conservative comparison. My libraries contained a few non-F2 (e.g. parental population) individuals, so it’s possible that I should expect more SNPs due to that.
First, I used uneak to call SNPs in each library using the same filters and using both paired ends (via GregB’s “trick uneak” script). I used uneak because it is fast and de novo, so unbiased by divergence from aligning to a reference. The results:
Arg7 F2: 47995 SNPs
Arg8 F2: 48528 SNPs
Bol2 F1: 17662 SNPs
Neg1 F1: 4518 SNPs
The low number of SNPs in Kate’s library is likely an artefact of the higher divergence between H. annuus and H. neglectus. uneak can’t handle more than two alleles (but GregB’s soUneak can!), so it’s likely that a lot of good tags were thrown out for representing three alleles. This could be substantially reducing the number of called SNPs in either of the interspecific crosses, although there is much less divergence between H. annuus and H. bolanderi than with H. neglectus. Nevertheless, the difference in SNP # between GregO’s H. bolanderi F1 library and mine is substantial, and on the order predicted by the increased number of fragments from the PstI-MspI in silico digestions.
An initial observation of GregB’s was that a lot of the sequencing effort of his GBS libraries was going either to single tags (represented only once in the dataset and so likely the result of PCR or sequencing error) or to a few tags that were sequenced at high frequency. This seems to be much improved in three of these libraries. The percent of data coming from tags sequenced between 10 and 1000 times is as follows (in brackets: average # reads/tag, percent of data from tags sequenced 1X):
Arg7 F2: 69% (8.8 reads/tag, 7.9%)
Arg8 F2: 73% (9.3, 7.2%)
Bol2 F1: 59% (3.0, 3.1%) <– likely lower due to lower coverage/sample
Neg1 F1: 78% (12.0, 7.1%)
No more than 15% of the data is from tags sequenced more than 1000x in any library (mean 12%).
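For anyone who wants to run the same breakdown on their own library, here is a minimal sketch of how these percentages can be computed from a list of per-tag read counts. It is not the script I used: the function name, the toy counts, and the choice to treat the 10 and 1000 boundaries as inclusive are all assumptions made for illustration.

```python
def summarize_tag_counts(counts):
    """Summarize how sequencing effort is spread across tags.

    counts: iterable of read counts, one entry per distinct tag.
    Bin boundaries (10 and 1000 inclusive) are an assumption.
    """
    counts = list(counts)
    total_reads = sum(counts)
    if total_reads == 0:
        raise ValueError("no reads in input")

    def pct(keep):
        # Percent of all reads that belong to tags passing the filter.
        return 100.0 * sum(c for c in counts if keep(c)) / total_reads

    return {
        "reads_per_tag": total_reads / len(counts),
        "pct_singleton": pct(lambda c: c == 1),
        "pct_10_to_1000": pct(lambda c: 10 <= c <= 1000),
        "pct_over_1000": pct(lambda c: c > 1000),
    }


# Toy data only (hypothetical): five singletons, three mid-frequency
# tags and one very high-copy tag.
toy_counts = [1, 1, 1, 1, 1, 20, 75, 400, 1500]
summary = summarize_tag_counts(toy_counts)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Note that the percentages are weighted by reads, not by number of tags, which is why a single high-copy tag dominates the toy example.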
One of the initial worries we had was that high copy fragments of plastid DNA would dominate, as they did in the first trial of PstI-MspI GBS done by Allan, at ~40% of the sequenced fragments. In both of my libraries, ten tags (9 of which BLAST to chloroplast sequence) are at the highest frequency in every sample (each at least twice as frequent as the 11th most common tag). Overall, though, and within the individual samples that I checked, the total frequency of these ten tags is not more than 10%. Marco’s depletion protocol may improve this further.
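The per-sample check on the top tags can be sketched roughly as below. This is an illustration under stated assumptions, not my actual script: the function name, the tag labels, and the toy counts are hypothetical, and the input is assumed to be a dict mapping each tag to its read count in one sample.

```python
def top_tag_share(tag_counts, n=10):
    """Return (percent of reads in the n most common tags,
    ratio of the n-th to the (n+1)-th most common tag).

    tag_counts: dict of tag -> read count for a single sample.
    The ratio is None if there is no (n+1)-th tag to compare with.
    """
    ranked = sorted(tag_counts.values(), reverse=True)
    total = sum(ranked)
    if total == 0:
        raise ValueError("no reads in input")
    share = 100.0 * sum(ranked[:n]) / total
    if len(ranked) > n and ranked[n] > 0:
        gap = ranked[n - 1] / ranked[n]
    else:
        gap = None
    return share, gap


# Toy sample (hypothetical): ten high-copy tags at 100 reads each,
# plus 450 ordinary tags at 20 reads each.
toy_sample = {f"high{i}": 100 for i in range(10)}
toy_sample.update({f"tag{i}": 20 for i in range(450)})

share, gap = top_tag_share(toy_sample)
print(f"top 10 tags: {share:.1f}% of reads, 10th/11th ratio: {gap:.1f}")
```

A 10th/11th ratio of 2 or more corresponds to the "at least twice as frequent as the 11th most common tag" pattern described above.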
These data and more are in the attached excel file, although my labels might be cryptic (ask me if it’s not straightforward). I’m also interested in estimating the number of sites in the genome covered by each library, although this may be biased by divergence… so. I’m open to any suggestions for further comparisons!
Soon to come: a protocol for PstI–MspI GBS, and quantification of the sequencing effort variation among samples, along with handy scripts and tricks.