Purging GBS index switching

Considering the amount of sequencing coming out of the newer Illumina machines, we’ve started to combine GBS libraries with other samples. Due to how GBS libraries are made, when multiplexed with whole genome sequencing samples, there is an appreciable amount of contamination from GBS to WGS. That means you will find GBS reads in your WGS data. I’ve quantified that in the following figure, showing the percent of barcoded reads in WGS samples.

The left side shows contamination from barcodes sequenced in different lanes (i.e. ones that couldn’t contaminate). The right side shows barcodes from GBS samples sharing the same lane (i.e. ones that could contaminate). The take-home message is that between 1% and 15% of reads are GBS contamination. This is far above what we want to accept, so they should be removed.

I’ve written a script to purge barcoded reads from samples. You give it the set of possible barcodes, both forward and reverse (all current barcodes are listed here: GBS_2enzyme_barcodes). I’ve been conservative and have been giving it all possible barcodes, but you could also trim the list to only the barcodes that would be present in the lane. It looks for reads that start with a barcode (or any sequence 1bp away from a barcode, to account for sequencing error) followed by the cut site. If it finds a barcoded read, it removes both reads of the pair. It outputs some stats at the end to STDERR. Note that this was written for 2-enzyme PstI-MspI GBS, although it could be rewritten for other combinations.
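To make the matching rule concrete, here is a minimal bash sketch of the test described above. It is not the code from purge_GBS_contamination.pl: the function name `starts_with_barcode` is hypothetical, "1bp away" is assumed to mean a single substitution (no insertions or deletions), and the cut-site remnant is assumed to follow the barcode directly and match exactly. The barcode `ACGT` is made up for illustration, and `TGCAG` is my assumption for what remains of the PstI site at the start of a read.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not taken from purge_GBS_contamination.pl.
# Succeeds (exit status 0) if READ starts with BARCODE, allowing at most
# one substituted base, followed immediately by an exact copy of SITE.
starts_with_barcode() {
    local read=$1 barcode=$2 site=$3
    local blen=${#barcode} slen=${#site}
    local mismatches=0 i

    # The read must be long enough to hold barcode plus cut site.
    (( ${#read} >= blen + slen )) || return 1

    # The cut-site remnant must match exactly (assumption, see above).
    [[ ${read:blen:slen} == "$site" ]] || return 1

    # Count substitutions between the read prefix and the barcode.
    for (( i = 0; i < blen; i++ )); do
        if [[ ${read:i:1} != "${barcode:i:1}" ]]; then
            mismatches=$(( mismatches + 1 ))
        fi
    done

    (( mismatches <= 1 ))
}

# Example with a made-up barcode (ACGT) carrying one sequencing error (ACCT).
if starts_with_barcode "ACCTTGCAGGGTT" "ACGT" "TGCAG"; then
    echo "barcoded read: drop both reads of the pair"
fi
```

In a real run this check would be applied to every barcode in the list and to both reads of a pair, with the pair discarded if either read matches.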

An example of how you could apply this:

Make a bash script to run the perl script:

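# Sample prefix, i.e. the file name without _R1.fastq.gz
input=$1
# Purge barcoded read pairs; output files are written with the prefix ${input}.tmp
perl ../purge_GBS_contamination.pl /home/gowens/bin/GBS_2enzyme_barcodes.txt "${input}_R1.fastq.gz" "${input}_R2.fastq.gz" "${input}.tmp"
# Compress the purged FASTQ files
gzip "${input}".tmp_R*.fastq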

Run this bash script using GNU parallel:

ls | grep '\.gz$' | grep R1 | sed 's/_R1\.fastq\.gz//g' | parallel -j 10 bash ./decontaminate.sh 2>> decontamination_log.txt
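As a sanity check after the run, you could compare read counts before and after purging. The sketch below is an assumption-laden helper, not part of the original scripts: `count_reads` and `report_removed` are hypothetical names, the FASTQ files are assumed to use four lines per record, and the purged files are assumed to be named `${input}.tmp_R1.fastq.gz` as produced by the bash script above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking how many read pairs were purged.

# Print the number of reads in a gzipped FASTQ file
# (assumes exactly four lines per record).
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Print how many read pairs were removed for one sample prefix.
report_removed() {
    local sample=$1 before after
    before=$(count_reads "${sample}_R1.fastq.gz")
    after=$(count_reads "${sample}.tmp_R1.fastq.gz")
    echo "${sample}: removed $(( before - after )) of ${before} read pairs"
}
```

Calling `report_removed` with a sample prefix prints one summary line, which can be compared against the stats the perl script writes to decontamination_log.txt.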

One thought on “Purging GBS index switching”

  1. As a clarification, the GBS reads that are mistakenly indexed are “contaminations” with respect to the WGS library, but are perfectly good GBS reads and can/should be used as such.
    My understanding of the reason why GBS reads leak into the WGS files is that our GBS libraries currently do not use indices (barcodes in the middle of an adapter), while our WGS libraries do. Indices are sequenced in a short sequencing reaction that is independent of the usual two reads in paired-end libraries. Our GBS reads do not even have the sequence that is recognized by the primer used to sequence the indices, and therefore clusters from GBS reads stay completely dark during this extra sequencing cycle. However, nearby clusters from WGS reads will light up, and the image analysis software can erroneously attribute the signal from a nearby WGS cluster to a GBS cluster, which will then look like it has an index (and will, therefore, be demultiplexed as a WGS library).
