GBS two enzyme barcode sniffer

So you’ve sequenced a lane of GBS but you’ve forgotten what barcodes you used?

A) That’s horrible!

B) I can help you out.

This script takes raw sequence data and pulls out the barcode sequences in it. It assumes that you’re using PstI-MspI digestion and outputs the most common sequence tags found before the enzyme cut site; these should be your barcode sequences. It outputs the read count for each possible barcode pair. In the output, N means that no enzyme cut site or no barcode was found (which happens if you use the unbarcoded common adaptor). It looks at the first 1,000,000 reads of your files, and my general rule of thumb is that your real barcodes will show up more than 1000 times (at least with 192 samples).
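To make the idea concrete, here is a minimal Python sketch of the same logic. It is not the script itself, just an illustration under some assumptions: that you have paired FASTQ files, that read 1 carries the PstI remnant (TGCAG) and read 2 the MspI remnant (CGG) right after the barcode, and that barcodes are at most 12 bases long. The function names and the `MAX_BARCODE_LEN` constant are made up for this example.

```python
from collections import Counter
from itertools import islice

# Assumed cut-site remnants that follow the barcode at the start of each read.
PSTI_SITE = "TGCAG"   # expected on read 1 (assumption)
MSPI_SITE = "CGG"     # expected on read 2 (assumption)
MAX_BARCODE_LEN = 12  # hypothetical upper bound on barcode length


def tag_before_site(read, site, max_len=MAX_BARCODE_LEN):
    """Return the bases before the first cut site, or "N" if there is
    no cut site near the read start or nothing in front of it."""
    pos = read.find(site, 0, max_len + len(site))
    return read[:pos] if pos > 0 else "N"


def fastq_sequences(path):
    """Yield the sequence line (2nd of every 4 lines) of a FASTQ file."""
    with open(path) as handle:
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def count_barcode_pairs(r1_path, r2_path, max_reads=1_000_000):
    """Count (read 1 tag, read 2 tag) pairs over the first max_reads read pairs."""
    pairs = zip(fastq_sequences(r1_path), fastq_sequences(r2_path))
    counts = Counter()
    for r1, r2 in islice(pairs, max_reads):
        key = (tag_before_site(r1, PSTI_SITE), tag_before_site(r2, MSPI_SITE))
        counts[key] += 1
    return counts
```

Calling `count_barcode_pairs("R1.fastq", "R2.fastq").most_common(20)` would list the 20 most frequent tag pairs. Note that a short remnant like CGG can also turn up inside a barcode by chance, so treat low-count tags with suspicion.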

A quick check would be to take the output of the script and run this set of commands on it to count the number of barcodes it finds.

grep -v "^N" scriptoutput.txt | awk '($3 > 1000)' | wc -l

This number may be slightly smaller than your real number of barcoded samples just because some samples didn’t sequence well, which is normal. If the number is larger, or the barcodes listed shouldn’t be in your library, then you should be concerned.
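If you do have a list of the barcodes you think you used, you can check for missing and unexpected ones directly. Here is a rough Python sketch, assuming the script output is whitespace-separated with the two barcodes in columns 1 and 2 and the count in column 3, and no header line. The function names are hypothetical.

```python
def called_pairs(lines, min_count=1000):
    """Collect (barcode1, barcode2) pairs whose count exceeds min_count,
    skipping rows whose first column starts with N."""
    found = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("N"):
            continue
        if int(fields[2]) > min_count:
            found.add((fields[0], fields[1]))
    return found


def compare_to_expected(found, expected):
    """Report pairs you expected but did not see, and pairs you saw but did not expect."""
    return {
        "missing": sorted(expected - found),
        "unexpected": sorted(found - expected),
    }
```

A few entries under "missing" is the normal poorly sequenced sample case; anything under "unexpected" is the one to worry about.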