2 enzyme GBS UNEAK trickery

Want to run UNEAK on a bunch of samples that you sequenced using 2 enzyme GBS? You came to the right place!

This takes R1.fastq.gz and R2.fastq.gz files and makes a single file containing both reads with the appropriate barcodes put on read 2 and the CGG RE site changed to the Pst sequence. This lets you use all of your data in UNEAK… well at least the first 64 bases of all your data!

This also deals with the fact that we get data that is from one lane but split into many files. See example LaneInfo file*. You will also need an appropriately formatted UNEAK key file.

*Only list READ1 in this file. It will only work if your files are named /dir/dir/dir/ABC_R1.fastq.gz and /dir/dir/dir/ABC_R2.fastq.gz.

so the file would read:
/dir/dir/dir/ ABC_R1.fastq C5CDUACXX 3

This is also made to work with .gz fastq files.

greg@computer$ perl bin/MultiLane_UneakTricker.pl design/UNEAK_KEY_FILE design/LaneInfo.txt /home/greg/project/UNEAK/Illumina

You must have the GBS_Fastq_BarcodeAdder_2Enzyme.pl* in the same bin folder. You also need to make a “tmp” folder.


protip: UNEAK will be fooled by soft links so you can use them instead of copying your data.

actual UNEAK

Some of you may have heard me ramble about my little in house Uneak-like SNP calling approach. I am being converted to using the actual UNEAK pipeline. Why? reason #1 is good god is it fast! I processed 6 lanes raw fastq to snp table in ~1 hour. #2 I am still working on – this will be a comparison of SNP calls between a few methods but I have a good feeling about UNEAK right now.

Here is the UNEAK documentation: http://www.maizegenetics.net/images/stories/bioinformatics/TASSEL/uneak_pipleline_documentation.pdf

It is a little touchy to get it working so I thought I would post so you can avoid these problems.

First install tassel3 (not tassel 4) you will need java 6 or younger to support it: http://www.maizegenetics.net/index.php?option=com_content&task=view&id=89&Itemid=119

To be a hacker turn up the max RAM allocation. Edit run_pipeline.pl to change the line “my $java_mem_max_default = “–Xmx1536m”;” to read something like “my $java_mem_max_default = “–Xmx4g”;” (That is 4g for 4G of RAM).

Go to where you want to do the analysis. This should probably be a new and empty directory. Uneak starts by crawling around looking for fastq files so if the directory has some files you don’t want it using it is going to be unhappy.
Make the directory structure:

../bin/run_pipeline.pl -fork1 -UCreatWorkingDirPlugin -w /home/greg/project/UNEAK/ -endPlugin -runfork1 

Move your raw (read: not demultiplexed) fastq or qseq to /Illumina/. If you are using fastq files the names need to look like this: 74PEWAAXX_2_fastq.txt NOT like this:74PAWAAXX_2.fastq and probably not a bunch of other ways. It has to be flow cell name underscore lane number understore fastq dot txt. You might find others work but I know this does (and others don’t).
Make a key file. This is as described in the documentation. It does not have to be complete (every location on the plate) or sorted. Put it in /key/

Flowcell        Lane    Barcode Sample  PlateName       Row     Column
74PEWAAXX       1       CTCGCAAG        425218  MyPlate7        A       1
74PEWAAXX       1       TCCGAAG 425230  MyPlate7        B       1
74PEWAAXX       1       TTCAGA  425242  MyPlate7        C       1
74PEWAAXX       1       ATGATG  425254  MyPlate7        D       1

Run it. Barely enough time for coffee.

../bin/tassel3.0_standalone/run_pipeline.pl -fork1 -UFastqToTagCountPlugin -w /home/greg/project/UNEAK/ -e PstI -endPlugin -runfork1
# -c how many times you need to see a tag for it to be included in the network analysis
../bin/tassel3.0_standalone/run_pipeline.pl -fork1 -UMergeTaxaTagCountPlugin -w /home/greg/project/UNEAK/ -c 5 -endPlugin -runfork1
# -e is the "error tolerance rate" although it is not clear to me how this stat is generated
../bin/tassel3.0_standalone/run_pipeline.pl -fork1 -UTagCountToTagPairPlugin -w /home/greg/project/UNEAK/ -e 0.03 -endPlugin -runfork1
../bin/tassel3.0_standalone/run_pipeline.pl -fork1 -UTagPairToTBTPlugin -w /home/greg/project/UNEAK/ -endPlugin -runfork1
../bin/tassel3.0_standalone/run_pipeline.pl -fork1 -UTBTToMapInfoPlugin -w /home/greg/project/UNEAK/ -endPlugin -runfork1
#choose minor and max allele freq
../bin/tassel3.0_standalone/run_pipeline.pl -fork1 -UMapInfoToHapMapPlugin -w /home/greg/project/UNEAK/ -mnMAF 0.01 -mxMAF 0.5 -mnC 0 -mxC 1 -endPlugin -runfork1