A few notes on GATK.
1. GATK requires a younger version of Java than what is on the cluster currently.
> java -jar ./bin/GenomeAnalysisTK-2.8-1-
g932cd3a/GenomeAnalysisTK.jar --help Exception in thread "main" java.lang. UnsupportedClassVersionError: org/broadinstitute/sting/gatk/ CommandLineGATK : Unsupported major.minor version 51.0
Get the linux 64 bit version here:
move it to sciborg:
> scp ../Downloads/jdk-7u45-linux-x64.tar.gz email@example.com:cluster/bin/ #ssh to the cluster extract it: > tar -zxvf jdk-7u45-linux-x64.tar.gz #Add it to your path file or use it like so: > ./bin/jdk1.7.0_45/bin/java -jar ./bin/GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar --help
2. You must have a reference with a relatively small number of contigs/scaffolds
Kay identified and addressed this problem and wrote this script: CombineScaffoldForGATK. I’ve modified it very slightly.
perl CombineScaffoldForGATK.pl GenomeWithManyScaffolds.fa tmp.fa
WARNING: this can print empty lines. The script could be modified to address this but I don’t want to break it. Instead you can fix it with a sed one liner:
sed '/^$/d' tmp.fa > GenomeWith1000Scaffolds.fa
3. You must also prepare the “fasta file”
You also have to index it for BWA with bwa index but GATK needs more, see: http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference
In short (in the same dir as your genome):
>java -jar ../bin/picard-tools-1.105/CreateSequenceDictionary.jar R= Nov22k22.split.pheudoScf.fa O= Nov22k22.split.pheudoScf.dict >samtools faidx Nov22k22.split.pheudoScf.fa