Minimizing job queuing delays on Compute-Canada / SLURM

If you have several thousand similar jobs to submit to a SLURM cluster such as Compute Canada, one of your goals, aside from designing each job to run as quickly as possible, will be to reduce queuing delays — the time spent waiting for an idle node to accept your job. Scheduling delays can quickly overshadow the runtime of your job if you do not take care when requesting resources — you may wait hours, or even days. One way to minimize waiting, of course, is to schedule fewer jobs overall (e.g. by running more tasks within a given allocation if that allocation’s time has not expired). If you can fit everything in a single job, perfect. But what if you could divide the experiment into two halves that run concurrently, obtaining close to a 200% speedup while also spending less wallclock time waiting in the job queue?

When working with compute tasks whose execution parameters can be moved along several different resource-requirement axes (time, number of threads, memory), picking the right parameters can become difficult. Should I throw in more threads to save time? Should I use fewer threads to save RAM? Should I divide the inputs into smaller chunks to stay within a time limit? Wait times are also sometimes multiplied when jobs have to be resubmitted due to unforeseen errors (or a change of parameters).

It helps to understand how the SLURM scheduler makes its decisions when picking the next job to run on a node. The process is not exactly transparent. The good news is that it’s not fully opaque either — there are hints available.

Selecting resources.

Once a job has been submitted via commands such as sbatch or srun, it enters the job queue, along with thousands of other jobs submitted by fellow researchers. You can use squeue to see how far back you are in line, but that doesn’t tell you much; for instance, it will not always provide estimated start times.

You most likely already know that the resources you request for your job (number of nodes, number of CPUs, RAM, and wall-clock time limit) influence how soon your job is eligible to run. However, varying the requested resources can have surprising effects on the time spent in the scheduling state (a.k.a. state PENDING). The kicker:

In some cases, increasing requested job resources can lower overall queuing delays.

Compute nodes in the cluster have been statically partitioned by the admins. Varying the resource constraints of your job changes the set of nodes available to run it, somewhat like a step function. Each compute node on Compute Canada (cedar and graham) is placed in one or more partitions:

  • cpubase_bycore (for jobs wanting a number of cores, but not all cores on a node)
  • cpubase_bynode (for jobs requesting all the cores on a node)
  • cpubase_interac (for interactive jobs e.g., `salloc`)
  • gpubase_… (for jobs requesting gpus)
  • etc.

If you request fewer cores than the number available on the node’s hardware, your job waits for an allocation in the _bycore partitions. If you request all of the available cores, your job waits for an allocation in the _bynode partitions. So, depending on availability, it might help to ask for more threads and configure your jobs to process more work at once. The SLURM settings vary by cluster — the partitions above are for cedar and graham. Niagara, the new cluster, for instance, will only be doing by-node allocations.

On Compute Canada: you can view the raw information about all available nodes and their partitions with the sinfo command. You can also view current node availability by partition using partition-stats.
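For example, the following sinfo invocation prints each partition with its availability, time limit, node count, CPUs per node, and memory per node (the format string is just one I find readable; the field codes are documented in the sinfo man page):

$ sinfo -o "%20P %.5a %.11l %.6D %.4c %.8m"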

The following command outputs a table showing the current number of queued jobs in each partition. If you look at the table closely, you learn that there is only one job queued for the regular _bynode nodes (for jobs needing less than 3 hours), while the corresponding _bycore partition has 263 jobs sitting in it patiently. If you tweak your job to make it eligible for the partition with the most availability (often the one with the strictest requirements), you minimize your queuing time.

$ partition-stats -q 

     Number of Queued Jobs by partition Type (by node:by core) 

Node type | Max walltime 
          | 3 hr  | 12 hr   | 24 hr   | 72 hr  | 168 hr | 672 hr  |
----------|--------------------------------------------------------
Regular   | 1:263 | 429:889 | 138:444 | 91:3030| 14:127 | 122:138 |
Large Mem | 0:0   | 0:0     | 0:0     | 0:6    | 0:125  | 0:0     |
GPU       | 6:63  | 48:87   | 3:3     | 8:22   | 2:30   | 8:1     |
GPU Large | 1:-   | 0:-     | 0:-     | 0:-    | 1:-    | 0:-     |
----------|--------------------------------------------------------

“How do I request a _bynode allocation instead of a _bycore one?”

That is not obvious, and it is quasi-undocumented. The answer is that you do so by asking for all the CPUs available on the node, with sbatch --cpus-per-task N. To pick the best number of CPUs N, you have to dig a bit deeper and look at the inventory (this is where the sinfo command comes in handy); the next section covers this. Note that this may change over time as the cluster gets upgrades and reconfigurations.

Also, if you request more than one node in your job, each with N CPUs (e.g. sbatch --nodes=3 --cpus-per-task=32 ...), then all of them will be _bynode.
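For example, a hypothetical whole-node request on one of the common 32-core nodes might look like this (job.sh and the account name are placeholders; the memory request stays a bit under 128G so the job still fits on a base node):

sbatch --account=def-someuser --nodes=1 --cpus-per-task=32 --mem=120G --time=3:00:00 job.sh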

Rules of Thumb

Here are some quick rules of thumb that work well for the state of the cedar cluster as of April 2018. The information in this section was gathered with sinfo, through conversations with Compute Canada support, and through a few weeks of experience. In other words, I haven’t attempted to systematically study job wait times over the course of months, but I will claim that these settings have worked best so far for my use cases (backed by recommendations from the support team).

  1. Most compute nodes have 32 CPUs installed.
    If you sbatch with --cpus-per-task=32, you are likely to get your job running sooner than if you ask for, say, 16 CPUs. If your job needs only a few CPUs, it may be worth exploring options where 32 such tasks run in parallel inside one job. It’s okay to ask for more, but try to use all the resources you ask for, since your account will be debited for them.
  2. Most compute nodes have 128G of RAM.
    If you keep your job’s memory ceiling under that, you’re hitting a sweet spot and you’ll skip a lot of the queue. The next brackets up are (48 cores, 192G), with half as many nodes as the 128G variety, and (32 cores, 256G), with even fewer.
  3. Watch out for the --exclusive=user flag.
    The flag --exclusive=user tells the scheduler that you wish your job to be colocated only with jobs run by your own user. Perhaps counter-intuitively, it does not impose a _bynode restriction. If your job is already _bynode (i.e., you request enough CPUs to take a whole node), this flag is redundant. If you don’t ask for all the available cores (meaning your job needs a _bycore allocation), the flag will prevent jobs from other users from running on the remaining CPUs of that node. In that case it will likely hurt your progress (unless you have many such jobs that can fill the node).
  4. The time limit you pick matters. Try to batch your work in <= 3H chunks.
    A considerable number of nodes will only execute (or will favor) jobs that can complete in under 3 hours of wallclock time, so a much larger pool of nodes becomes eligible to you if you stay under that limit. The next brackets are 12H, 24H, 72H, 168H (1 week), and 28 days. This suggests there is no benefit to asking for 1H rather than 3H, or 18H rather than 22H, although only an intimate conversation with the scheduler’s code would confirm that. A submission script putting these rules together is sketched after this list.
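Putting rules 1, 2, and 4 together, a submission script might look like the following sketch (the account name and program are placeholders; the numbers reflect the cedar configuration described above):

#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=2:50:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=120G
# Whole node (rule 1), under the 128G ceiling (rule 2), under the 3 hour bracket (rule 4).
srun my_program --threads 32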

But, I want it now.

It should be mentioned that if only a single “last-minute” job needs to be run, salloc (which takes mostly the same arguments as sbatch) often provides the quickest turnaround: it will get you an interactive shell on a node of the requested size within a few minutes of asking. A separate partition, cpubase_interac, answers those requests. Again, it is worth looking at the available configurations. Keep salloc in your back pocket.
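For example (the account name is a placeholder; pick resources the same way you would for sbatch):

$ salloc --account=def-someuser --time=1:00:00 --cpus-per-task=4 --mem=8G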


Purging GBS index switching


Considering the amount of sequencing coming out of the newer Illumina machines, we’ve started to combine GBS libraries with other samples. Due to how GBS libraries are made, when multiplexed with whole genome sequencing samples, there is an appreciable amount of contamination from GBS to WGS. That means you will find GBS reads in your WGS data. I’ve quantified that in the following figure, showing the percent of barcoded reads in WGS samples.

The left side shows contamination from barcodes sequenced in different lanes (i.e. ones that couldn’t contaminate). The right side shows barcodes from GBS samples sharing the same lane (i.e. ones that could contaminate). The take-home message is that between 1% and 15% of reads are GBS contamination. This is far more than we want to accept, so these reads should be removed.

I’ve written a script to purge barcoded reads from samples. You give it the set of possible barcodes, both forward and reverse (all current barcodes are listed here: GBS_2enzyme_barcodes). I’ve been conservative and given it all possible barcodes, but you could also trim the list to only the barcodes present in the lane. It looks for reads that start with a barcode (or any sequence 1 bp away from a barcode, to account for sequencing error) followed by the cut site. If it finds a barcoded read, it removes both reads of the pair. It outputs some stats to STDERR at the end. Note that this was written for 2-enzyme PstI-MspI GBS, although it could be rewritten for other combinations.

An example of how you could apply this:

Make a bash script to run the perl script:

input=$1
perl ../purge_GBS_contamination.pl /home/gowens/bin/GBS_2enzyme_barcodes.txt ${input}_R1.fastq.gz ${input}_R2.fastq.gz ${input}.tmp;
gzip ${input}.tmp_R*.fastq

Run this bash script using GNU parallel:

ls | grep .gz$ | grep R1 | sed s/_R1.fastq.gz//g | parallel -j 10 bash .decontaminate.sh 2>> decontamination_log.txt


SplitsTree and IUPAC coding

I’ve been running SplitsTree to look at the relationships between samples of H. bolanderi. I coded my data using IUPAC ambiguity codes, so hets get their own symbol (Y, W, etc.). The other way to code heterozygous sites in a fasta is to just pick one allele at random.

By default, SplitsTree ignores all ambiguous sites, so if you use IUPAC coding it will ignore all those sites. I switched it from ignoring them to using an average over all possible alleles. This made my tree much messier, with a weird smattering of samples pulled toward the outgroup. I’ve figured out that it has to do with the amount of missing data: since Ns are ambiguous, averaging over them just homogenizes the distances between samples. Thus it can pull samples into weird positions if they have different amounts of missing data.

My thoughts are that you should just ignore ambiguous data if you have enough sites to resolve your samples without them.

Need cM positions?

Hello,

For a lot of work, it is helpful to know the cM position of your SNPs. Here is a script that takes the linkage map produced by Chris and your SNP table in HapMap format, and adds a column giving the cM position of each site. It interpolates between known positions, so the accuracy of any individual position is overstated, but it’s good for plotting.

https://github.com/owensgl/reformat/blob/master/hmp2cmpositions.pl

usage:

perl hmp2cmpositions.pl /moonriseNFS/HA412/pseudomolecules/lg.ALL.bronze14.path.txt yoursnptable.hmp > yoursnptable.withcm.hmp


SmartGit GUI Tool

NOTE: This is an old draft post from Thuy (last updated 17 Dec 2012). I’m publishing it because it seems useful and mainly complete. –Brook

What is Git?

Git is a distributed version control system. It allows multiple people to work on the same code simultaneously by keeping track of the changes made to files. It visualizes differences between file versions and merges changes from different authors. It also makes snapshots of file versions, so that you can go back to any version later. Because git is distributed, you store a copy of the code repository and its change history on your own local machine. When you are ready, you can sync your files to a remote repository server, such as BitBucket or GitHub. Syncing to the remote server shares the updated code with all the other users, and they can merge the changes into their own copies if they wish. Whether or not you use a remote repository server, git will always store your entire repository change history on your local machine.
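As a rough sketch, the cycle described above looks like this on the command line (the repository URL and file name are placeholders; SmartGit wraps these same operations in a GUI):

git clone https://github.com/yourname/yourproject.git
cd yourproject
# ... edit some files ...
git add analysis_script.R
git commit -m "describe the change"
git pull   # merge in changes other people have pushed to the remote
git push   # share your commits on the remote server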


2 enzyme GBS UNEAK trickery

Want to run UNEAK on a bunch of samples that you sequenced using 2 enzyme GBS? You came to the right place!

This takes R1.fastq.gz and R2.fastq.gz files and makes a single file containing both reads with the appropriate barcodes put on read 2 and the CGG RE site changed to the Pst sequence. This lets you use all of your data in UNEAK… well at least the first 64 bases of all your data!

This also deals with the fact that we get data that is from one lane but split into many files. See example LaneInfo file*. You will also need an appropriately formatted UNEAK key file.

*Only list READ1 in this file. It will only work if your files are named /dir/dir/dir/ABC_R1.fastq.gz and /dir/dir/dir/ABC_R2.fastq.gz.

so the file would read:
/dir/dir/dir/ ABC_R1.fastq C5CDUACXX 3

This is also made to work with .gz fastq files.

usage:
greg@computer$ perl bin/MultiLane_UneakTricker.pl design/UNEAK_KEY_FILE design/LaneInfo.txt /home/greg/project/UNEAK/Illumina

You must have the GBS_Fastq_BarcodeAdder_2Enzyme.pl* in the same bin folder. You also need to make a “tmp” folder.

LaneInfo
MultiLane_UneakTricker_v2
GBS_Fastq_BarcodeAdder_2Enzyme_v2

protip: UNEAK will be fooled by soft links so you can use them instead of copying your data.
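For example (paths are placeholders), instead of copying raw data into the UNEAK Illumina directory you could link it:

ln -s /path/to/raw/MyFlowcell_1_fastq.txt /home/greg/project/UNEAK/Illumina/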

Percent reads aligned collector

When you publish next-gen sequencing data you have to include the percent of reads aligned. The number is easy to get, but when you have 200+ samples it’s a pain to collate them all. This script takes a directory of bam files, uses samtools flagstat to get the percent of reads aligned, and then does a little rejiggering of the format to put it in a nice list. To run it, enter the directory with the bams and type ‘bash ./percent_counter.bash’

percent_counter.bash NOTE: The script is gzipped.
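If you just want the gist, the idea is roughly the following (this is a sketch, not the posted script; the exact flagstat output format varies a little between samtools versions):

for bam in *.bam; do
    # pull the "NN.NN%" mapped figure out of samtools flagstat
    pct=$(samtools flagstat "$bam" | grep -m 1 'mapped (' | cut -d '(' -f 2 | cut -d '%' -f 1)
    echo -e "$bam\t$pct"
done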

SNP calling with GATK warning

I’ve come to the realization that I’ve been making a mistake in my SNP calling with GATK: I’ve been filtering on site quality, not on individual genotype quality. This is a critical difference and leads to some dubious SNPs being called.

It works like this. For each position in the genome, GATK uses all the data together to call whether an alternate allele exists. This is represented by the phred-scaled QUAL score; a high QUAL score means that the alternate allele exists somewhere in your dataset. This number is often very high if you’re analyzing a whole plate of GBS, because it takes into account all the data from all the individuals. I’ve been filtering based on this number.

There is also a MQ score, which is the mapping quality for all the reads at that site.

For each individual, if there are any reads at that site, GATK will output a genotype call (i.e. 0/0, 0/1, 1/1). It also outputs the number of reads supporting each allele, the GQ score (the phred-scaled probability that the genotype call is correct), and the likelihoods of each possible genotype. When you use SelectVariants in GATK, it filters sites but not genotypes (so you can have a site that is good, but individual genotype calls at that site that are bad).

If you use vcf-to-tab to convert your VCF to tab-separated format, it will accept every genotype call. This means that for some individuals you will have a single read supporting the reference and the genotype will be called 0/0. The GQ score will be 3, which is incredibly low, but vcf-to-tab ignores that information and keeps the call. Therefore a proportion of your SNPs are highly questionable. This likely contributes to the heterozygote loss seen in GBS data.

Sam and Chris wrote a script that converts a VCF file to tab format but also makes sure that the quality (QUAL, GQ, MQ, and depth) is good. I’ll post it if they’re OK with it being up.

Edit: Here are the scripts. They are identical except that one outputs in IUPAC format (AT=W, AC=M, etc.) and the other outputs in the standard double-nucleotide format (AA, AT, etc.): vcf2vertical_dep_GATK-UG vcf2vertical_dep_GATK

Also note that sometimes GATK will output a genotype call that looks like:

./.:.:3

While normally they should look like:

./. or 0/0:17,0:18:45:0,45,353

This is saying that there are three reads at this position, but the quality is so bad that GATK is not calling the genotype. If you have these, the scripts will throw an error but will correctly output an NN for that genotype. So if you get a normal output and many errors, it’s probably OK.
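If you would rather deal with bad genotypes inside GATK before converting to tab format, one option (not the posted scripts; the flag names are from the GATK 2.x/3.x VariantFiltration documentation and the thresholds are only illustrative) is to use genotype-level filters, which tag failing genotypes so downstream steps can treat them as missing:

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R reference.fa -V input.vcf \
    --genotypeFilterExpression "GQ < 20 || DP < 5" \
    --genotypeFilterName "lowGQ" \
    -o genotype_flagged.vcf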

BayEnv

Apparently BayEnv is the way to do differentiation scans. Here I offer some scripts and advice to help you use BayEnv.

Installation is no problem; get it here: http://gcbias.org/bayenv/ and unzip.

I have to say the documentation is not very good.

There are two things you will probably want to use BayEnv for:
1) Environmental correlations
2) XtX population differentiation
I will just be showing XtX here. Both need a covariance matrix; note that this is not the “matrix.out” file you get from the matrix estimation step. But first, the SNP table.

Here is a script Greg O made to build a “SNPSFILE”* for BayEnv from our standard SNP table format: SNPtable2bayenv. I’ve made a small modification so that it excludes any non-biallelic sites and also prints out a loci info file.
Use it like this:

perl SNPtable2bayenv.pl SNPfile POPfile SampleFile

This will make three files with “SampleFile” appended to the start of the name.
*not actually a SNP file at all, it is an allele count file.

Input SNPfile (all spaces are tabs):

CHROM POS S1 S2 S3 S4 S5
chr1 378627 AA TT NN TT AA
chr2 378661 AA AG NN GG AA
chr3 378746 GG AA NN AA GG
chr4 383246 NN GG NN GG TT
chr5 397421 CC GG NN GG CC
chr6 397424 GG AA NN AA GG
chr7 409691 NN CC NN NN GG
chr8 547653 AA GG NN GG AA
chr9 608412 AA GG NN NN AA
chr10 756412 GG CC NN CC GG
chr11 1065242 NN TT NN NN CC
chr12 1107672 NN TT NN TT NN

POPfile (all spaces are tabs):

S1 p1
S2 p1
S3 p2
S4 p2
...

Now estimate the covariance matrix:

./bayenv2 -i SampleFile.bayenv.txt -p 2 -k 100000 -r21345 > matrix.out

Now, this “matrix.out” file is not what you actually want. You have to use it to make a new file that contains a single covariance matrix. You could use “tail -n3 matrix.out | head -n2 > CovMatrix” or just cut it out yourself. Currently I am just using the last (and hopefully converged) values.
My CovMatrix.txt looks like this (again single tabs between):

3.009807e-05 -1.147699e-07   
-1.147699e-07 2.981001e-05

You will also need an environment file. I am not doing the association analysis, so I am just using some fake data. There is one column for each population and each row is an environmental variable, tab separated.
FakeEnv.txt

1 2

Now you can do the population differentiation test for each SNP using a “SNPFILE”. I bet you think you are ready to go? Not so fast: you need a “SNPFILE”, not a “SNPSFILE”. Not that either actually has SNPs in it, just allele counts. Not that I am bitter. A SNPFILE is just one site from the SNPSFILE. Here is a script that takes your “SNPSFILE” and runs the XtX program on each of the sites: RunBayenvXtX

perl RunBayenvXtX.pl SampleFile.bayenv.txt SampleFile.LocInfo.txt CovMatrix.txt FakeEnv.txt 2 1 

Where 2 is the number of populations and 1 is the number of environmental variables.
This will produce a file that has the site, the allele count file name, and the XtX value.

FTP to a server

If you want to download something from a sequence read archive, the files are often available over FTP. A browser will automatically download those FTP files if you click the link, but if you are working on a server, it is a pain to download the file to your home computer and then upload it. Here is a quick guide to doing this on the command line.

1) Copy the link address for the FTP file you want from the website.

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR168/ERR168678/ERR168678_2.fastq.gz

2) SSH to your server and navigate to the folder you want to put the data in.

3) Type ‘ftp’. You are now in the ftp prompt.

ftp

4) Open a connection to the ftp server by typing in ‘open <server name>’. It asks you for a name, and if you don’t have one try ‘anonymous’. It asks for a password, try just pressing enter.

ftp> open ftp.sra.ebi.ac.uk
Connected to ftp.sra.ebi.ac.uk.
220-
220- ftp.era.ebi.ac.uk FTP server
220
Name (ftp.sra.ebi.ac.uk:owens): anonymous
331 Please specify the password.
Password:
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.

5) Navigate to the file you want using ‘cd’ to change directory and ‘ls’ to view directory contents.

ftp> cd vol1/fastq/ERR168/ERR168678
250 Directory successfully changed.
ftp> ls
200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.
-r--r--r-- 1 ftp ftp 10990053136 Dec 22 2012 ERR168678_1.fastq.gz
-r--r--r-- 1 ftp ftp 10998002687 Dec 22 2012 ERR168678_2.fastq.gz
226 Directory send OK.

6) Download data using ‘get <filename>’

get ERR168678_2.fastq.gz

7) Your data is now downloading. It’s best to run this in a screen session so that you don’t have to keep the window open, in case it runs slowly.

Approximate Bayesian Computations

In many cases it may be more straightforward (and informative) to test specific models using our data. An interesting approach for inferring population parameters and/or model testing is approximate Bayesian computation (ABC). There are several available tools, such as msBayes, DIYABC, PopABC, the abctools R package, and ABCTools.

Although ABC is a powerful and useful approach, it has some caveats, e.g. the choice of summary statistics, the number and complexity of the models tested, the amount of data, and more. With realistic expectations and simple models, ABC could really add some interesting insights to popgen studies.

Paired End for Stacks and UNEAK

Both Stacks and UNEAK are made for single-end reads. If you have paired-end data, here is a little cheat that puts “fake” barcodes onto the mate pairs and prints them all out to one file. It also adds the corresponding fake quality scores.

perl GBS_fastq_RE-multiplexer_v1.pl BarcodeFile R1.fastq R2.fastq Re-barcoded.fastq

BarcodeFile should look like this (same as for my demultiplexing script; spaces must be tabs):
sample1 ATCAC
sample2 TGCT

# note this could also look like this:
ATCAG ATCAG
TGCT TGCT

As it does not actually use the names (it just looks at the second column).

Here it is:
GBS_fastq_RE-multiplexer_v1

actual UNEAK

Some of you may have heard me ramble about my little in-house UNEAK-like SNP calling approach. I am being converted to using the actual UNEAK pipeline. Why? Reason #1: good god is it fast! I processed 6 lanes of raw fastq to a SNP table in ~1 hour. Reason #2 I am still working on; it will be a comparison of SNP calls between a few methods, but I have a good feeling about UNEAK right now.

Here is the UNEAK documentation: http://www.maizegenetics.net/images/stories/bioinformatics/TASSEL/uneak_pipleline_documentation.pdf

It is a little touchy to get it working so I thought I would post so you can avoid these problems.

First install TASSEL 3 (not TASSEL 4); you will need Java 6 or newer to support it: http://www.maizegenetics.net/index.php?option=com_content&task=view&id=89&Itemid=119

To be a hacker, turn up the max RAM allocation. Edit run_pipeline.pl to change the line my $java_mem_max_default = "-Xmx1536m"; to read something like my $java_mem_max_default = "-Xmx4g"; (that is 4g for 4G of RAM).
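One way to make that edit from the command line (this just swaps the default in place and keeps a backup copy of the script):

sed -i.bak 's/-Xmx1536m/-Xmx4g/' run_pipeline.pl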

Go to where you want to do the analysis. This should probably be a new and empty directory. UNEAK starts by crawling around looking for fastq files, so if the directory has files you don’t want it using, it is going to be unhappy.
Make the directory structure:

../bin/run_pipeline.pl -fork1 -UCreatWorkingDirPlugin -w /home/greg/project/UNEAK/ -endPlugin -runfork1 

Move your raw (read: not demultiplexed) fastq or qseq files to /Illumina/. If you are using fastq files, the names need to look like this: 74PEWAAXX_2_fastq.txt, NOT like this: 74PAWAAXX_2.fastq, and probably not a bunch of other ways. It has to be flowcell name, underscore, lane number, underscore, fastq, dot, txt. You might find other formats work, but I know this one does (and others don’t).
Make a key file. This is as described in the documentation. It does not have to be complete (every location on the plate) or sorted. Put it in /key/

Flowcell        Lane    Barcode Sample  PlateName       Row     Column
74PEWAAXX       1       CTCGCAAG        425218  MyPlate7        A       1
74PEWAAXX       1       TCCGAAG 425230  MyPlate7        B       1
74PEWAAXX       1       TTCAGA  425242  MyPlate7        C       1
74PEWAAXX       1       ATGATG  425254  MyPlate7        D       1
...

Run it. Barely enough time for coffee.

../bin/tassel3.0_standalone/run_pipeline.pl -fork1 -UFastqToTagCountPlugin -w /home/greg/project/UNEAK/ -e PstI -endPlugin -runfork1
# -c how many times you need to see a tag for it to be included in the network analysis
../bin/tassel3.0_standalone/run_pipeline.pl -fork1 -UMergeTaxaTagCountPlugin -w /home/greg/project/UNEAK/ -c 5 -endPlugin -runfork1
# -e is the "error tolerance rate" although it is not clear to me how this stat is generated
../bin/tassel3.0_standalone/run_pipeline.pl -fork1 -UTagCountToTagPairPlugin -w /home/greg/project/UNEAK/ -e 0.03 -endPlugin -runfork1
../bin/tassel3.0_standalone/run_pipeline.pl -fork1 -UTagPairToTBTPlugin -w /home/greg/project/UNEAK/ -endPlugin -runfork1
../bin/tassel3.0_standalone/run_pipeline.pl -fork1 -UTBTToMapInfoPlugin -w /home/greg/project/UNEAK/ -endPlugin -runfork1
#choose minor and max allele freq
../bin/tassel3.0_standalone/run_pipeline.pl -fork1 -UMapInfoToHapMapPlugin -w /home/greg/project/UNEAK/ -mnMAF 0.01 -mxMAF 0.5 -mnC 0 -mxC 1 -endPlugin -runfork1

fastSTRUCTURE is fast.

fastSTRUCTURE is two orders of magnitude faster than STRUCTURE. For everyone who has waited for STRUCTURE to run, this should make you happy.
A preprint of the manuscript describing it is here: http://biorxiv.org/content/early/2013/12/02/001073

UPDATE (comparison): Structure_v_FastStructure. STRUCTURE (top): 8 days and counting; fastSTRUCTURE (below): 15 minutes. You can see they are nearly identical; I would expect them to be within the amount of variation you normally see between STRUCTURE runs. STRUCTURE is only on K=4 (with 50,000 burn-in, 100,000 reps, and 5 reps of each K), whereas fastSTRUCTURE took just under 15 minutes to go from K=1 to 10. Note: these are sorted by Q values, not by individuals, so, although I think it is highly unlikely, they may be discordant on an individual scale.

UPDATE on speed: ~400 individuals by ~4000 SNPs K=1 to K=10 in just under 15 minutes.

UPDATE on usage: Your STRUCTURE input file must be labelled .str, but when you call the program, leave the .str off. If you don’t, it adds the extension itself and will look for MySamples.str.str.
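For example, with your data in MySamples.str (the --format=str flag tells fastStructure to expect STRUCTURE-format input rather than plink bed; the output prefix is arbitrary):

python structure.py -K 5 --input=MySamples --output=MySamples_out --format=str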

It has a few dependencies which can make the installation a little intimidating. I found installing them was relatively easy and I downloaded everything, installed and tested it all in less than one hour.

The site to get fastSTRUCTURE:
http://rajanil.github.io/fastStructure/

You can follow along there, the instructions are pretty good. This is just what I did with a few notes.

At the top you can follow links to the dependencies.

Scipy + Cython

#Scipy
sudo apt-get install python-scipy python-sympy
#Cython
download + scp
tar -zxvf Cython-0.20.tar.gz
cd Cython-0.20/
make
sudo python setup.py install

Numpy

#probably the easiest way:
sudo apt-get install python-numpy python-nose

#what I did before I figured out above:
download + scp
tar -zxvf numpy-1.8.0.tar.gz
cd numpy-1.8.0
python setup.py build --fcompiler=gnu
#also installed "nose" similarly it is at least needed to run the test
#test
python -c 'import numpy; numpy.test()'

On to fastSTRUCTURE

wget --no-check-certificate https://github.com/rajanil/fastStructure/archive/master.tar.gz
#then
#check where GSL is
ls /usr/local/lib
# You should see a bunch of stuff starting with libgsl
# You can look in your LD_LIBRARY_PATH before adding that path
echo $LD_LIBRARY_PATH
# add it
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
# check it worked
echo $LD_LIBRARY_PATH #mine only has ":/usr/local/lib" in it
#decompress the package
tar -zxvf master.tar.gz
cd fastStructure-master/vars/
python setup.py build_ext --inplace
cd ..
python setup.py build_ext --inplace
#test
python structure.py -K 3 --input=test/testdata --output=testoutput_simple --full --seed=100

Where to submit your manuscript?

I was recently made aware of an online tool called JANE (Journal/Author Name Estimator), which helps you decide where you should submit a paper you are working on. Essentially, you copy and paste your abstract into the online tool and it spits out a ranked list of journals where you should submit your paper. It seems to work pretty well: if you test it with the abstracts of recently published papers, the journal where the paper was actually published almost always comes up in the top 3 choices. JANE also has other functionality of potential use, like finding citations or authors related to your abstract.

This is where it gets interesting.

I read a little more about how the program works. It basically pulls all published abstracts from PubMed and then text-mines them for keywords found in your abstract. Now say you’ve written a manuscript; you run it through JANE and it tells you that the best fit is “American Journal of Botany”. But secretly, you hoped that your manuscript would go into a higher-profile journal, like say “Genetics”. Should you give up all hope and submit it to AJB? Of course not! What you should do is go look at abstracts published in Genetics and AJB and find out the key differences between them. Often they are surprisingly subtle, and by making slight modifications to your abstract, all of a sudden it can become a great fit for Genetics!

So this online tool can actually be very useful for writing an abstract in the style of the journal you wish to submit to. In addition, I know at least one senior editor at a high-profile journal who uses this tool to guide decisions about sending manuscripts out for review. This is probably not a good decision on their part, but at least now you can use it to your advantage!

GATK on sciborg

A few notes on GATK.
1. GATK requires a newer version of Java than what is currently on the cluster.
> java -jar ./bin/GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar --help
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/sting/gatk/CommandLineGATK : Unsupported major.minor version 51.0
Get the linux 64 bit version here:
http://java.com/en/download/
move it to sciborg:
> scp ../Downloads/jdk-7u45-linux-x64.tar.gz user@zoology.ubc.ca:cluster/bin/
# ssh to the cluster and extract it:
> tar -zxvf jdk-7u45-linux-x64.tar.gz
# Add it to your PATH or use it like so:
> ./bin/jdk1.7.0_45/bin/java -jar ./bin/GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar --help
2. You must have a reference with a relatively small number of contigs/scaffolds.
Kay identified and addressed this problem and wrote this script: CombineScaffoldForGATK. I’ve modified it very slightly.
perl CombineScaffoldForGATK.pl GenomeWithManyScaffolds.fa tmp.fa
WARNING: this can print empty lines. The script could be modified to address this, but I don’t want to break it. Instead you can fix it with a sed one-liner:
sed '/^$/d' tmp.fa > GenomeWith1000Scaffolds.fa
3. You must also prepare the fasta file.
You also have to index it for BWA with bwa index, but GATK needs more; see: http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference
In short (in the same dir as your genome):
>java -jar ../bin/picard-tools-1.105/CreateSequenceDictionary.jar R=Nov22k22.split.pheudoScf.fa O=Nov22k22.split.pheudoScf.dict
>samtools faidx Nov22k22.split.pheudoScf.fa

Quick phylogenetic trees colored by trait in R

Basically, here is some easy code to search publicly available databases to determine a trait value for a species (in this case whether or not the species was recorded as invasive in any of 5 global invasive species databases), then make a quick tree based on published phylogenies using Phylomatic (http://phylodiversity.net/phylomatic/), then color code the tree based on the trait value.

Subset of species from Asteraceae and whether they have been reported as invasive (‘weedy’) in any of 5 global invasive species databases.
