BayEnv

Apparently BayEnv is the way to do differentiation scans. Here I offer some scripts and advice to help the you use Bayenv.

Install, no problem get it here: http://gcbias.org/bayenv/ Unzip.

I have to say the documentation is not very good.

There is 2 things you will probably want to use BayEnv for:
1) Environmental correlations
2) XtX population differentiation
I will just be showing XtX here. Both need a covariance matrix. This is not the “matrix.out” file you get from the matrix estimation step.
. But first the SNP table.

Here is a script Greg O made to make a “SNPSFILE”* for bayenv from our standard SNP table format: SNPtable2bayenv I’ve made a small modification so that it excludes any non-bi-allelic site and to also print out a loci info file.
Use it like this:

perl SNPtable2bayenv.pl SNPfile POPfile SampleFile

This will make three files with “SampleFile” appended to the start of the name.
*not actually a SNP file at all, it is an allele count file.

Input SNPfile (all spaces are tabs):

CHROM POS S1 S2 S3 S4 S5
chr1 378627 AA TT NN TT AA
chr2 378661 AA AG NN GG AA
chr3 378746 GG AA NN AA GG
chr4 383246 NN GG NN GG TT
chr5 397421 CC GG NN GG CC
chr6 397424 GG AA NN AA GG
chr7 409691 NN CC NN NN GG
chr8 547653 AA GG NN GG AA
chr9 608412 AA GG NN NN AA
chr10 756412 GG CC NN CC GG
chr11 1065242 NN TT NN NN CC
chr12 1107672 NN TT NN TT NN

POPfile (all spaces are tabs):

S1 p1
S2 p1
S3 p2
S4 p2
...

Now estimate the covarience matrix:

./bayenv2 -i SampleFile.bayenv.txt -p 2 -k 100000 -r21345 > matrix.out

Now this is “matrix.out” file is not what you actually want. You have to use it to make a new file that contains a single covariance matrix. You could use “tail -n3 matrix.out | head -n2 > CovMatrix” or just cut it out yourself. Currently I am just using the last (and hopefully converged) values.
My CovMatrix.txt looks like this (again single tabs between):

3.009807e-05 -1.147699e-07   
-1.147699e-07 2.981001e-05

You will also need a environment file. I am not doing the association analysis so am just using some fake data. It is one column for each population and each row is the environmental values. Tab separated.
FakeEnv.txt

1 2

Now you can do the population differentiation test for each snp using a “SNPFILE”. I bet you think you are ready to go? Not so fast, you need a “SNPFILE” not a “SNPSFILE”. Not that either actually really have SNPs in them, just allele counts. Not that I am bitter. SNPFILE is just one site from the SNPSFILE. Here is a script that takes your “SNPSFILE” and runs the XtX program on each of the sites: RunBayenvXtX

perl RunBayenvXtX.pl SampleFile.bayenv.txt SampleFile.LocInfo.txt CovMatrix.txt FakeEnv.txt 2 1 

Where 2 is the number of populations and 1 is the number of environmental variables.
This will produce a file that has the site, allelecountfile name and the XtX value.

Approximate Bayesian Computations

In many cases it may be more straightforward (and informative) to test specific models using our data. An interesting approach for inferring population parameters and/or model testing is approximate Bayesian computations (ABC). There are several available tools such as msBayes, DIYABC, PopABC abctools R package, ABCTools.

Although ABC is a powerful and useful approach it has some caveats, e.g. choice of summary statistics, number and complexity of the models tested, amount of data and more. For realistic expectations and simple models ABC could really add some interesting insights to popgen studies.

Turning STACKS output into IMa2 input files

This script extract sequence haplotypes from the “alleles.tsv” files generated by STACKS and does some light filtering (you may want to add more). It’s very similar to the one I used for our 2013 Molecular Ecology paper, and still has some Great Sand Dunes-specific parameter names, but should work ok for other data sets. Oh, and I was using the “pstacks” reference-guided workflow in a slightly older version STACKS, in case that matters.

extract_haplotype_sequences_v4_annotated.r

example_alleles.tsv

Please let me know if you use this script and whether it needs tweaking.

SNP calling with ML (Greg B.)

Email for v7. Bug found that printed G’s as C’s and vise versa.

Call SNPs from sam files in a method similar to Hohenlohe et al 2010. Updated to v4 Feb 9. Previous version had a bug.

It is now fixed up for all of BWAs cigars flavors.

This only deals with reads that fit one of the following:
Full alignment (55M)
Soft clip at the start (10S45M)
Soft clip at end (45M10S)
One deletion (25M10D25M)
One insertion (20M10I20M)

This means it ignores reads with a cigar fields that have N, H, P, = or X and it ignores reads with a cigar more complicate then a single soft clip or a single indel. It also does not penalize reads adjacent to indels.

It ignores bases in soft clipped parts of reads

Continue reading

Population Genomics! (Greg B.)

Ask me for the most current version of these if you want to use them.

Several people in the lab are now working with very similar data (in structure) and have similar questions (in technique). As I have discussed with several people it would be very useful if we could all build up and share the tools needed to do these analysis. Understandably, people may want to do things on their own or in a specific manner, but I think there are several advantages to building this up together. The main thing is that it will be more efficient in terms of personnel time, having each person re-invent the wheel does not make sense. I think the blog is a great place to set do this. Below we can make our wish list and link to posts for solutions as they become available. As always, the principles (1,2 and 3) covered here apply.

Continue reading