Tassel 5

Tassel 5.0 is out and includes many GBS functions. Although it is GWAS-oriented, it can still do a lot of preprocessing for other purposes:

  • Bit level encoding of nucleotides so genetic distance and linkage disequilibrium estimates can be made very quickly (20-50X speed increases).
  • Extensive use of the HDF5 file format, which is well established among climate modelers as a robust format for matrix-style data.
  • Tools for extracting and calling SNPs from extensive Genotyping-by-Sequencing data (tested for 60,000 samples by over 2.5 million SNPs and 96 million sequence alleles).
  • Projection and imputation procedures that are optimized for the large families in crops.  Some of these optimizations permit memory and computational improvements of >100,000 fold.
  • Mixed models based on DNA relationships have come to dominate GWP (Meuwissen et al 2001) and GWAS (Yu et al 2006), yet these models can be slow to solve.  TASSEL has been a test bed and implements some of the best optimizations, such as EMMA (Kang et al 2008), plus approaches that optimize the variance components only once, such as P3D (Zhang et al 2010) and EMMAX (Kang et al 2010).  Compression algorithms are also available (Zhang et al 2010).  When used correctly, these optimizations make powerful GWAS computationally possible.
  • The code is being continually optimized for larger numbers of cores and clusters.  For example, we generally run imputation on 64-core machines.  And while Java provides excellent interoperability between systems, its code is about 2-fold slower than optimized C libraries, and 10-fold slower than GPU processing for some problems.  TASSEL5 is building out connection layers directly to native code for when these efficiencies are needed.
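
To give a feel for why bit-level encoding speeds up genetic distance estimates, here is a minimal Python sketch. It is not TASSEL's actual code: the two-bitset layout (one bit per site for the major allele, one for the minor allele), the function names and the haploid/inbred assumption are all mine, chosen for illustration. The point is that once the calls are packed into bits, comparing two samples takes a few bitwise operations and a bit count instead of a loop over sites.

```python
def encode(calls, major, minor):
    """Pack one sample's calls into two integers (bit i stands for site i).

    Hypothetical layout: the first integer marks sites carrying the major
    allele, the second marks sites carrying the minor allele. Any other
    call (e.g. "N") sets no bit and is treated as missing.
    """
    major_bits = 0
    minor_bits = 0
    for i, (call, maj, mnr) in enumerate(zip(calls, major, minor)):
        if call == maj:
            major_bits |= 1 << i
        elif call == mnr:
            minor_bits |= 1 << i
    return major_bits, minor_bits


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def distance(a, b):
    """Proportion of mismatching sites among sites called in both samples."""
    a_major, a_minor = a
    b_major, b_minor = b
    both_called = (a_major | a_minor) & (b_major | b_minor)
    mismatches = (a_major & b_minor) | (a_minor & b_major)
    n = popcount(both_called)
    return popcount(mismatches) / n if n else float("nan")


# Toy data: major/minor allele per site, then two samples.
major = "ACGTA"
minor = "GTACC"
sample_1 = encode("ACGTA", major, minor)
sample_2 = encode("GCGNA", major, minor)  # site 4 is missing
d = distance(sample_1, sample_2)
```

In this toy case the two samples differ at one of the four sites called in both, so `d` is 0.25. Heterozygous calls would need a richer scheme (for example, setting both bits), which this sketch leaves out.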

Approximate Bayesian Computations

In many cases it may be more straightforward (and informative) to test specific models using our data. An interesting approach for inferring population parameters and/or testing models is approximate Bayesian computation (ABC). There are several tools available, such as msBayes, DIYABC, PopABC, the abctools R package, and ABCTools.

Although ABC is a powerful and useful approach, it has some caveats, e.g. the choice of summary statistics, the number and complexity of the models tested, the amount of data, and more. With realistic expectations and simple models, ABC can add some interesting insights to popgen studies.
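
For readers who have not seen ABC before, the simplest flavour (rejection sampling) can be sketched in a few lines of Python. This is a toy example under my own assumptions, not how msBayes or DIYABC work internally: the "model" is a single allele frequency p with a uniform prior, the data are the number of derived alleles among n sampled chromosomes, and that count is also the summary statistic. All function and parameter names are made up for illustration.

```python
import random


def simulate_count(p, n, rng):
    """Simulate the number of derived alleles among n chromosomes."""
    return sum(rng.random() < p for _ in range(n))


def abc_rejection(observed, n, n_draws=20000, tolerance=1, seed=1):
    """Keep the prior draws whose simulated summary is close to the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        p = rng.random()  # draw the parameter from a uniform(0, 1) prior
        simulated = simulate_count(p, n, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(p)
    return accepted


# Toy data: 30 derived alleles observed among 100 chromosomes.
posterior = abc_rejection(observed=30, n=100)
posterior_mean = sum(posterior) / len(posterior)
```

The accepted values of p approximate the posterior distribution. The caveats mentioned above show up directly in the knobs: a looser `tolerance` accepts more draws but blurs the posterior, and with a more complex model the choice of summary statistic is no longer obvious.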

Some bioinformatics tools

New tools for analyzing NGS data come out from time to time. Usually we tend to stick to the same tools just because they work, but you may find that a newer tool is faster or more accurate for your purposes. Here is a list of tools (and links) which could be used for different purposes (not only RNA-Seq). I can recommend FastQC, FastX, and trimmomatic for cleaning any raw data, and sabre is a nice tool for cleaning GBS data.
FLASH works well for merging reads (there are more tools though), and this seems to be an interesting pre-processing approach. As for sequence aligners, besides bowtie and bwa you may find subread to be fast thanks to its different approach, and it comes with complementary tools for splice alignment, feature counting and SNP calling. Another interesting aligner is stampy, which can align reads to a diverged reference genome but needs some pre- and post-processing to get fast results; otherwise it takes forever.

Of course, there are also some more ‘traditional’ tools for aligning bigger sequences, such as MUMmer (nucmer, promer) and AMOS. Although these tools are powerful, they need some extra pre- and post-processing, which is not always appreciated.