Using RSEM to estimate gene and isoform expression (Sam)

RSEM is a relatively new bioinformatics tool that has been developed in conjunction with Trinity for the analysis of RNAseq data. RSEM can be used to estimate expression levels for both genes and different isoforms of genes, and is quite quick and easy to use, with an excellent google group for help (“RSEM users”). All it requires is an RNAseq dataset (either fasta or fq format) and a reference transcriptome that it can be aligned to.

The first step to using RSEM is to prepare a reference transcriptome. If you are working with Helianthus annus, there is a standard HA412 transcriptome that does nicely for this (here, I show code using HA412_trinity.fa as the reference transcriptome). There are two steps that must be done for a standard application. First, you must extract the ‘transcript-to-gene map’ using the following command to create the mapfile entitled “HA412_trinity_tgmap”:

extract-transcript-to-gene-map-from-trinity HA412_trinity.fa HA412_trinity_tgmap

Once you have extracted the transcript to gene map, you must prepare the reference:

rsem-prepare-reference –no-polyA –transcript-to-gene-map HA412_trinity_tgmap HA412_trinity.fa HA412_rsem_tgmap

Once you have prepared the reference, you can estimate expression for any number of sequence files that you want. A typical line to estimate expression for sequences called ALB_paired_1.fasta and ALB_paired_2.fasta might look like this:

rsem-calculate-expression –no-qualities –paired-end ALB_paired_1.fasta ALB_paired_2.fasta┬áHA412_rsem_tgmap ALB_aligned_HA412

That’s it…you’re done! RSEM will spit out a range of output files in a directory marked .stat. If you want to check to see distributions of how many different contigs each read was aligned to (and other stats), you can run “rsem-plot-model”, like so:

rsem-plot-model ALB_aligned_HA412 ALB_aligned_plot_stats.pdf

I have generally found that RSEM works best if you do some reasonably aggressive cleaning and filtering of the sequence files before running, as it does not really take quality scores into account (I use SnoWhite). The main output that you want from RSEM consists of two files, marked .genes.results and .isoforms.results, which have columns indicating the estimated expression for each contig and (if you ask for them) confidence intervals as well.

More instructions can be found at the RSEM homepage:

http://deweylab.biostat.wisc.edu/rsem/