Analyzing differential expression of de novo assembled transcriptomes using RSEM

Author : InsideDNA Time : 27 October 2016 Read time : 6 min

We already discussed assembly of single transcriptome in our Tutorial for RNA-Seq de novo assembly using Trinity. To compare transcriptomes of several samples, it is recommended to start analysis in slightly different way. At first you need to assemble transcriptome using reads of all your samples. Then you align reads of each sample to this common transcriptome separately and compare the results. We will start this tutorial with transcriptome assembled from reads of two samples obtained from two ganglion of medical lynch. You can assemble it yourself using our previous tutorial and samples SRR799260 and SRR799265 of this research, or just download ready file with assembled transcriptome by clicking at Download data button above.

1. Upload files with transcriptome and reads

If you have not followed our previous tutorial, you need to create a folder for your project called Hirudo. Navigate to Files tab, click on Create folder button and enter a folder name.

Upload file with transcriptome into this folder.

Now you need to download reads from which this transcriptome was assembled. Open Terminal:

And enter the following commands:

isub -t fastqdump_1 -c 4 -r 3.6 -e idna_fastq-dump -F -O /data/userXXX/Hirudo/ SRR799260

isub -t fastqdump_2 -c 4 -r 3.6 -e idna_fastq-dump -F -O /data/userXXX/Hirudo/ SRR799265

XXX here should be replaced with your own userID, which can be found in the header of the Terminal tab.

Press enter and wait for downloading read files. You can monitor progress of your tasks in Tasks tab.

2. Align and estimate transcript abundance using RSEM

Now to estimate transcript abundance, we need to count how much read corresponds to each assembled transcript. For this purpose we will run RSEM, which uses bowtie to align reads to transcriptome and estimate individual transcript abundance using obtained alignment.

Let’s create two folders called ganglion_2 and ganglion_19 in Hirudo folder for RSEM output for our two samples.

To run RSEM for each of our samples enter the following commands in the Terminal:

isub -t trinity_rsem_1 -c 8 -r 7.2 -e "/srv/dna_tools/trinityrnaseq-2.1.1/util/align_and_estimate_abundance.pl --transcripts /data/userXXX/Hirudo/Trinity.fasta --seqType fq --single /data/userXXX/Hirudo/SRR799260.fastq --est_method RSEM --aln_method bowtie --trinity_mode --prep_reference --output_dir /data/userXXX/Hirudo/ganglion_2/"

isub -t trinity_rsem_2 -c 8 -r 7.2 -e "/srv/dna_tools/trinityrnaseq-2.1.1/util/align_and_estimate_abundance.pl --transcripts /data/userXXX/Hirudo/Trinity.fasta --seqType fq --single /data/userXXX/Hirudo/SRR799265.fastq --est_method RSEM --aln_method bowtie --trinity_mode --prep_reference --output_dir /data/userXXX/Hirudo/ganglion_19/"

When these two tasks finishes, you will discover a few files in ganglion_2 and ganglion_19 folders.

In each folder you can see the file bowtie.bam based on which the estimates are made and you can also see files RSEM.genes.results and RSEM.isoforms.results, which are of the most interest for us. You can have a brief look at files content, using less command in Terminal:

less  /data/userXXX/Hirudo/ganglion_2/RSEM.genes.results

As you can see, each gene of isoform got an estimate of its effective length (roughly number of times, read can be aligned to this transcript) and its abundance in TPM and FPKM. FPKM is for fragments per kilobase transcript length per million fragments mapped and TPM is transcripts per million. TPM is generally preferred to FPKM because sum of all transcripts always comes to 1 million across samples.

Before going on in processing oug data, let’s rename files RSEM.genes.results and RSEM.isoforms.results according to the sample names (to ganglion_2.genes.results and ganglion_2.isoforms.results, for example).

Names of the files matter, because in the next step we will combine data for our samples in single table.

3. Building gene expression matrixes

Following command will help you to obtain gene expression matrix necessary for differential expression analysis:

isub -t est_to_matrix -c 8 -r 7.2 -e "/srv/dna_tools/trinityrnaseq-2.1.1/util/abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix /data/userXXX/Hirudo/genes_matrix /data/userXXX/Hirudo/ganglion_2/ganglion_2.genes.results /data/userXXX/Hirudo/ganglion_19/ganglion_19.genes.results"

You can produce similar file for isoforms, but in this Tutorial we will analyze data only on gene level.

When the task finishes, you will find few matrixes with prefix genes_matrix in Hirudo folder.

counts.matrix file is used for differential expression analysis and TMM.EXPR.matrix file is used for most other types of downstream analysis. File trans_counts.TPM.not_cross_norm contain values that are not normalized which is useful for some special cases. You can explore the structure of matrixes using preview option or by using thecommand less in terminal:

less /data/userXXX/Hirudo/genes_matrix.counts.matrix

Files contain estimations of abundance for each transcript in both samples.

Using these files, we are ready to run differential expression analysis.

4. Analyzing differential expression using EBSeq

To detect differentially expressed genes in your samples, type the following command in Terminal:

isub -t dif_expr2 -c 8 -r 7.2 -e "/srv/dna_tools/rsem-1.2.31/rsem-run-ebseq /data/userXXX/Hirudo/genes_matrix.counts.matrix 1,1 /data/userXXX/Hirudo/ expr.result"

1, 1 here stands for the single replicates of each condition in our sample, since we have only one sample for ganglion 2 and one sample for ganglion 19. For correct analysis you should have at least 3 replicates of each condition. In case of two conditions with 3 replicates, you need to specify 3,3 in this command.

Last thing we need to do is to filter the results by their significance. Given the acceptable false discovery rate of 1%, you should use the following command:

isub -t dif_expr_filter -c 8 -r 7.2 -e "/srv/dna_tools/rsem-1.2.31/rsem-control-fdr /data/userXXX/Hirudo/expr.result 0.01 /data/userXXX/Hirudo/expr_filtered.txt"

Resulting file called expr_filtered.txt contain the list of genes which are considered significantly differentially expressed. You can view it using less command in Terminal:

less /data/userXXX/Hirudo/expr_filtered.txt

Genes are sorted in decreasing order of probability of being differentially expressed. You can find these values called PPDE in third column of file.

Good job! Now you have mastered the basics of differential expression analysis of de novo assembled transcriptome. As you may have noticed, our analysis yielded only codes for differential expression of genes, which Trinity originally gave to assembled transcripts. They have no biological meaning and you will have to figure out, to what known genes your assembled transcripts correspond to. We will cover this step – annotation of assembled transcriptome – in one of our future Tutorials.

You may also interested in

RNA-Seq De Novo Assembly using Trinity

Guide to RNA-seq mapping with TopHat2: example of gene expression in human brain

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials