RNA-Seq De Novo Assembly using Trinity

Author : InsideDNA Time : 09 July 2016 Read time : 8 min

RNA-seq de novo assembly is one the most frequent type of sequence analysis in biology and bioinformatics. However, just as a complete genome assembly, RNA-seq assembly is not trivial and often requires large amount of RAM and CPUs. In this tutorial, we explain how to use one the most popular RNA-seq assemblers - Trinity. For the assembly pipeline and results interpretation we use transcripts from a ganglion of medicinal leech from this research. In the original study, RNA-seq data helped to understand how gene expression changes along the central nervous system of the species and to affiliate location with gene behavior. 

In this tutorial we are going to explore several steps: first you will learn how to automatically obtain data from NCBI with fastq-dump utility. then you will explore how to use Terminal and an interface to run Trinity. Once the RNA-seq are assembled, we will show several ways to assess the quality of transcriptomic assembly to get you started.

Now, let's dive into RNA-seq analysis.

1. Download raw data for Hirudo medicinalis

To start a new project, login to InsideDNA, navigate to Files tab and create a new folder called “Hirudo”.

Now open Terminal. To upload raw RNA-seq data into this folder we will use a special utility, called fastq-dump, which allows to directly transfer the data from the NCBI archive into your storage space on InsideDNA. Type the following command in Terminal window:

isub -t fastqdump -c 8 -r 7.2 -e "idna_fastq-dump -F -O /data/userXXX/Hirudo/ SRR799260"

Here you need to replace XXX with your own userID. Your user ID is visible at the top of the Terminal tab.

SRR799260 is a code of a sample which we will download from the Sequence Read Archive. Such codes could be found in a Methods section of research papers. /data/userXXX/Hirudo/ is a name of a folder, where data will be uploaded and fastq-dump is a name of a tool which will find file in a database, download it and transform into fastq format.

You can monitor the progress of your task in Tasks Tab. When the task is completed, you will find the file with reads in your Hirudo folder.

2. Run Trinity to assemble transcripts from RNA-seq reads

To assemble reads into transcripts we will use Trinity tool. We will demostrate how to run Trinity via terminal and via Tool interface.

2.1. Running Trinity via an interface

First, let look at the Trinity interface. Go to Files, type in a search box "trinity" and hit search button. You see that search returned a Trinity tool. Press Run.

On tool setting page, specify Task name and input fastq files. We are going to use singe reads, therefore we activate option "Input file(s) with single reads ( --single)", click browse button and navigate to our Hirudo folder and select fastq file.

To specify output folder you must remember its name must (!) include word "trinity" in it (e.g. trinity_out_dir). We will create an output folder directly in mini File manager as shown on the figure.

Once you created trinity_out_dir folder, select it and press Select button.

Now we need to specify few more parameters specific to our data. First, we have to set the memory requirements: Trinity is one of the most RAM-hungry tools out there, so we will use InsideDNA cloud capacity in its full extent - put 60G there. You must put G suffix (!), if you don't - the program will fail. Also, we will use 16 cores, so specify 16 in the next parameter field.

Next, we use fastq files, so we keep next parameter unchanged.

Click Show secondary parameter and scroll down until you see parameter "Run in silico normalization of reads" and select from a dropdown menu --normalize_reads. Also, change parameter "Retain all intermediate input files" from dropdown menu --no_cleanup

Now we are ready to submit the task. We still have to specify RAM and CPU for the task. Select (by dragging to the right a slidebar) 16 CPU and then, in a similar fashion, drag RAM till 60. 

Prewiew the task and click Submit. You can monitor task progress in Tasks.

2.2. Running Trinity via a terminal

To run Trinity from the terminal, we need to create a small bash script. This script will create temporary folders to speed up Trinity work. If you don’t want to write script yourself, you can download it using this link, modify its content on your local machine and then upload to InsideDNA.

We will use Vi editor in a Terminal to write script.

To create an empty file and to run Vi, type the following command in Terminal:

vi /data/userXXX/Hirudo/trinity.sh

Press button with I letter (capital i) on your keyboard to enter edit mode in Vi.

Then type (or copy-paste) the following lines in Terminal:

#! /bin/bash
work_dir=/tmp/trinity_in/
mkdir $work_dir
cp /data/userXXX/Hirudo/SRR799260.fastq $work_dir
/srv/dna_tools/trinityrnaseq-2.1.1/Trinity --seqType fq --max_memory 60G --single ${work_dir}SRR799260.fastq --CPU 16 --no_cleanup --normalize_reads --output /tmp/trinity_out_dir/
tar -zcvf /tmp/trinity.tar.gz /tmp/trinity_out_dir
mv /tmp/trinity.tar.gz /data/userXXX/trinity_tmp/

Don’t forget to replace XXX with your own userID.

Press Esc button to quit editing mode, and then type wq to save your edits and quit Vi editor.

Now you have a bash script called trinity.sh in Hirudo folder.

The last thing you need to do before you run Trinity, is to create a new folder called trinity_tmp in your root folder.

To run Trinity using your script, type the following command in Terminal:

isub -t trinity_sh -c 16 -r 60 -e "bash /data/userXXX/Hirudo/trinity.sh"

-c 16 and –r 60 here means that you allocate 16 cores CPU and 60 Gb RAM to run you task. With these resources it will take Trinity about one hour to assemble transcripts from reads.

3. Analyzing Trinity output

When the task is completed, you will discover archive called trinity.tar.gz in trinity_tmp folder. To unzip it, run the following command in Terminal:

cd Hirudo

tar –zxvf trinity.tar.gz

When archive is unpacked, you will see, that it contains plenty of files.

The most interesting file, containing assembled transcripts, is called Trinity.fasta. You can explore it using less command:

less /data/userXXX/trinity_tmp/tmp/trinity_out_dir/Trinity.fasta

Assembled transcripts, or clusters of reads, have special headers in this file:

>TRINITY_DN2_c0_g1_i1 len=408 path=[801:0-377 802:378-407] [-1, 801, 802, -2]

Here TRINITY_DN2_c0 is a read cluster, g1 is a gene and i1 is an isoform. These codes have nothing to do with gene IDs from ENSEMBL or other databases, and serve just to describe relationships between clusters of reads.

How do we assess the quality of transcripts assembly? First thing we can do is to estimate N50 parameter, characterizing length of contigs.

Type the following command in console:

isub -t trinity_stats -c 8 -r 7.2 -e "/srv/dna_tools/trinityrnaseq-2.1.1/util/TrinityStats.pl /data/userXXX/trinity_tmp/tmp/trinity_out_dir/Trinity.fasta > /data/userXXX/Hirudo/stats.txt"

When this task is finished, you can view output file with statistics:

less /data/userXXX/Hirudo/stats.txt

The other thing we can do is to estimate what proportion of reads contributed to the assembly. For this purpose we can align source reads on assembled transcripts.

First, we need to create an index files for Trinity.fasta:

isub -t bowtie2build -c 4 -r 3.6 -e "idna_bowtie2-build /data/userXXX/trinity_tmp/tmp/trinity_out_dir/Trinity.fasta /data/userXXX/trinity_tmp/tmp/trinity_out_dir/index"

When this task is finished, we are going to run Bowtie tool to make read mapping. You can find a detailed tutorial about Bowtie and read mapping in our previous tutorial Mapping sequencing reads to a reference genome with Bowtie2: a step-by-step guide

isub -t bowtie -c 4 -r 3.6 -e "idna_bowtie2 -x /data/userXXX/trinity_tmp/tmp/trinity_out_dir/index -U /data/userXXX/Hirudo/SRR799260.fastq -S /data/userXXX/Hirudo/trinity.sam 2>/data/user410/Hirudo/bowtie_stats.txt"

After running this task you will obtain an alignment called trinity.sam. Yet the most interesting part for us is the statistics of this alignment in file bowtie_stats.txt. You can view this file, typing such command in Terminal:

less /data/userXXX/Hirudo/bowtie_stats.txt

As you can see, about 88% or reads aligned to assembled transcripts, which means, that a very large part of our source data contributed to the Trinity assembly (this is a good sign!)

Congratulations, now you mastered the first steps of transcriptome research! Next steps, which could include identification of coding regions, their functional annotation, search of homologues or analysis of differential expression will be covered in future Tutorials. Stay tuned!

 

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials