Guide to RNA-seq mapping with TopHat2: example of gene expression in human brain

Author : InsideDNA Time : 10 February 2016 Read time : 13 min


Analysis of RNA expression is of the most important bioinformatics tasks. However, with RNA-seq many things can go wrong which makes expression analysis very tricky. In this tutorial we provide quite a detailed guide to RNA-seq mapping and explain some of the important factors you need to consider when doing mapping. You are going to touch a fascinating RNA-seq dataset obtained from a human brain tissue and used to study changes in gene expression patterns during aging in human.

RNA-seq is a next generation sequencing method that allows us to obtain a snapshot of the RNA present in a sample and estimate its abundance. RNA-Seq provides a comprehensive gene expression profile and helps to quantify and annotate genes and isoforms. The ability to quantify the level at which a particular gene is expressed in a cell, tissue or organism provides us with valuable biological information. For example, measuring gene expression can help to:

  • Identify viral infection of a cell (viral protein expression);
  • Determine an individual's susceptibility to cancer (oncogene expression);
  • Find if a bacterial strain is resistant to penicillin (beta-lactamase expression).

Ideally, measurement of expression should be done by detecting the final gene product (for many genes this is a protein); however, technically it is often easier to detect one of the protein precursors — typically mRNA — and infer gene expression level from there.

Several important factors make analysis of RNA-seq data complex:

  1. Most of the sequencing platforms only allow for up to 400 bp read length (but see PacBio and Oxford Nanopore). Therefore, reads are generally too short to cover an expressed gene region entirely and are thus called ‘partial transcripts’.
  2. Some fraction of the sequencing reads in an RNA-seq experiment align to non-contiguous segments of the genome. Such reads are called "junction reads" — that is, reads that span the site of a splice in mRNA. Junction reads allow us to identify sites of alternative splicing, but can be complex to map and identify.
  3. In RNA-seq experiments, there are some sources of systematic variation that should be eliminated from RNA-seq data before the differential expression (DE) analysis. In particular, such variations include between-sample differences such as library size (sequencing depth) or within-sample differences, for example, in gene length, guanine-cytosine (GC) content or unwanted variation introduced by the batch effect.

A critical step in the RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. Reference-based alignment methods utilize the sequence of each read to find a potential mapping location either by an exact match for a reference or by scoring sequence similarity.

For read mapping, you can use many different tools, but TopHat is one of the most popular and widely used. In this tutorial, we demonstrate an alignment of a real RNA-seq sample from human brain tissue to one human chromosome. We are going to use RNA-seq reads obtained from a fascinating article, “Widespread splicing changes in human brain development and aging”, which tells us a story about splicing changes in the human brain both during its development and as it ages. Authors of the article sequenced transcriptomes obtained from the brains of people of different ages and then characterized splicing changes that take place during an entire human lifespan in the brain. An entire dataset for this tutorial can be download by clicking here or on the “Download data” button.

1. Getting started

First, you need to activate your console. Log in (or sign up if you have not yet) to the InsideDNA application and read the Introduction Tutorial and the advanced tutorial in order to familiarize yourself with the different options available on the website. Once you have learned the basics, navigate to the Files tab. Then upload the training dataset via the upload button.

2. Choice of the mapping reference

TopHat maps partial transcripts to a reference genome. Both the organism and the exact version of a reference (e.g. hg18, hg19 for human genome) are vital when mapping sequencing reads. Reads mapped to one version are NOT interchangeable with the reads mapped to a different version. Our training dataset includes a 20th chromosome reference from the human genome hg38 version and human brain RNA-seq data from the article. We are going to use only 20th human chromosomes as a reference sequencing (aka genome), since mapping to an entire genome would take too much time.

Although our tutorial only considers the 20th chromosome, we also provide an explanation of how you would deal with the entire genome. If you are not interested in this explanation, skip to step 4.

3. Compiling a complete genome from UCSC Genome Browser (optional)

Our training dataset was downloaded from the UCSC Genome Browser. However, if you want to map to a complete genome you should use this link. You only want to use FASTA files for the unmasked genome of interest. Do NOT use masked sequences with repetitive parts turned into N’s.

The archive hg38.chromFa.tar.gz has sequences of 24 canonical chromosomes (twenty-two autosomes + XY). We recommend removing the "*_random.fa" chromosomes and other non-canonical chromosomes. These FASTA files often contain part of the canonical chromosomes, in addition to the regions that cannot be placed in the assembly. The problem with these regions is that the part shared with the canonical chromosome is present twice, making it difficult to map the reads to a unique location, thus potentially increasing the number of multi-mapped reads. We want to avoid multi-mapping cases. Because you have separate FASTA files in the archive, you need to concatenate individual FASTA files into a single file. You can do this using the ‘cat’ command, which merges the files together:

cat *.fa > genome.fa

4. Indexing of the mapping reference

TopHat uses Bowtie2 to align reads. To be able to align many reads to a reference sequencing quickly, Bowtie2 creates a specially formatted database called ‘index.' For the common genomes (e.g. mouse, human, etc.), such indexed references are already created and can be downloaded from the Bowtie2 website. However, when you work with a custom reference, you need to create an indexed reference database yourself. Bowtie2 provides its indexing tool, called Bowtie2-build, to create such an index.

To create a reference, in the left menu choose ‘Console’, and click on the ‘Activate’ button.

First, let’s create an empty directory for our analysis using the ‘mkdir’ command:

mkdir mapping_tophat

Next, go to the directory ‘mapping_tophat’ using the ‘cd’ command:

cd mapping_tophat

Next, create the directory called ‘index’ (use the ‘mkdir index’ command to do so) for the chromosome sequence.

Move humanbrain.tar.gz to the ‘mapping_tophat’ directory in Files tab, or in Console using the ‘mv’ command, and unpack it using the ‘tar xvzf’ command (ignore warnings in the Console, which are unimportant in our case):

tar xvzf humanbrain.tar.gz

You will have two files in the ‘mapping_tophat’ directory — the RNA-seq FASTA file and chr20.fa.gz:

Move chr20.fa.gz into the ’index’ folder using the ‘mv’ command, and go into the ’index’ folder using the ‘cd’ command:

mv chr20.fa.gz /index/chr20.fa.gz

cd index

Unpack the gz archive with the ‘gunzip’ command (ignore warnings in the Console, which are unimportant in our case):

gunzip chr20.fa.gz

Options for Bowtie2-build are very easy - you simply need to provide a FASTA file (always put a full path name, starting from /data/userXXX/ before the file name) and the name of the output index (again with the full path). You can submit indexing using Bowtie2-build via isub wrapper. Never forget to use the idna_ prefix in front of the tool name:

isub -t index -c 1 -r 1.7 -e 'idna_bowtie2-build /data/user446/mapping_tophat/index/chr20.fa /data/user446/mapping_tophat/index/chr20'

Indexing takes about 4 minutes to complete. You can monitor task completion in ‘Tasks’ tab.

5. RNA-seq reads alignment to the genome using Tophat2

Now we are ready to launch TopHat itself.  Parameters of TopHat depend on reads quality, availability of annotation and other factors that you would like to consider. If you use annotation (-G/--GTF option), TopHat first extracts the transcript sequences and uses Bowtie to align reads to this virtual transcriptome. You also can provide only coordinates of introns (splice junctions) in a special format using option -j/--raw-juncs.

Since we search introns de novo, we specify parameters of intron length: -i option determines the minimum intron length and -I determines the maximum length of introns. Also, we define an option ‘--max-coverage-intron’ that sets the maximum intron length that may be found during the coverage search. In our example, we map reads without annotation or specified junctions.

Option -N means that the final read alignments that have more than 3 mismatches are discarded. Option ‘--read-edit-dist’ shows the minimum edit distance for accepted reads. ‘Edit distance’ is the main metric for alignment quality. It measures the minimum number of operations required to transform one string into another. More specifically, for a sequence alignment, edit distance is defined as the total number of mismatched, inserted or deleted bases in the reference genome.

Some of the reads spanning multiple exons may be mapped incorrectly as a contiguous alignment to the genome, even though the correct alignment should be a spliced one — this can happen in the presence of processed pseudogenes that are rarely (if at all) transcribed or expressed. To correct for that, we can use the option, ‘--read-realign-edit-dist’, which directs TopHat to re-align reads for which the edit distance of an alignment obtained in a previous mapping step is above or equal to this option value. If you set this option to 0, TopHat maps every read in all the mapping steps, reporting the best possible alignment found in any of these mapping steps. It may greatly increase the mapping accuracy, at the expense of an increase in running time. The default value for this option is set such that TopHat does not try to realign reads already mapped in earlier steps.

Finally, –M parameter tells TopHat that we are mapping reads to a whole genome, and thus we wish to exclude multi-mapped reads.

Our final command looks like:

isub -t mapping -c 2 -r 7.5 -e 'idna_tophat2 -N 3 --read-edit-dist 5 --read-realign-edit-dist 2 -i 50 -I 5000 --max-coverage-intron 5000 -M -o out /data/user446/mapping_tophat/index/chr20 /data/user446/mapping_tophat/L6_18_GTGAAA_L007_R1_001.fastq.gz'

Press enter to submit the task. Mapping takes 46 min. You can monitor task progress in the ‘Tasks’ section.

When running TopHat with the paired-end reads it is critical that the *_1 files and the *_2 files appear in separate comma-delimited lists, and that the order of the files in the two lists is the same.

tophat [options]* [reads1_2,...readsN_2]

TopHat allows the use of additional unpaired reads to be provided after the paired reads. These unpaired reads can be either given at the end of the paired read files on one side (as reads that can no longer be paired with reads from the other side), or they can be given in separate file(s) which are appended (comma delimited) to the list of paired input files on either side, e.g.:

tophat [options]* PE_reads_1.fq.gz,SE_reads.fa PE_reads_2.fq.gz


tophat [options]* PE_reads_1.fq.gz PE_reads_2.fq.gz,SE_reads.fa

6. Interpreting Tophat2 output

In output you can see three useful files:

  • align_summary.txt with the total number of mapped reads and multi-mapped reads. In our example, we can see that only 0.6% of reads have mapped to the genome. This is not surprising, since the 22nd chromosome contains about 1% of the whole human genome, and the remaining unmapped reads must map to the other chromosomes. Usually, if you use the entire genome as a reference, about 80-90% of all your reads align to the genome, and up to 10-15% of them have multiple alignments.
  • *.bam files with alignments of reads in special sam format (*.bam is a compressed *.sam file). accepted_hits.bam is the main file that you use for counting expression of the genes. Many tools, such as Cufflinks, can use this file as input to calculate normalized abundances of transcripts for subsequent comparison between samples. To view and manipulate these *.bam files (e.g. sort or merge) you should use samtools tool.
  • *.bed files with coordinates of introns (junctions.bed) and indels (insertions.bed and deletions.bed).

Congratulations! You have successfully mapped RNA-seq to a human genome. We hope that you enjoyed this tutorial and look forward to your feedback.

We are trying to help scientists as much as possible with InsideDNA. If you have any questions or suggestions for making tutorials more useful for you — either in terms of format/style or focus — then please do not hesitate to contact us. We are happy to respond to you either via email or over Skype and follow up your suggestions in subsequent tutorials.

Also, if you have some tools or packages that you would like to propose for future tutorials, please let us know, and we will prioritize our efforts accordingly.

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials