Analysis of RNA expression is of the most important bioinformatics tasks. However, with RNA-seq many things can go wrong which makes expression analysis very tricky. In this tutorial we provide quite a detailed guide to RNA-seq mapping and explain some of the important factors you need to consider when doing mapping. You are going to touch a fascinating RNA-seq dataset obtained from a human brain tissue and used to study changes in gene expression patterns during aging in human.
RNA-seq is a next generation sequencing method that allows us to obtain a snapshot of the RNA present in a sample and estimate its abundance. RNA-Seq provides a comprehensive gene expression profile and helps to quantify and annotate genes and isoforms. The ability to quantify the level at which a particular gene is expressed in a cell, tissue or organism provides us with valuable biological information. For example, measuring gene expression can help to:
Ideally, measurement of expression should be done by detecting the final gene product (for many genes this is a protein); however, technically it is often easier to detect one of the protein precursors — typically mRNA — and infer gene expression level from there.
Several important factors make analysis of RNA-seq data complex:
A critical step in the RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. Reference-based alignment methods utilize the sequence of each read to find a potential mapping location either by an exact match for a reference or by scoring sequence similarity.
For read mapping, you can use many different tools, but TopHat is one of the most popular and widely used. In this tutorial, we demonstrate an alignment of a real RNA-seq sample from human brain tissue to one human chromosome. We are going to use RNA-seq reads obtained from a fascinating article, “Widespread splicing changes in human brain development and aging”, which tells us a story about splicing changes in the human brain both during its development and as it ages. Authors of the article sequenced transcriptomes obtained from the brains of people of different ages and then characterized splicing changes that take place during an entire human lifespan in the brain. An entire dataset for this tutorial can be download by clicking here or on the “Download data” button.
First, you need to activate your console. Log in (or sign up if you have not yet) to the InsideDNA application and read the Introduction Tutorial and the advanced tutorial in order to familiarize yourself with the different options available on the website. Once you have learned the basics, navigate to the Files tab. Then upload the training dataset via the upload button.
TopHat maps partial transcripts to a reference genome. Both the organism and the exact version of a reference (e.g. hg18, hg19 for human genome) are vital when mapping sequencing reads. Reads mapped to one version are NOT interchangeable with the reads mapped to a different version. Our training dataset includes a 20th chromosome reference from the human genome hg38 version and human brain RNA-seq data from the article. We are going to use only 20th human chromosomes as a reference sequencing (aka genome), since mapping to an entire genome would take too much time.
Although our tutorial only considers the 20th chromosome, we also provide an explanation of how you would deal with the entire genome. If you are not interested in this explanation, skip to step 4.
Our training dataset was downloaded from the UCSC Genome Browser. However, if you want to map to a complete genome you should use this link. You only want to use FASTA files for the unmasked genome of interest. Do NOT use masked sequences with repetitive parts turned into N’s.
The archive hg38.chromFa.tar.gz has sequences of 24 canonical chromosomes (twenty-two autosomes + XY). We recommend removing the "*_random.fa" chromosomes and other non-canonical chromosomes. These FASTA files often contain part of the canonical chromosomes, in addition to the regions that cannot be placed in the assembly. The problem with these regions is that the part shared with the canonical chromosome is present twice, making it difficult to map the reads to a unique location, thus potentially increasing the number of multi-mapped reads. We want to avoid multi-mapping cases. Because you have separate FASTA files in the archive, you need to concatenate individual FASTA files into a single file. You can do this using the ‘cat’ command, which merges the files together:
cat *.fa > genome.fa
TopHat uses Bowtie2 to align reads. To be able to align many reads to a reference sequencing quickly, Bowtie2 creates a specially formatted database called ‘index.' For the common genomes (e.g. mouse, human, etc.), such indexed references are already created and can be downloaded from the Bowtie2 website. However, when you work with a custom reference, you need to create an indexed reference database yourself. Bowtie2 provides its indexing tool, called Bowtie2-build, to create such an index.
To create a reference, in the left menu choose ‘Console’, and click on the ‘Activate’ button.
First, let’s create an empty directory for our analysis using the ‘mkdir’ command:
Next, go to the directory ‘mapping_tophat’ using the ‘cd’ command:
Next, create the directory called ‘index’ (use the ‘mkdir index’ command to do so) for the chromosome sequence.
Move humanbrain.tar.gz to the ‘mapping_tophat’ directory in Files tab, or in Console using the ‘mv’ command, and unpack it using the ‘tar xvzf’ command (ignore warnings in the Console, which are unimportant in our case):
tar xvzf humanbrain.tar.gz
You will have two files in the ‘mapping_tophat’ directory — the RNA-seq FASTA file and chr20.fa.gz:
Move chr20.fa.gz into the ’index’ folder using the ‘mv’ command, and go into the ’index’ folder using the ‘cd’ command:
mv chr20.fa.gz /index/chr20.fa.gz
Unpack the gz archive with the ‘gunzip’ command (ignore warnings in the Console, which are unimportant in our case):
Options for Bowtie2-build are very easy - you simply need to provide a FASTA file (always put a full path name, starting from /data/userXXX/ before the file name) and the name of the output index (again with the full path). You can submit indexing using Bowtie2-build via isub wrapper. Never forget to use the idna_ prefix in front of the tool name:
isub -t index -c 1 -r 1.7 -e 'idna_bowtie2-build /data/user446/mapping_tophat/index/chr20.fa /data/user446/mapping_tophat/index/chr20'
Indexing takes about 4 minutes to complete. You can monitor task completion in ‘Tasks’ tab.
Now we are ready to launch TopHat itself. Parameters of TopHat depend on reads quality, availability of annotation and other factors that you would like to consider. If you use annotation (-G/--GTF option), TopHat first extracts the transcript sequences and uses Bowtie to align reads to this virtual transcriptome. You also can provide only coordinates of introns (splice junctions) in a special format using option -j/--raw-juncs.
Since we search introns de novo, we specify parameters of intron length: -i option determines the minimum intron length and -I determines the maximum length of introns. Also, we define an option ‘--max-coverage-intron’ that sets the maximum intron length that may be found during the coverage search. In our example, we map reads without annotation or specified junctions.
Option -N means that the final read alignments that have more than 3 mismatches are discarded. Option ‘--read-edit-dist’ shows the minimum edit distance for accepted reads. ‘Edit distance’ is the main metric for alignment quality. It measures the minimum number of operations required to transform one string into another. More specifically, for a sequence alignment, edit distance is defined as the total number of mismatched, inserted or deleted bases in the reference genome.
Some of the reads spanning multiple exons may be mapped incorrectly as a contiguous alignment to the genome, even though the correct alignment should be a spliced one — this can happen in the presence of processed pseudogenes that are rarely (if at all) transcribed or expressed. To correct for that, we can use the option, ‘--read-realign-edit-dist’, which directs TopHat to re-align reads for which the edit distance of an alignment obtained in a previous mapping step is above or equal to this option value. If you set this option to 0, TopHat maps every read in all the mapping steps, reporting the best possible alignment found in any of these mapping steps. It may greatly increase the mapping accuracy, at the expense of an increase in running time. The default value for this option is set such that TopHat does not try to realign reads already mapped in earlier steps.
Finally, –M parameter tells TopHat that we are mapping reads to a whole genome, and thus we wish to exclude multi-mapped reads.
Our final command looks like:
isub -t mapping -c 2 -r 7.5 -e 'idna_tophat2 -N 3 --read-edit-dist 5 --read-realign-edit-dist 2 -i 50 -I 5000 --max-coverage-intron 5000 -M -o out /data/user446/mapping_tophat/index/chr20 /data/user446/mapping_tophat/L6_18_GTGAAA_L007_R1_001.fastq.gz'
Press enter to submit the task. Mapping takes 46 min. You can monitor task progress in the ‘Tasks’ section.
When running TopHat with the paired-end reads it is critical that the *_1 files and the *_2 files appear in separate comma-delimited lists, and that the order of the files in the two lists is the same.
tophat [options]* [reads1_2,...readsN_2]
TopHat allows the use of additional unpaired reads to be provided after the paired reads. These unpaired reads can be either given at the end of the paired read files on one side (as reads that can no longer be paired with reads from the other side), or they can be given in separate file(s) which are appended (comma delimited) to the list of paired input files on either side, e.g.:
tophat [options]* PE_reads_1.fq.gz,SE_reads.fa PE_reads_2.fq.gz
tophat [options]* PE_reads_1.fq.gz PE_reads_2.fq.gz,SE_reads.fa
In output you can see three useful files:
Congratulations! You have successfully mapped RNA-seq to a human genome. We hope that you enjoyed this tutorial and look forward to your feedback.
We are trying to help scientists as much as possible with InsideDNA. If you have any questions or suggestions for making tutorials more useful for you — either in terms of format/style or focus — then please do not hesitate to contact us. We are happy to respond to you either via email or over Skype and follow up your suggestions in subsequent tutorials.
Also, if you have some tools or packages that you would like to propose for future tutorials, please let us know, and we will prioritize our efforts accordingly.
Follow us on Facebook and Twitter to be the first to read our new tutorials!Run this tool More tutorials