Mapping sequencing reads to a reference genome with Bowtie2: a step-by-step guide

Author : InsideDNA Time : 22 February 2016 Read time : 15 min

Sequencing read mapping is a key step of the next generation sequencing data processing. It allows to find locations of the newly sequenced reads and align them with respect to a reference sequence (e.g. reference sequence, transcriptome, de novo assembly). Both RNA-seq analysis and variants calling require read mapping to be as accurate as possible. In this tutorial, we explain how to do read mapping with Bowtie2, one of the most popular tools for read alignment. We also explain how to fine-tune some of the Bowtie2 parameters in order to achieve the highest sensitive of the mapping. We continue with the data from the study of a multi-drug resistance in Mycobacterium tuberculosis.

When researchers explore sequenced DNA, in many cases they look for the differences between newly sequenced samples and genomes of the same — or in evolutionary terms not very distant — species that have previously been sequenced and assembled. A genome against which new samples are tested is called a reference genome. To identify the differences in DNA, each read of the newly sequenced sample is aligned to a fragment of a reference genome. The differences between reads and corresponding fragments of a reference genome can be used to explore intraspecific or interspecific variability.

In principle, DNA differences can be grouped into four broad classes:

  • substitutions (SVN)
  • small insertions or deletions (indels)
  • copy number variants (CNV)
  • large structural variants (SV)

Normally, SVNs and indels are collectively called SNPs and are the most common types of DNA difference used for exploration of intraspecific and interspecific variability. Insertions, deletions and inversions greater than 1000bp (sometimes even 50bp is considered as a threshold) are normally classified as CNVs and SVs. There is a range of CVN and SV types, but the most common ones are deletions, novel sequence insertions, mobile-element insertions, tandem and interspersed segmental duplications, inversions and translocations of elements  >1Kb (50bp) in size.

The very first step of DNA difference detection is the mapping of newly sequenced reads to a reference or references. This procedure allows us to identify which fragments are present or absent in the new sample, relative to the reference, and then identify changes. The dataset we are working with, from the Tuberculosis paper, used only SNPs for reconstruction of the phylogenetic trees, the inference of positive selection, and for coalescence analysis. Specifically, in supplementary materials, authors mention that they “excluded […] repetitive regions and […] indels”. Therefore, we are going to perform mapping of reads with the ultimate goal of carrying out SNP calling.

Read mapping is not a trivial task, because mappers (bioinformatics tools that do the read mapping) should be able to find the true location of each read at sufficient speed, whilst also accounting for technical sequencing errors and true genetic variation within the sample. Currently, there are over 60 different mappers, and their number is growing. In this tutorial, we will use Bowtie2 in order to map sequencing reads from the multidrug-resistant bacteria to the corresponding positions of a less detrimental strain of Mycobacterium tuberculosis. Note that we will use preprocessed NGS reads and reference sequence (download here) that we prepared in our previous tutorials on read quality control and filtering. You can find instructions on the theme of raw data processing in the tutorial “NGS reads quality control and filtering”.

1.  Upload source data for Bowtie2

Log in to the InsideDNA application, and navigate to the Files tab. Create a new folder called “Tuberculosis”. Upload the files with processed reads into this folder. If you completed the tasks from the “NGS reads quality control and filtering” tutorial, you should already have a copy of this folder and the files with the processed reads. We will also need a file with the reference genome, which can be downloaded on this page. (Click on Send => Complete Record => File => Format => FASTA, and then on the ´Create File` button).

Upload downloaded file into the Tuberculosis folder.

2.  Build index for reference genome

Bowtie2 takes as input not the reference genome itself, but its processed version, called “index”. When a genome is transformed into an “index” comparison of reads and the search for a matching location in the reference genome are both performed much faster. A special tool called Bowtie2-build creates “index” files using the reference genome as an input.

To start the genome indexing, navigate to the ´Files` tab and click on the ´Console` button. To run an analysis on the InsideDNA cloud-based cluster, we need to use a special wrapper called isub, so the structure of the command should be specified in the following format:

isub [-t TASKNAME] [-c CPU] [-r RAM] [-e “COMMAND”].

The “COMMAND” in this line should be replaced with a Bowtie2-build  command. Bowtie2-build  is a very simple command and we need only specify the path to a reference genome and the prefix for the output files. These output files will form an “index”. Altogether, you should type the following line into Console:

isub -t tub_bowtie2build -c 8 -r 7.2 -e "idna_bowtie2-build /data/userXXX/Tuberculosis/reference.fasta /data/userXXX/Tuberculosis/tuberculosis_index"

Don’t forget to replace XXX in this line with your own user ID. This user ID is written in the header of the Console.

You can monitor the progress of your task in the ´Tasks` tab.

When the task is completed, go to Files and open Tuberculosis folder. You will discover six new files with the prefix tuberculosis_index. These files have following suffixes: .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. In the case of a large index, these suffixes will have a bt2l termination. Altogether, they form an index and we will need all these files when running Bowtie2 for read mapping. The original fasta/fastq reference is no longer used once the index is created.

3.  Align reads using Bowtie2

Now we will use Bowtie2 to align reads to a reference genome (in the form of an index, of course), produced by bowtie2-build. Bowtie2 has many parameters, which allows us to specify the type and format of input data, mode of alignment, its sensitivity to the mismatches, and so on. You can find detailed descriptions of all bowtie2 options on this page.

A note of caution:
Never use default parameters in Bowtie2 (or any other bioinformatics tools) without first reading and understanding what they mean and how to relate the values to your particular dataset. You can read an excellent piece by Keith Bradman on the ACGT blog, on what might happen if you don’t read and understand what’s going on. Also, read the section “How to fine-tune Bowtie2” below if you would like to get a better grasp of how Bowtie2 maps reads.

The mandatory parameters that you need to specify in the command for Bowtie2 are the paths for the files with reads, the index prefix, and the name of the output file. The output file will hold coordinates of the mapped reads with respect to the reference genome (called the SAM-file, which means “sequence alignment map”). The size of the SAM-file may be roughly proportionate to the sum of input files with the reads, so you should consider this when planning experiments with large files.

Depending on the type of input reads (single reads, paired-end, mate pairs), you need to choose different input parameters. For example, in the Tuberculosis case we work with the paired-end data, so we should specify the first and second pair in -1 and -2 parameters, respectively. Our pair has a default orientation, so we don’t need to change the default parameter, which is -fr. In fact, we don’t even need to specify this parameter, because all default parameters are picked up by Bowtie2 automatically. The -U parameter allows you to specify unpaired reads. It is always useful to write Bowtie2 mapping statistics  to the file with the appropriate name, because by default it will be output to a terminal screen and lost once the terminal session is over. In order to write these statistics and to not lose this information, we need to use a special Unix redirect option ‘2>’.

Altogether, you should type the following in Console:

isub -t tuberculosis_bowtie2 -c 8 -r 7.2 -e “idna_bowtie2 -x /data/userXXX/Tuberculosis/tuberculosis_index -1 /data/userXXX/Tuberculosis/R1_paired.fq -2 /data/userXXX/Tuberculosis/R2_paired.fq -S /data/userXXX/Tuberculosis/tuberculosis_sam.sam 2>/data/userXXX/Tuberculosis/tub_stats”

XXX should be replaced by your own user ID.

Press Enter, and wait until the task is completed. You can monitor its progress in the Tasks tab.

When the task is finished, have a look at the resulting log file (we called it tub_stats). It contains important statistics: the number of aligned reads, the number of reads from the pairs, which have aligned concordantly, and the number of reads that aligned exactly once. You can download the file and view it using any text editor, or simply preview it on the InsideDNA application, using cat, less, or vi commands.

4.  How to fine-tune Bowtie2

In principle, you need to understand two important sets of parameters to be able to fine-tune Bowtie2. These are:

  1. multiseed alignment options (includes options: -D, -R, -L, -N, -i)
  2. reporting (-k, -a)

Multiseed alignment:

When Bowtie2 tries to find the matching location on the references for sequencing reads, it splits a read into so-called ‘seeds’. Seeds are substrings of a read. Therefore, to control the speed and accuracy of mapping, you need to specify the length of such substrings (-L), an interval between the substrings on a read (-i), and the number of allowed mismatches (-N). The shorter the length of a substring, the more substrings you are going to have, thus making alignment slower but more precise. Same for the interval: when an interval between substrings is small, you will have more substrings and thus a more thorough search. Again, allowing for mismatches increases comparison space, therefore once again making the process slower but more precise. Remember that interval is defined as a function of read length; therefore, you have a whole bunch of functions that will set the relationship between read length and spacing between substrings.

Because Bowtie2 uses substrings to start the search for mapping locations, there are two parameters that define how to extend seeds to a full read length. These are -D and -R. -D defines how many times Bowtie2 will try to extend the seed in order to find a new best matching location, and -R is the number of times Bowtie2 will re-seed (meaning start a substring with a different offset from that at the beginning of a read) the substring search before reporting the best hit.

Reporting:

When carrying out mapping, Bowtie2 searches for distinct, valid alignments for each read. When it finds a valid alignment, it generally will continue to look for alignments that are nearly as good or better. It will eventually stop looking, either because it exceeded a limit placed on the search effort (see parameters ­-D and -­R for how to change this), or because it already knows all it needs to know in order to report an alignment. Importantly, by default Bowtie2 will only return one valid alignment from the top matches. If you would like to return a greater number of valid alignments, you need to use either parameter -k (which allows us to specify how many such alignments to return) or -a, which will return all of the valid alignments.

5.  Reading Bowtie2 output

Now that we have general understanding of what to expect from Bowtie2 in terms of the reads alignment, let’s explore the structure of the output SAM file. SAM-format is human-readable, and you can open these files in any text editor or using cat, less, head or vi commands. Alternatively, you can use the Console command:

head /data/userXXX/Tuberculosis/tuberculosis_sam.sam

Another way to get familiar with the SAM file structure is to use the preview option on the InsideDNA application.

Like fastq files, SAM files contain reads sequences and quality values for each position of read (this means that it is possible to produce fastq files from SAM-files, if necessary).

Each read in the SAM file has a header containing information about its source. This is followed by a special code, which provides us with information about the parameters of the read alignment — i.e. whether it was aligned concordantly with its pair, aligned exactly one time, or aligned multiple times. For example, code 83 means that a read is mate1 in a pair, that it aligned to the reverse reference strand, and that a mate of this read was successfully aligned. You can find detailed information about the meaning of each code on this page (Section SAM bowtie output). After that comes the header of the source reference genome file, followed by the coordinate of the read’s alignment to the reference genome (for the leftmost character of alignment). The next value is the quality of the alignment, which is calculated based on the number of mismatches between read and reference. Using this information, you can filter the SAM file in order to only preserve reads that are aligned with high quality. This kind of processing can be carried out with the help of special toolset called Samtools.

A short explanation of what mapping quality is (from the Bowtie2 website):

"Importantly, the aligner cannot always assign a read to its point of origin with high confidence. For instance, a read that originated inside a repeat element might align equally well to many occurrences of the element throughout the genome, leaving the aligner with no basis for preferring one over the others. Aligners characterize their degree of confidence in the point of origin by reporting a mapping quality: a non-negative integer Q = -10 log10 p, where p is an estimate of the probability that the alignment does not correspond to the read's true point of origin. Mapping quality is sometimes abbreviated MAPQ, and is recorded in the SAM MAPQ field.

Mapping quality is related to "uniqueness". We say an alignment is unique if it has a much higher alignment score than all the other possible alignments. The greater the gap between the best alignment's score and the second-best alignment's score, the more unique the best alignment, and the higher its mapping quality should be.

Accurate mapping qualities are useful for downstream tools such as variant callers. For instance, a variant caller might choose to ignore evidence from alignments that have a mapping quality of less than, say, 10. A mapping quality of 10 or less indicates that there is at least a 1 in 10 chance that the read truly originated elsewhere".

In our next tutorial, we are going to look into SNP calling in more detail, and will be using Samtools for this purpose. Stay tuned!

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials