Samtools guide: learning how to filter and manipulate with SAM/BAM files

Author : InsideDNA Time : 22 March 2016 Read time : 9 min

Samtools is a set of utilities that manipulate alignments in the BAM format. It imports from and exports to the SAM (Sequence Alignment/Map) format, does sorting, merging and indexing, and allows to retrieve reads in any regions swiftly. In one of our previous tutorials, we mapped reads with TopHat and obtained BAM files. However, before we can use these BAM files in downstream analysis, we need to learn basic and more advanced operations which allows to deal with the file, filter them and pre-process. In this tutorial, we explain how to manipulate with BAM files with samtools - an excellent suite of bioinformatics commands which allows various operations on SAM/BAM files.

SAM file format

SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must be prior to the alignments. Header lines start with '@', while alignment lines do not.

A full description of the SAM format can be found here. SAM aims to be a format that:

  1. Is flexible enough to store all the alignment information generated by various alignment programs;
  2. Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
  3. Is compact in file size;
  4. Allows most of operations on the alignment to work on a stream without loading the whole alignment into memory;
  5. Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.

Example of the SAM file:

Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and variable number of optional fields for flexible or aligner specific information:

  1. Read Name
  2. SAM flag
  3. chromosome (if read is has no alignment, there will be a "*" here)
  4. position (1-based index, "left end of read")
  5. MAPQ (mapping quality - describes the uniqueness of the alignment, 0=non-unique, >10 probably unique)
  6. CIGAR string (describes the position of insertions/deletions/matches in the alignment, encodes splice junctions, for example)
  7. Name of mate (mate pair information for paired-end sequencing, often "=")
  8. Position of mate (mate pair information)
  9. Template length (always zero for me)
  10. Read Sequence
  11. Read Quality
  12. Program specific Flags (i.e. AS is an alignment score, NH is a number of reported alignments that contains the query in the current record)

Converting BAM to SAM and vice versa

'samtools view' command allows you to convert an unreadable alignment in binary BAM format to a human readable SAM format. Download the data we obtained in the TopHat tutorial on RNA expression in human brain. You can submit samtools via isub wrapper. Remember to change the path /data/userXXX/ to your user ID which you can find in the header of the console:

isub -t bam2sam -c 1 -r 1.7 -e 'idna_samtools_view -h /data/userXXX/out/accepted_hits.bam > /data/userXXX/bam_manipulating/accepted_hits.sam'

-h includes the header in the output SAM file.

This command normally writes its output to a standard output, which is your terminal by default. If the notation ‘> file’ is appended to any command that normally writes its output to a standard output, the output of that command will be written to a file instead of your terminal. Some commands like 'samtools sort' or 'samtools index' write their output to a file specified in the command itself. When tool prints its output to a standard output, it helps "piping" the tools into pipelines and redirecting output of one tool as an input into another tool. But often we want to save intermediate results into a file and thus need to redirect output with a '>' sign.

We directed the output of this command to a file called accepted_hit.sam. Now you can view your alignment with Unix command 'less':

less bam_manipulating/accepted_hits.sam

You also can view a specific region of your alignment, for instance 1000-2000 region of 20 chromosome:

isub -t bam2sam_1000_2000 -c 1 -r 1.7 -e 'idna_samtools_view -h chr20:1000-2000 /data/userXXX/out/accepted_hits.bam > /data/userXXX/bam_manipulating/accepted_hits_1000_2000.sam'

To convert back to a bam file use command:

isub -t sam2bam -c 1 -r 1.7 -e 'idna_samtools_view -b -S /data/userXXX/out/accepted_hits.sam >  /data/userXXX/bam_manipulating/accepted_hits.bam'

-b specified the output in the BAM format and -S required if input in SAM format

Sorting and indexing the BAM file

Some downstream analysis programs that use BAM files actually require indexed BAM file. For example a common tool for genome alignment visualization the Integrative Genomics Viewer needs such format (IGV).

Indexing aims to achieve a fast retrieval of alignments overlapping a specified region without going through the whole alignments. The BAM file must be sorted by the reference ID and then the leftmost coordinate before indexing.

So, first we need to do sorting command for our BAM file:

isub -t sort -c 1 -r 1.7 -e 'idna_samtools_sort /data/userXXX/out/accepted_hits.bam /data/userXXX/bam_manipulating/accepted_hits_sorted'

Pay attention to the command syntax: in 'samtools view' command the output was directed to a standard output stream, the syntax of 'samtools sort' command allows to include a prefix of the output file in the command itself. Therefore, you don’t need '>' sign to redirect the output from the screen to a file.

Now you can index your sorted BAM file with the 'samtools index' command:

isub -t index -c 1 -r 1.7 -e 'idna_samtools_index /data/userXXX/bam_manipulating/accepted_hits_sorted.bam'

This will create a file named accepted_hits_sorted.bam.bai which contains the index.

Above we reviewed the widely used samtools commands, but samtools can help you in getting additional information about you alignment. Also, we would like to show some examples of using samtools for unusual problems you may face.

Counting number of mapped reads

The number of entries in the BAM file is not the number of reads, but the number of alignments. And if there are multiple alignments allowed per read, you will have more alignments than reads. To count the number of mapped reads we can use the command:

isub -t read_number -c 1 -r 1.7 -e 'idna_samtools_view -c /data/userXXX/out/accepted_hits.bam > /data/userXXX/bam_manipulating/read_number'

The number of mapped reads written in the output file is 23641 for our example, while the number of alignments is a 23642. We can see this by using unix command 'wc -l' which counts number of lines in a file:

wc -l bam_manipulating/accepted_hits.sam

Do not forget about three header lines and that wc counts only new lines, so we should add one:

23644 - 3 + 1 = 23642

So, 23642 is the number of alignments in BAM file.

Remove low quality mapped reads

In SAM file the quality of mapped reads is defined in by so-called MAPQ values - MAPping Quality. It equals −10 log10 Probability {mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available. Now, it is very important to remember that MAPQ values generated by different aligners (e.g. Bowtie2, TopHat, BBMap, BWA) are not exactly comparable. For example, the maximum value of MAPQ score in Bowtie2 is 42, whereas in BWA - 37. You can read a great piece about why this happens in the ACGT blog post.   

Another important thing not everyone knows is how MAPQ value is actually calculated. In our of our previous tutorial, we mentioned that it defines how “uniquely” a particular read maps to a reference. In reality, of course, it is more complex than that and you should check out great post here here to understand the exact procedure on how score gets calculated.

Lastly, remember that TopHat outputs only MAPQ scores of 0, 1, 3, or 50. The first three values indicate mappings to 5 or more locations, 3-4 locations, or 2 locations, whereas a value of 50 represents a unique match. Also, in older versions of TopHat unique matches were identified with a value of 255. Therefore, you must understand that values of Bowtie and Bowtie2 (used internally by TopHat to do read mapping) produce a different range of MAPQ scores (0–42).

Now, as we understand a little bit better MAPQ scores, we can filter BAM/SAM files on the mapping quality. eg. getting all reads with a mapping quality larger than 30 (you could also use 49 if you wanted to:). This will remove all reads mapped the undesirable mapping qualities and will only keep uniquely mapped reads.

isub -t rm_quality -c 1 -r 1.7 -e 'idna_samtools_view -q 30 -b /data/userXXX/out/accepted_hits.bam > /data/userXXX/bam_manipulating/accepted_hits_q30.bam'

-q option skips the alignments with MAPQ smaller than a certain value, option -b specified output file in BAM format.

Well done! you have learned some basic samtools command and now understand better how to deal with SAM/BAM files. We will continue covering this topic in our upcoming tutorials, so keep reading!

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials