SAM files processing and variants calling in bacterial genomes

Author : InsideDNA Time : 15 March 2016 Read time : 10 min

Processing and analysis of SAM files is a key step in many bioinformatics pipelines. It is essential for both biomedical and comparative bioinformatics. As we explained before, SAM files contain coordinates of reads aligned (mapped) to a reference sequence. Reference sequences can be a single assembled genome or a set of contigs or scaffolds. SAM files can be produced by various read mappers - for example, Bowtie2, BWA, BBmap. In this tutorial, we continue to work with the drug-resistant strains of Mycobacterium tuberculosis and demonstrate how to post-process, sort and clean SAM files with mapped reads and do basic variant (SNP) calling with varcall from ea-utils.

A SAM file is a TAB-delimited file. It has header lines, which start with the ‘@’ symbol, and a tag (HD, SQ, RG, PG), and alignments. Each alignment has the following columns:

  1. QNAME - Query template/pair NAME
  2. FLAG - bitwise FLAG
  3. RNAME - Reference sequence NAME
  4. POS - 1-based leftmost POSition/coordinate of clipped sequence
  5. MAPQ - MAPping Quality (Phred-scaled)
  6. CIGAR - extended CIGAR string
  7. MRNM - Mate Reference sequence NaMe (‘=’ if same as RNAME)
  8. MPOS - 1-based Mate POSistion
  9. LEN - inferred Template LENgth (insert size)
  10. SEQ - query SEQuence, on the same strand as the reference
  11. QUAL - query QUALity (ASCII-33 gives the Phred base quality)
  12. OPT - variable OPTional fields in the format TAG:VTYPE:VALUE

It is important to understand that Bowtie2 and analogous bioinformatics tools retain the same order of reads in the SAM file as they appear in the original FASTQ/FASTA files used in mapping. Nevertheless, it is possible to analyze SAM files much faster if reads are listed in the order of coordinates in the reference genome. The process of reads re-ordering is called sorting. So, the raw SAM-files need sorting. Remember, most downstream bioinformatics tools take only sorted SAM/BAM files. Sorting is done based on column 3 (RNAME) and 4 (POS) of the SAM file (so, the reference sequence name and a position). When sorting is defined, a SAM file will also contain @SQ lines defining the exact order of sorting.

Once reads are sorted, the mapped reads can be cleaned of reads of low quality. For instance, we may want to discard reads that weren’t aligned to any coordinates of a reference genome or alternatively, filter reads based on the quality of the alignment.

We will use a toolset called samtools to carry out sorting and cleaning. Once such pre-processing is complete, clean SAM files can be used to produce so-called vcf-files. VCF files contain information about all varying positions (SNPs) in a sample, relative to a reference genome. In this tutorial, we will use a tool called varcall from ea-utils tools to carry out SNP calling from the SAM files.

In this tutorial you will need a SAM-file and a reference genome. We will continue to work with the genetic information of drug-resistant strains of Mycobacterium tuberculosis. In the previous tutorials in this series we processed raw reads from bacterial genomes and obtained SAM-files with the coordinates of aligned reads relative to the reference genome. To follow all of our steps in the process you can check out the tutorials “NGS reads quality control and filtering” and “NGS reads alignment on reference genome”.

1.  Upload source data

Log in to the InsideDNA application and navigate to the Files tab. Create a new folder called “Tuberculosis”. Upload the files containing processed reads into this folder. If you completed the “NGS reads quality control and filtering” and “NGS reads alignment on reference genome” tutorials, you will already have this folder and files with aligned reads and a reference genome. You can also download these files using this link.

2.  Filtering unmapped reads from SAM files

To discard reads that don’t match any fragment of the reference genome, we will use the samtools view tool. We will also simultaneously compress the SAM file into so-called BAM-file (binary alignment map). BAM files need about 4-5 times less hard drive space in comparison with a SAM file, which is of particular importance for large datasets. In addition, many tools work faster when fed with BAM files rather than SAM, because BAM is internally optimized.

Navigate to the Files tab and click on the Console button.

You need to specify the name for your task – for example, samtools_view_tuberculosis – as well as the parameters of a compute node. You can choose node capacity by modifying the number of cores and amount of RAM using the -c and -r options. To specify the format of the input file, select the –S option (SAM-file). We want to produce an output file in compressed BAM format, therefore select the -b option. To keep reads that are aligned to a reference genome, you should use the –F option with a value of 4. The value 4 tells samtools to keep only mapped reads in the SAM-file. You can also discard all the reads except ones with a specific marker – but in this case you need to use the –f option. This could be useful if you want to explore the source of contamination of your samples, or are interested in fundamental discrepancies between your sample and the reference genome. We aim to study SNPs or point mutations, thus we need to keep only mapped reads. To write samtools view output into a file, use > sign and specify the name of this file.

-F/-f parameters allow us to filter reads based on a value in the FLAG column. The table below provides a list of possible values in the FLAG column (1, 2, 4, 8, etc). -F functions as “do not output reads that have a particular code”, whereas -f functions as “only output reads that have a particular code”. 

A very useful website, which allows you to easily identify which code you need to use for your query, is Explain Flags. Please note that -F/-f and their codes work like the AND operator. Therefore, you can combine codes and filter out interesting pieces of SAM file.

Altogether, you should type the following line in Console:

isub –t samtools_view_tuberculosis –c 1 –r 0.6 –e “idna_samtools_view –F 4 –Sb /data/userXXX/Tuberculosis/tuberculosis.sam > /data/userXXX/Tuberculosis/tuberculosis.bam”

Don’t forget to replace XXX here with your own userID, which can be found in the header of the Console tab.

The task will take few minutes and you can monitor its progress in the Task Tab.

3.  Sorting SAM files

Reads in a SAM file produced by bowtie2 are listed in the order of source read files, so perform reads in the compressed BAM file. We aim to compare reads from a sample to a reference genome, and thus it is much more convenient to sort reads according to the coordinates of the reference genome. We will use another tool from samtools set, called samtools sort, to complete this task.

As usual, to run a tool with Console, use the isub wrapper, and specify task name, number of cores and the amount of RAM. For such a small genome minimal resources should be sufficient. You only need to specify two parameters for samtools sort: the path to the input file and a prefix of the output file. Therefore, you should type the following in the Console:

isub –t samtools_sort_tuberculosis –c 1 –r 0.6 –e “idna_samtools_sort /data/userXXX/Tuberculosis/tuberculosis.bam > /data/userXXX/Tuberculosis/sorted_tuberculosis”

The task will take a few minutes for our small bacterial genome. Sorted BAM-file will take even less disk space than the original SAM one.

4.  Call for variants using a sorted BAM-file and a reference genome

Finally, we have all the components necessary to call the SNP in our sample. We will use a varcall tool from the ea-utils set for this purpose. To run varcall with the sorted BAM files and a reference fasta as input, you should type the following command into Console:

isub –t varcall_tuberculosis –c 8 –r 7.2 –e “idna_varcall -v /data/userXXX/Tuberculosis/sorted_tuberculosis.bam –f /data/userXXX/Tuberculosis/reference.fasta –m 5 –o tuberculosis_variants”

The -m parameter specifies the minimal read coverage for a variant to be included into an output vcf file - in our case, it is 5, meaning that we want a variant to be “covered” by at least 5 reads.

When the task is completed, you can explore the structure of the vcf-file, which will contain information about SNP variants in a sample compared to the reference genome.

5.  Becoming familiar with the structure of the VCF file

You can explore the structure of the vcf-file, which is human-readable, either using the command

head tuberculosis_variants.vcf

or, simply by using the Preview option of File Manager.

The header of a file begins with the ## sign and contains one or several lines with general information. Then come the lines characterizing each genetic variant. The first column specifies a chromosomal number. In our case we have only one long reference stretch, and therefore the value in the first column is identical across variants. The next column displays coordinates of the variant relative to the reference genome. The subsequent column provides a name of the variant (of course, only if the variant has been previously characterized). For example, there are some large variants in human genomes that are named as: rsXXXXXX, where Xs are numbers. The next two columns contain nucleotides present in a reference genome and in a sample, respectively. For example, in our case the reference strain of the Mycobacterium tuberculosis has C in position 1849 of the genome, and the strain in the explored sample has A. The next column characterizes the quality of the discovered variant. Parameter DP in the subsequent column is also of interest and characterizes the depth of coverage. Importantly, you must always filter the obtained variants based on the quality of the variant and the depth of the read coverage. We will come back to this filtering in one of our next tutorials.

Congratulations, you discovered genetic variants, which can underlie the exceptional drug resistance of the explored strain of Mycobacterium tuberculosis. The next steps can include prediction of genes in the genome of this bacteria and deciphering which mutations in the coding and non-coding regions might impact different phenotypes of the strains.

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials