Introduction To Variant Calling In Samtools And Freebayes

Author : InsideDNA Time : 31 August 2017 Read time : 6 min

This tutorial once again covers the topic of variants calling from DNA-Seq data and variants annotation. You can look at some of our tutorials related to this topic (for example, https://insidedna.me/tutorials/view/variant-annotation-snpeff-human-tumor). In this tutorial we will use some more tools from SAMtools + BCFtools and FreeBayes packages to call variants (SNVs and short indels) from processed alignment files. Also we will use the easiest way to annotate our resulting variants with SnpEff. Usually the pipelines related to variants calling start from unaligned reads files and include filtering, alignment, duplicates masking etc. steps, however, in this tutorial we will call the variants from already processed alignment file.

Firstly, we need to create the working directory. To do this you need to open the terminal (''Show terminal" button) and type the following commands:

mkdir call - to create working directory;

cd call - to move into working directory;

Introduction to variant calling in Samtools and Freebayes screen 1

Than we need to download 4 files from 1000 Genomes project database: NA12878 sample alignment file in BAM format, alignment index file, reference Human genome (hg19) fasta file and reference index file.

wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase3/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
wget ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase3/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam.bai
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai

gunzip human_g1k_v37.fasta.gz - to extract file from archive;
mv [old_filename] [new_filename] - to change the filenames on shorter ones;
ls - to see the files in the directory;

Introduction to variant calling in Samtools and Freebayes screen 2

As you can see we downloaded only the part of alignment that comes from 20th chromosome of Human genome in order to reduce the time for further steps. Then it is better (but not necessary) to reduce our reference genome file to only one (20th) chromosome since we don't need the rest:

awk 'BEGIN {RS=">"} /20 / {print ">"$0}' human_g1k_v37.fasta > chr20.fa

We also need to index this new fasta file:

isub -c 1 -r 1.7 -t ind -e '/srv/dna_tools/samtools/bin/samtools index /data/userXXXX/call/chr20.fa'

Now let's run FreeBayes tool for variants calling. In the nutshell, FreeBayes is a Bayesian variant detector which is used to find small-scale variants (SNVs, indels and complex events). FreeBayes reconstructs haplotypes from reads data and then finds variants based on predicted haplotypes. To run FreeBayes we need to run the following command:

isub -c 1 -r 1.7 -t free -e '/srv/dna_tools/freebayes/freebayes -f /data/userXXX/call/human_g1k_v37.fasta /data/userXXXX/call/NA12878.chrom20.bam > /data/userXXXX/call/NA12878.free.vcf'

where -f specifies reference fasta file;

As a result we will see the VCF file that includes all predicted variants (reference + observed allele), their qualities etc. (use less command to read the file).

Introduction to variant calling in Samtools and Freebayes screen 3

Now let's switch to one more alternative method for variants calling. There are 2 steps that include samtools mpileup and bcftools commands. Samtools mpileup takes reference and alignment input files and for each position of reference genome outputs alignment information that includes: reference base, the number of reads covering the site, read bases, base qualities and alignment mapping qualities. This information is written in VCF or BCF (VCF compressed binary file) format. Bcftools uses this data to actually predict variants for each position independently based on quality data and the bases observed in the reads covering this position. Bcftools also outputs VCF or BCF files with predicted variants. To run these tools use the following commands:

isub -c 1 -r 1.7 -t mpl -e '/srv/dna_tools/samtools/bin/samtools mpileup -fuv /data/userXXXX/call/chr20.fa /data/userXXXX/call/NA12878.chrom20.bam > /data/userXXXX/call/NA12878.mpl.vcf'

where -f specifies reference file, -u and -v options are specified to output in uncompressed VCF file.

isub -c 1 -r 1.7 -t bcf -e '/srv/dna_tools/samtools/bin/bcftools view -SNv /data/userXXXX/call/NA12878.mpl.vcf > /data/userXXXX/call/NA12878.bcf.vcf'

where -S specifies that input is VCF file, -N - to skip sites that are not A/T/C/G in reference sequence, -v - to output only variants sites.

As a result, there will be a VCF file which contains variants predicted by BCFtools with the same structure as the one from FreeBayes output.

Now let's try to annotate SNVs from one of our output VCF files. Annotated SNVs will have an additional information about the genes where they fall and relative position (upstream, intron, exon etc.). Before running SnpEff tools we need to create the dictionary file for our reference genome. Run the following command:

isub -c 1 -r 1.7 -t dict -e 'java -jar /srv/dna_tools/picard_1.138/dist/picard.jar CreateSequenceDictionary R= /data/userXXXX/call/chr20.fa O= /data/userXXXX/call/chr20.dict'

where R= and O= options specify reference fasta file and the output dictionary name respectively.

You will see the output dictionary file with the simple structure: one line for each fasta entry (only one in our case) which includes the length of the entry sequence.

Introduction to variant calling in Samtools and Freebayes screen 4

Now we can run the annotation step itself:

isub -c 4 -r 3.6 -t ann -e 'java -Xmx4g -jar /srv/dna_tools/snpEff/snpEff.jar ann -c /srv/dna_tools/snpEff/snpEff.config hg19 /data/userXXXX/call/NA12878.free.vcf > /data/userXXXX/call/NA12878.ann.vcf'

hg19 specifies the database to use. This database includes annotation data on Human genome hg19. Other popular databases are also installed at InsideDNA platform.

As a result we will see the VCF file with several additional columns added that include gene related information (annotation itself).

Introduction to variant calling in Samtools and Freebayes screen 5

To reduce this file and make it more readable we can turn it into simple table in the following way:

isub -c 4 -r 3.6 -t table -e 'java -jar /srv/dna_tools/gatk/GenomeAnalysisTK.jar -T VariantsToTable -R /data/userXXXX/call/chr20.fa -V /data/userXXXX/call/NA12878.ann.vcf -F CHROM -F POS -F REF -F ALT -F ANN -o /data/userXXXX/call/NA12878.tab'

where -R option specifies the reference file, -V - input VCF file, -o - output file name and -F - the columns to include in the output table.

The resulting table includes the columns that we specified and is much easier to read and analyze.

Introduction to variant calling in Samtools and Freebayes screen 6

You may also interested in

How to use GATK and Picard to filter low quality variants?

Follow us on Facebook and Twitter to be the first to read our new tutorials!