Analysis of ancient DNA samples using mapDamage

Author : InsideDNA Time : 01 August 2016 Read time : 6 min

If not hold in proper conditions (i.e. -20 °C or colder) – DNA rapidly degrades, its fragments become shorter and some bases disappear due to random hydrolyzes. Moreover, some bases turn into others bases due to deamination process. Methylated cytosine transforms into thymine and adenine – into hypoxanthine, resulting in wrong sequencing reads. Because of this features analysis of ancient DNA is particularly tricky.

Unfortunately, there are only few bioinformatics tools that can tackle ancient DNA samples and produce correct results. In this Tutorial we will work with a rare example of such tool, MapDamage. MapDamage quantifies DNA damage patterns among ancient DNA sequencing reads generated by next-generation sequencing platforms. The model enables rescaling of base quality scores in SAM files according to their probability of being damaged. This is a crucial step for correct variant calling in later stages of sequence analysis.

1. Upload source data

In this Tutorial we will use mitochondrial reads of ancient indigenous of South America. Necessary data can be obtained in this research. You can download all necessary files using this link.

Log into InsideDNA application, navigate to File Manager and create a folder called “Botocudos”.

Upload files into this folder.

File contain processed mitochondrial reads, ready for alignment into Revised Cambridge Reference Sequence (rCRS) of the Human Mitochondrial DNA.

2. Align reads into mitochondrial genome using bwa aln

There were some research (for example, this one) on which aligner and with which parameter values yield best results when processing ancient DNA sequencing data. Researchers generally use BWA aligner and disable seeding to improve algorithm’s sensitivity. We will follow these recommendations, just like the authors of our data.

Open Terminal to start analysis.

To run BWA aligner, we need to create an index for our reference sequence. Our reference is sequence is rcts.fasta

Enter the following command in Terminal:

isub -t bwa_index -c 8 -r 7.2 -e "idna_bwa_index /data/userXXX/Botocudos/rcrs.fasta"

Don’t forget to replace XXX in this command with your own userID. Your user ID is shown in the header of the Terminal tab.

Press Enter and wait until your job is done. You can monitor progress of your task in Tasks bar.

When the task finishes, you can run BWA tool called BWA ALN. Enter this command in Terminal:

isub -t bwa2 -c 4 -r 3.6 -e "/srv/dna_tools/bwa/bwa aln -t 8 -i 0 -o 2 -n 0.02 -l 1024 -e 7 /data/userXXX/Botocudos/rcrs.fasta /data/user703/Botocudos/filtered_reads.fastq > /data/userXXX/Botocudos/mito_alignment_aln.sai"

The output of BWA ALN algorithm is .sai file. We need one more command (BWA SAMSE), to produce a .sam file from the .sai file:

isub -t bwa4 -c 8 -r 30 -e "/srv/dna_tools/bwa/bwa samse /data/user703/Botocudos/rcrs.fasta /data/userXXX/Botocudos/mito_alignment_aln.sai /data/userXXX/Botocudos/filtered_reads.fastq > /data/userXXX/Botocudos/mito_alignment_aln.sam"

Then we will create a compressed version of our SAM file – BAM file.

isub -t view -c 8 -r 7.2 -e "idna_samtools_view -Sb /data/userXXX/Botocudos/mito_alignment_aln.sam /data/userXXX/Botocudos/mito_alignment_aln.bam"

Let’s check how many reads were successfully aligned to our reference genome. For this purpose you can run Bamtools stats tool:

isub -t stats -c 4 -r 3.6 -e "idna_bamtools_stats -in /data/userXXX/Botocudos/mito_alignment_aln.bam > /data/userXXX/Botocudos/mito_alignment_stats.txt"

When this task is finished, you can view file mito_alignment_stats.txt. Let’s view this file using “less” UNIX command in our Terminal:

less /data/userXXX/Botocudos/mito_alignment_stats.txt

As you can see from the summary statistics, only about 10% of reads were successfully aligned onto our reference. This result is not so bad for an ancient DNA, because the sample are commonly heavily contaminated with the DNA of microorganisms living on decaying remains.

To filter out unmapped reads, let’s run samtools view one more time:

isub -t view -c 8 -r 7.2 -e "idna_samtools_view -F 4 /data/userXXX/Botocudos/mito_alignment_aln.sam /data/userXXX/Botocudos/mito_alignment_aln_filtered.bam"

We will use the resulting bam file as input for MapDamage tool.

3. Rescale values of quality in bam file using mapDamage

Before we run MapDamage, let’s create a folder called “mapdamage” for it’s output.

To run MapDamage, we will use the web interface. To see available tools with the graphic interface, navigate to Tools tab. Then enter “mapdamage” in search field. When the search results appear, select MapDamage tool and click on Run button.

MapDamage has a plenty of settings: for instance, you can specify parameters of the model, type of algorithm which predicts damaged nucleotides. We will run MapDamage with the default settings. Therefore, you only need to specify a name of a task, then select a bam file we created on a previous step (mito_alignment_aln_filtered.bam), add file with our reference (rsrs.fasta) and an output folder to analysis results (mapdamage folder). If you don’t have this folder, you can always create it in File manager or mini File manager (see details here)

Select the parameters of the virtual machine – 8 cores CPU and 7.2 Gb RAM and click on Submit button.

When the task is done, you will discover a bunch of files in your mapdamage folder in File Manager or Terminal.

Most of them contains position-specific statistics on frequencies of various types of damages. You also have some diagrams with visualization of these statistics and convergence of MCMC chains. For example, you can see such picture in fragmisincorporation_plot.pdf:

Such distribution of damages is common for ancient DNA, because DNA damage most frequently occurs at the ends of the fragments

The most interesting file for us is the one with extension .rescaled.bam. It has the SAM quality values fixed according to their probability to be damaged. Using this file you can now call SNP variants and be more sure that you accounted for damaged nucleotides in your ancient DNA samples.

4. Calling for variants using varscan

To call for the variants you need to sort rescaled bam. To sort bam file we will use samtools sort command:

isub -t bam_sort -c 8 -r 7.2 -e "idna_samtools_sort /data/userXXX/Botocudos/mapdamage/mito_alignment_aln_filtered.rescaled.bam /data/userXXX/Botocudos/rescaled_sorted"

When you obtain the sorted bam files, you can use varscan to discover varying positions between the reference mitochondrial DNA (the one you’ve used throughout the tutorial) and the DNA in the sample:

isub -t varcall -c 4 -r 3.6 -e "idna_varcall -v /data/userXXX/Botocudos/rescaled_sorted.bam -f /data/userXXX/Botocudos/rcrs.fasta -m 5 -o /data/userXXX/Botocudos/variants"

The resulting file variants.vcf will contain all information on varying positions of ancient mitochondrial genome.

Well done, now you got an overview of some stages of ancient DNA analyses!

Refer to our extensive Knowledge base for more tutorials, tips and tricks of bioinformatics and genomics analysis.

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials