In this tutorial we will quantify the number of segmental duplications in the Human genome. We will try to replicate the results on segmental duplications length distribution from Massip F, Arndt PF (2013) paper. However the main goal of this tutorial is not making reasonable biological conclusions but having some experience on running tools from MUMmer package on the InsideDNA platform.
In the Massip F, Arndt PF (2013) paper the mathematical model for duplications dynamics was proposed. The authors made an alignment of Human reference genome against itself. From this alignment all stretches of DNA that match exactly between different regions of Human genome (except for trivial matches of the genome regions with itself) were predicted. It was found that the number of exact matches (putative segmental duplications fragments) exceeds the number that we expect from the same random sequences alignment for any sufficient match length taken. The same result was observed after low complexity repetitive sequences masking with RepeatMasker and running alignment again. The number of exact matches with the length L was proportional to L-3. The mathematical model which explains this distribution was also proposed in this paper.
In this tutorial we will try to get the same power law distribution of exact matches length in Human genome. We will use MUMmer to align Human genome against itself. The tools of MUMmer package use a 'mummer' matching algorithm, which builds and searches a suffix tree data structure. Suffix trees can be built and searched in linear time while naive versions of dynamic programming alignment use O(n2) time. Because of high speed of alignment MUMmer is an optimal tool for long sequences (genomes) alignment.
Before running the tool we need to make several preliminary steps. It is more convenient to do them in the terminal; to open it click on the "show terminal" button and type:
mkdir mum - to create working directory;
cd mum - to move into working directory;
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFaMasked.tar.gz - to download Human reference genome from UCSC database. We take the version with repeats masked by RepeatMasker and Tandem Repeats Finder;
tar -zvxf chromFaMasked.tar.gz - to uncompress the archive file;
rm -- !(chr1.fa.masked) - to remove all files in the directory except for 1st chromosome fasta file (we will limit our analysis to one chromosome).
Now we can run MUMmer for alignment. We will do this from InsideDNA interface. There are 3 main alignment modes in the MUMmer: mummer, NUCmer and PROmer. Mummer looks for exact matches between 2 sequences. NUCmer finds maximal exact matches, clusters them to form larger inexact alignment regions and outputs these extended alignments. PROmer works like NUCmer, but input sequences are translated into all six amino acid reading frames before alignment. This allows PROmer to identify regions of conserved protein sequences that may not be conserved on the DNA level. In our test we will use mummer to find exact matches. To run it you need to find mummer in the tools list ("Tools" button) and specify the following parameters:
Task name (1);
Input reference and qurey files (2, 3);
Output directory and output file name (4, 5);
Increase threshhold for minimal match length to 40 nt (6);
specify -maxmatch to output all exact matches regardless of its uniqueness (7);
specify -r to search for reverse complement matches (8);
specify -n to match only A, C, T, G characters but not Ns (9);
specify RAM size and cores number (10);
click "Submit task" to run the tool.
Eventually we will have a list of all exact matches of length > 40 nt between forward and reverse strands of the genome.
Run MUMmer once again with all the same arguments except for changing -maxmatch argument on -mum and output file. This argument makes mummer output matches that are unique in both the reference and query. Running the tools twice might sound strange, but I will give an explanation. We used alignment between forward and reverse strand to get rid of trivial alignments of the region with itself (biologically the rest is the same between forward and reverse strands). Mummer with -maxmatch argument reports all possible pairs of exact matches so for the region which is present in n copies in the genome we expect 2n(n-1) matches reported. Mummer with -mum argument, on the other hand, reports only unique matches. Since we are interested in the number of matches of each length - none of these results is actually what we are looking for. Filtering the output file with match coordinates is a quite complicated programming task, however we will use an approximation based on the fact that the true number of exact matches of specific length lies between mummer -maxmatch and mummer -mum results. With this approximation we can skip the filtering step.
Now let's take a look at the output files maxmatch/mum (use "Preview" button). You will see 3 columns: 1st one - reference coordinate of the match, 2nd column - query coordinate of the match; 3rd column - the match length.
 Massip F., Arndt P. F. " Neutral evolution of duplicated DNA: an evolutionary stick-breaking process causes scale-invariant behavior" - Physical review letters (2013)
Follow us on Facebook and Twitter to be the first to read our new tutorials!Run this tool More tutorials