Gene prediction is one of the most common tasks in bioinformatic analysis of newly sequenced genomes. AUGUSTUS is an excellent gene prediction tool which works with eukaryotic genomes. It allows to predict genes ab initio (de novo) or based on some hints (e.g. RNA-seq/EST, protein alignments, synthetic genomic alignment). In this tutorial we explain how to use protein profiles to improve gene search in the genomic fasta files. For this purpose, we discuss AUGUSTUS protein profile extension (PPX) and explain all steps necessary to run a prediction with an addition of a protein profile.
PPX extension allows to supplement gene prediction procedure with the information about protein family conservation. Information about protein family conservation normally comes from so called protein block profiles. Normally, protein profile files contain position-specific frequency matrices that model conserved regions in a multiple sequence alignment (MSA) with no indels. When PPX extension is used for gene prediction, those genes that match provided profiles are predicted with much higher prediction accuracy then the rest of the genes predicted ab-initio.
When use of PPX is worth it?
Imagine that you are working on the evolution of fish and make a hypothesis that adaptation to different water depths has affected color vision in a particular fish species (for instance, cichlid fish). As a result, gene families involved into formation of color vision could have greatly expanded and experience positive selection.
Or, in reverse, you work with a fish species which inhabits lightless caves and physiologically lost vision (e.g. blind cavefish) – so you want to know whether gene families involved into vision contracted and some genes became non-functional.
One of the most important vision proteins is Rhodopsin.
Given your hypothesis, you would like to learn how genes of this family were expanding or contracting in a fish genome. In order to address this question, one of the steps you would want to do is to (1) obtain Rhodopsin protein profile from PFAM database and (2) use profiles in Augustus-PPX pipeline in order to improve ab initio gene prediction for vision-related genes.
We are going to predict genes for Zebrafish species and use Rhodopsin protein profile for improved prediction of vision-related genes.
We need to download two sets of data for our analysis:
1) FASTA file with chromosome 8 for Zebrafish genome from NCBI
2) Protein profile (multiple sequence alignment) for Rhodopsin_N (PF10413) family from PFAM
You can either download full matrix and chromosome fasta file now or you will also have a chance to learn how to obtain this matrix during the tutorial.
Log in (or sign up if you have not yet) into InsideDNA application and read Introduction Tutorial to get familiar with different options available on the website. Once you learned the basics, go to Files in the top navigation menu and create a new folder called “zebrafish_augustus”.
Upload into this directory fasta file with Zebrafish chromosome 8 and full protein profile matrix (download both files).
Go to Tools in the main (top) navigation menu and create a project called zebrafish_augustus:
While we have already prepared our Rhodopsin MSA file, we are going to demonstrate how to obtain protein profile of interest using PFAM database.
Select Rhodopsin_N from the list:
On the page describing Rhodopsin_N family, go to Alignment.
In the alignment section, you have a choice between seed matrix and full matrix. On the one hand, an alignment with fewer sequences (e.g. seed) is more likely to contain a good number of blocks. This is because MSA should not contain any gaps in the blocks and thus even a single sequence with an indel can make a block useless. On the other hand, more sequences are likely to hold more useful information and will boost gene prediction accuracy.
We are going to use a full matrix. Select Download full matrix in FASTA format or get the sequence from here.
Once this task is completed, check an error log of this file. Error log is named <taskname><date><time>.err and is located in tasks_err folder. Unfortunately, our full alignment contains too many sequences: in each column, at least one of the sequences has a gap and thus our final file is 0Kb and error log shows following:
As message in the error log suggests, we should now run prepareAlign. This tool will resolve the issue by deleting gap-rich sequences from the alignment. In Tools > Zebrafish_augustus select prepareAlign and fill the settings:
When this task is completed, verify the content of PF10413_filtered.txt file in the File Manager.
Now we will convert filtered PF10413_filtered.txt to protein profile file using Msa2prfl tool. We use following settings:
Also, click on Show Secondary Parameters and fill in additional settings:
The maximal entropy of 0.75 means that block parts with an entropy of more than 75% of the random entropy are discarded. Parameter prefix_from_seqnames is useful for PFAM alignments which are usually partial alignments that do not cover the full member sequences, with sequence names followed by the coordinates in the form /NNN-NNN. If the parameter is specified, these coordinates are added to the profile coordinates.
Additionally, turn on the logging of an output log file with blocks. This option is useful to reconstruct the block motif in the original alignment sequences.
Submit this task. When this task is completed, verify output files in File Manager. The profile file will have following content:
The log file with blocks from the MSA and their scores will have content:
AUGUSTUS-PPX could already be run with the profiles, but it would take too much time on a whole chromosome when supplemented with a protein profile. We will perform a fast block search to determine regions containing profile hits. Run fastBlockSearch with the filtered profile zebrafish-filtered.prfl on the chromosome 8:
Output will look like:
This file shows coordinate, strand, mean odds-ratio score, and specificity of score, and the motif. Luckily we have found 1 hit for our search. To speed up this tutorial, we choose a region around the hit for the final AUGUSTUS run. For example we choose a region from 53000000 to 54000000. On a real dataset, you will want to run gene prediction using entire chromosome or genome assembly.
We run Augustus using protein profile and reference zebrafish training (--species option). If we were to work on a species which is not listed in the species query list (i.e. the HMM model was not yet trained for this species), we would want to select its closest evolutionary relative from the list.
In the Secondary Parameters section, we specify our profile and set up region coordinates for which we are going to do gene prediction:
For real dataset, use powerful nodes with at least 4 cores and 28 Gb RAM. For full genome prediction you may want to increase the power to 16 or 32 cores and up to 208 Gb RAM.
Once task is completed, check that you have obtained a gff track in the File Manager.
Genes are predicted for chromosome 8 from 53000000 to 54000000 positions. Now you can use genomic browser to visualize both Zebrafish reference genome and our predicted genes in gff format.
Follow us on Facebook and Twitter to be the first to read our new tutorials!Run this tool More tutorials