Biopython to Run Bioinfromatics Tools and Protein Sequences Alignment

Author : InsideDNA Time : 17 April 2017 Read time : 5 min

In our previous tutorial we described how to deal with different biological file formats (Fasta, Genbank etc.), how to search and download the data from biological databases (ExPASy, NCBI databases) with Biopython modules. Finally we fetched 44 amino acid sequences of tumor protein p53 from mammalian species and wrote them into tp53.fa file. If you did all the steps from previous tutorial you can use your output file for this pipeline or you can copy it from /data/userXXX/tutorials_data/ directory (userXXX is your user name). In this tutorial we will show how to perform different bioinformatics procedures with our protein sequences of tumor protein p53. Most of our steps will be performed with Biopython functions, so it is necessary to have a basic knowledge of Python programming language syntax.

Firstly, we need to open the terminal window by clicking “SHOW TERMINAL” button.

Biopython Tools and Protein Sequences Alignment screen 1

Type the following commands “mkdir biopython2” and “cd biopython2” to create and move into working directory. Now copy the file tp53.fa from /data/userXXX/tutorials_data/or /data/userXXX/biopython/directories (cp [directory]/tp53.fa ./).

Now we should run a ClustalW2 tool from EMBOSS package to create a multiple alignment of tp53 protein sequences:

isub -c 2 -r 7.5 -t clust -e '/srv/dna_tools/clustalw/clustalw2 -TYPE=PROTEIN -INFILE=/data/userXXX/biopython2/tp53.fa -OUTFILE=/data/userXXX/biopython2/tp53.aln -OUTPUT=CLUSTAL'

where -TYPE=PROTEIN specifies import sequence type, -OUTFILE and -OUTPUT specify output file and output file format.

You will see 2 output files tp53.aln and tp53.dnd in your working directory (ls).

tp53.aln file contains multiple alignment in clustalw file format (less).

Biopython Tools and Protein Sequences Alignment screen 2

All the following python commands you can run locally in the python command line (python) or you can create a new script with all these commands written one after another and run it locally (python [script_name].py) or submit the script as a task:

isub -c 2 -r 7.5 -t scr1 -e 'python
/data/userXXX/biopython2/[script_name].py’

For the last option you need to specify the full path for all input/output files or directories. Biopython provides many opportunities for changing file formats. For example you can run:

from Bio import AlignIO
align = AlignIO.read("/data/userXXX/biopython2/tp53.aln", "clustal")

to read an alignment file into SeqIO object and

from Bio import SeqIO
SeqIO.write(align, "/data/userXXX/biopython2/out.fa", "fasta")

to write the alignment into out.fa file in Fasta format.

Biopython Tools and Protein Sequences Alignment screen 3

tp53.dnd file contains phylogenetic tree that was built based on final alignment. You can open (less) this file, however, it is more convenient to red it in the following way:

from Bio import Phylo
tree = Phylo.read("/data/userXXX/biopython2/tp53.dnd", "newick")
Phylo.draw_ascii(tree)

 Biopython Tools and Protein Sequences Alignment screen 4

You can see that this tree looks quite similar to the real mammalian phylogenetic tree.

Now let’s try to run BLAST from python script. There are two possible options: you can run BLAST locally with your own database (Bio.Blast.Applications module) or run it over the Internet (Bio.Blast.NCBIWWW module). We will use the second option to find homologs of one of our tp53 proteins to illustrate this process. Use help command to read about qblast command line options:

from Bio.Blast import NCBIWWW
help(NCBIWWW.qblast)

Biopython Tools and Protein Sequences Alignment screen 5

record = list(SeqIO.parse("/data/userXXX/biopython2/tp53.fa", "fasta"))

read alignment file,

result_handle = NCBIWWW.qblast("blastp", "swissprot", record[0].seq)

run protein BLAST over SwissProt database,

from Bio.Blast import NCBIXML
blast_record = NCBIXML.read(result_handle)

NCBIXML.read command allows to parse BLAST XML output. Firstly, we need to specify E-value threshold:

E_VALUE_THRESH = 0.05

and then to iterate through all BLAST hits to filter out alignments with e-value (alignment.hsps)> 0.05 and print BLAST hit’s e-value, length and SwissProt id:

for alignment in blast_record.alignments:
for hsp in alignment.hsps:
print('name:', alignment.title)
print('length:', alignment.length)
print('e-value:', hsp.expect)

Biopython Tools and Protein Sequences Alignment screen 6

To see the query and BLAST hits alignments use these commands in the previous for loop:

print(hsp.query)
print(hsp.match)
print(hsp.sbjct)

Biopython Tools and Protein Sequences Alignment screen 7

You can see that most of the hits are Tumor proteins p53, p73, p63 from different species. You may search each of these proteins (using id) in SwissProt database to get an additional information, for example:

from Bio import ExPASy
from Bio import SwissPort

handle = ExPASy.get_sprot_raw("Q95330.1")
seq_record = SeqIO.read(handle, "swiss")
print(seq_record.id)
print(seq_record.description)
print(seq_record.annotations['date'])

Biopython Tools and Protein Sequences Alignment screen 8

Let’s try some more actions that we can do with our alignment. First of all, let’s calculate the summary information about our alignment. To make it we need to specify a new alphabet for alignment object which contains not only amino acids, but also gaps:

from Bio import Alphabet
from Bio.Alphabet import IUPAC
from Bio.Align import AlignInfo
filename = "tp53.aln"
alpha = Alphabet.Gapped(IUPAC.protein)
c_align = AlignIO.read(filename, "clustal", alphabet=alpha)
summary_align = AlignInfo.SummaryInfo(c_align)

Alignment summary object can be further used for different purposes. For example we can calculate a Position Specific Score Matrix (PSSM) from our alignment. PSSM is a count matrix, where for each column in the alignment the number of each alphabet letters is counted and totaled.

consensus = summary_align.dumb_consensus()
pssm = summary_align.pos_specific_score_matrix(consensus, chars_to_ignore = ['-'])
print(pssm)

Biopython Tools and Protein Sequences Alignment screen 9

Each column corresponds to one of 20 amino acids, each row corresponds to alignment positions. The numbers in the matrix are equal to the number of times when specific amino acid was observed at specific position. Xs denote positions with no clear consensus.

Also we can print this consensus without matrix:

print(consensus)

Biopython Tools and Protein Sequences Alignment screen 10

Finally we can calculate a substitution matrix from alignment summary object. Substitution matrix for protein alignments is a 20 aa * 20 aa matrix where each position is proportional to the expected probability of this specific substitution.

replace_info = summary_align.replacement_dictionary()
from Bio import SubsMat
my_arm = SubsMat.SeqMat(replace_info)
my_lom = SubsMat.make_log_odds_matrix(my_arm)
my_lom.print_mat()

Biopython Tools and Protein Sequences Alignment screen 11

As you can see diagonal values are positive because of the high probability to see the same amino acid in the same position of another homologous protein. ND values mean that there is not enough data to estimate a probability of this substitution from alignment.

You may also interested in

Biopython to Retrieving Fasta Sequences From Ncbi Databases

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials