At our first tutorial we described how to deal with different biological file formats (Fasta, Genbank etc.), how to search and download the data from biological databases (ExPASy, NCBI databases) with Biopython modules. Our second tutorial was about running different biological tools with Biopython and working with protein alignment (Substitution and Position Specific Score Matrices calculations). This tutorial will be the third and the last tutorial on Biopython. We will pay attention to different Biopython functions which might be useful for experimental biologists. We will search for restriction sites in the plasmid, predict ORFs from the plasmid nucleotide sequence and try to predict melting temperatures of PCR primers. Since all of our steps will be made with Biopython functions - it is necessary to have a basic knowledge of Python programming language syntax.
Firstly, we need to open the terminal window by clicking “SHOW TERMINAL” button.
Type the following commands “mkdir biopython3” and “cd biopython3” to create and move into working directory.
All the following python commands you can run locally in the python command line (python) or you can create a new script with all these commands written one after another and run it locally (python [script_name].py) or submit the script as a task:
isub -c 2 -r 7.5 -t scr2 -e 'python /data/userXXX/biopython3/[script_name].py’
For the last option you need to specify the full path for all input/output files or directories.
Now we need to download any plasmid nucleotide sequence. For example, we will take pBR322 plasmid. You can read detailed description of this plasmid here: https://www.neb.com/~/media/NebUs/Page%20Images/Tools%20and%20Resources/Interactive%20Tools/DNA%20Sequences%20and%20Maps/pBR322_map.pdf. This plasmid contains the origin of replication of pMB1, the rop gene, which encodes a restrictor of plasmid copy number, bla gene encoding the ampicillin resistance (AmpR) protein, and the tetA gene encoding the tetracycline resistance (TetR) protein.
You can find pBR322 nucleotide sequence in Fasta format here: https://www.neb.com/~/media/NebUs/Page%20Images/Tools%20and%20Resources/Interactive%20Tools/DNA%20Sequences%20and%20Maps/Text%20Documents/pBR322fsa.txt. You can upload this fasta file by clicking “Upload files” button or copy-paste it into pBR322.fa file.
Now imagine that we are interested in finding restriction sites in pBR322 sequence. Usually experimentalists come across this problem when planning vector transformation or, for example, RAD-Seq experiment. We have a list of restrictase enzymes and we would like to search for their sites in our plasmid.
from Bio.Restriction import *
multi = (BamHI, BsaI, BspMI, BspQI, BstZ17I, ClaI, EagI, EcoNI, EcoRI, EcoRV, HindIII, MscI, NdeI, NheI, NruI, PciI, PflFI, PshAI, PstI, PvuI, PvuII, SalI, ScaI, SgrAI, SphI, StyI)
We used this list of restrictases because we know know that their sites are present in at least one copy in our plasmid (see https://www.neb.com/~/media/NebUs/Page%20Images/Tools%20and%20Resources/Interactive%20Tools/DNA%20Sequences%20and%20Maps/pBR322_sites.pdf).
from Bio import SeqIO
record = SeqIO.read("pBR322.fa", "fasta")
for i in multi:
coords = i.search(record.seq)
for j in coords:
print(str(i) + '\t' + str(j))
In this piece of code we iterate through a list of restrictases, search for their sites and print their coordinates.
You see that our coordinates slightly differ from those at previous picture. The reason is that Biopython output coordinates of single stranded cuts made by restriction enzyme, while in official documentation you see restriction sites start coordinates. For each restriction site you can read additional information, for example:
Let’s move to the next section. Imagine that we have a new nucleotide sequence assembled and we don’t know anything about its gene composition. There are many tools for gene annotation, however, we will make a basic steps with Byopython.
Firstly, we need to find all open reading frames (ORFs) in our plasmid in all 3 frames (starting from 1st, 2nd and 3rd nucleotide) on both DNA strands. ORF is a nucleotide sequence which potentially can encode protein. It starts with start codon (ATG), ends with stop codons (TAG, TAA, TGA) and has some sufficient length without inner stop codons.
table = 11
min_pro_len = 200
tet_genes = 
for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
for frame in range(3):
length = 3 * ((len(record)-frame) // 3)
ind = 0
for pro in nuc[frame:frame+length].translate(table).split("*"):
ind = ind + len(pro) + 1
if len(pro) >= min_pro_len:
print("length %s, strand %s, frame %s, length %s" % (len(pro), strand, frame, ind - len(pro)))
Since there will be multiple short ORFs that occur randomly we specified length threshold (length > 200 nt). For the translate() function one need to specify a genetic code table which will be used in the search. We used table 11 which is “The bacterial, archaeal and plant plastid code”. You can print this table by the following commands:
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_id
There are 2 large ORFs and their coordinates coincide with tetA and bla genes coordinates. Sometimes it is useful to get RNA or protein sequence from the gene. Let’s infer mRNA and protein sequence of our predicted ORFs.
If the gene is located on the negative strand you should use the following command instead:
To check if we found a real gene we can search for it’s homologs by BLAST. We won’t discuss this part – you can read about running BLAST over the Internet in the previous tutorial on Biopython.
Finally, experimentalists might be interested in designing primers for PCR. One important parameter for choosing primers is its temperature of melting. Melting temperatures of 2 primers must be similar and belong to some temperature interval (usually 50 – 60 oC). You can check 2 sequences up- and downstream (or around) of the fragment of interest as primers (record.seq[start-20:start] and record.seq[end-20:end].reverse_complement()). “Start” and “end” are fragment coordinates, 20 is expected primer size. For each primer we can run the following commands to calculate melting temperatures.
from Bio.SeqUtils import MeltingTemp
from Bio.Seq import Seq
print('%0.2f' % MeltingTemp.Tm_Wallace(myseq))
print('%0.2f' % MeltingTemp.Tm_GC(myseq))
print('%0.2f' % MeltingTemp.Tm_NN(myseq))
We used 3 different models to calculate melting temperatures of primer. You can read about these models here: http://biopython.org/DIST/docs/api/Bio.SeqUtils.MeltingTemp-module.html. There are some additional parameters that you can specify like ion or dNTP concentrations.
print('%0.2f' % mt.Tm_NN(myseq, Na=20))
print('%0.2f' % mt.Tm_NN(myseq, Na=50))
print('%0.2f' % mt.Tm_NN(myseq, Na=50, Mg=5))
print('%0.2f' % mt.Tm_NN(myseq, Na=50, Mg=5, dNTPs=0.5))
print('%0.2f' % mt.Tm_NN(myseq, Na=50, Mg=5, dNTPs=1))
You can see that higher Na+/Mg2+ concentrations in our solution increase melting temperature (by “binding” water molecules). Higher concentrations of dNTP decrease melting temperature because free nucleotides also compete for interactions with nucleotide sequences (in my opinion).
Follow us on Facebook and Twitter to be the first to read our new tutorials!Run this tool More tutorials