It is almost impossible to obtain a DNA sample that contains genetic material from only one organism. Because of how the sequencing process works, even a single foreign DNA molecule can be amplified enough to become detectable. In some cases researchers intentionally sample DNA from multiple species; such samples are called metagenomic samples. In either case you may need to separate the reads of multiple organisms by their source. In this tutorial we will use BBSplit, a metagenomics tool, to produce multiple files with reads from different genomes from one mixed file.
In our tutorial “Binning reads using BBSplit- a metagenomics tool” we will split reads of an ancient human. Like most ancient DNA samples, these reads are very likely to be contaminated with DNA from other organisms. We will separate the ancient human reads from reads of the most common bacterium inhabiting human skin, Staphylococcus epidermidis, and of the most common object of laboratory research, Escherichia coli. As the source file we will use a subset of the ancient human DNA reads obtained in this research. You can download the file with the source reads using this link.
Now we need to download the genomes of the organisms that are likely to have contaminated the sample: in our case, Staphylococcus epidermidis and Escherichia coli. To download the genomes, visit the corresponding pages for Staphylococcus epidermidis and Escherichia coli in the NCBI Genome database and click the "Download sequences in FASTA format for the genome" link on each page.
Unpack the archives and rename the genomes to ecoli.fasta and sepidermidis.fasta.
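Before uploading, it can help to confirm that the renamed files really are FASTA files. The snippet below is a minimal sketch, not part of the InsideDNA workflow: `is_fasta` is a hypothetical helper name, and the file names are the ones used in the BBSplit command later in this tutorial. It only checks that the first line of each file starts with `>`, as a FASTA header should.

```shell
# Hypothetical helper: succeeds when the file exists and its first line
# starts with '>', the FASTA header marker.
is_fasta() {
  [ -f "$1" ] && head -n 1 "$1" | grep -q '^>'
}

# Check the two genome files in the current directory.
for genome in ecoli.fasta sepidermidis.fasta; do
  if is_fasta "$genome"; then
    echo "$genome: looks like FASTA"
  else
    echo "$genome: missing or not FASTA"
  fi
done
```

Run it in the directory where you unpacked and renamed the genomes.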
Log into the InsideDNA application and navigate to the Files tab. Create a new folder called Bbsplit.
Upload the files with the reads and the genomes into this folder.
Now we are ready to run the BBSplit tool.
We will use the Terminal to run BBSplit. Navigate to the Terminal tab and press the Connect button.
Enter the following command in Terminal:
isub -t bbsplit -c 8 -r 7.2 -e "/srv/dna_tools/bbmap-36.02/bbsplit.sh in=/data/userXXX/Bbsplit/Filtered.fastq qin=33 ref=/data/userXXX/Bbsplit/ecoli.fasta,/data/userXXX/Bbsplit/sepidermidis.fasta basename=/data/userXXX/Bbsplit/out_%.fastq outu=/data/userXXX/Bbsplit/clean.fastq"
You should put your own userID instead of each XXX in this command. You can find your ID in the header of the Terminal tab.
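Since the user ID appears five times in the command, it is easy to miss one occurrence. One way to avoid that is to build the command from a single variable. The sketch below is only an illustration: `build_bbsplit_cmd` is a hypothetical helper name and `123` is a placeholder ID. The function just prints the command so that you can review it and paste it into the Terminal; it does not submit anything.

```shell
# Hypothetical helper: prints the isub command from this tutorial with the
# given user ID substituted for every XXX. Nothing is submitted.
build_bbsplit_cmd() {
  user_id="$1"
  dir="/data/user${user_id}/Bbsplit"
  # The BBSplit command is passed as a %s argument, so the literal % in
  # basename= is printed unchanged.
  printf 'isub -t bbsplit -c 8 -r 7.2 -e "%s"\n' \
    "/srv/dna_tools/bbmap-36.02/bbsplit.sh in=${dir}/Filtered.fastq qin=33 ref=${dir}/ecoli.fasta,${dir}/sepidermidis.fasta basename=${dir}/out_%.fastq outu=${dir}/clean.fastq"
}

# 123 is a placeholder; use the ID shown in the header of the Terminal tab.
build_bbsplit_cmd 123
```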
Here bbsplit is the name of the task, which will run on 8 CPU cores with 7.2 GB of RAM.
The -e option is followed by the BBSplit command, which is placed inside quotes. /srv/dna_tools/bbmap-36.02/bbsplit.sh is the path to the BBSplit tool on our platform.
in= is followed by the path to the file with reads, called Filtered.fastq.
ref= is followed by the comma-separated paths to the two bacterial genomes whose reads we want to split out. qin=33 indicates that the quality scores of the input reads are Phred+33 encoded (which is the most common case). basename= is followed by the path pattern for the output files that receive the reads matching each input bacterial genome. The % symbol in the pattern is replaced by the name of each reference.
For example, reads of Escherichia coli will in our case be written to a file called out_ecoli.fastq, which will be created in the Bbsplit folder.
outu= is followed by the path to the file for reads that were not mapped to any of the input genomes (and which, we hope, belong mostly to the ancient human). In our case this file will be called clean.fastq.
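To see which output files to expect, the naming rule can be sketched in a few lines of shell. This is an illustration of the behaviour described above, not BBSplit's own code: `expand_basename` is a hypothetical helper, and it assumes that the reference name is the genome file name without its extension, which is consistent with the out_ecoli.fastq example.

```shell
# Hypothetical helper: for each reference file, print the output path that
# results from replacing % in the basename pattern with the reference name.
expand_basename() {
  pattern="$1"
  shift
  for ref in "$@"; do
    name=$(basename "$ref")   # drop the directory part
    name="${name%.*}"         # drop the extension, e.g. ecoli.fasta -> ecoli
    printf '%s\n' "$pattern" | sed "s|%|$name|"
  done
}

expand_basename "/data/userXXX/Bbsplit/out_%.fastq" \
  "/data/userXXX/Bbsplit/ecoli.fasta" \
  "/data/userXXX/Bbsplit/sepidermidis.fasta"
```

For the two genomes in this tutorial, this prints the paths ending in out_ecoli.fastq and out_sepidermidis.fastq.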
Press Enter to submit the command. You can now go to the Tasks folder to monitor the progress of your task. It will take about 6 minutes.
When your job is done, navigate to the Files tab and open the Bbsplit folder to view the resulting files.
As you can see, most of the input reads did not match the bacterial genomes and were placed in the file clean.fastq. The files for Staphylococcus epidermidis and Escherichia coli do contain some reads, but since you are interested in the DNA of the ancient human, you should discard the bacterial reads and continue to work with the ones in clean.fastq.
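If you prefer numbers to file sizes, you can count the reads in each output file. The sketch below assumes the standard FASTQ layout of four lines per read; `count_reads` is a hypothetical helper, and the file names are the ones produced by the command in this tutorial. Files that are not present in the current directory are skipped.

```shell
# Hypothetical helper: number of reads in a FASTQ file, assuming each read
# occupies exactly four lines.
count_reads() {
  awk 'END { print int(NR / 4) }' "$1"
}

# Report the read count for each output file found in the current directory.
for f in clean.fastq out_ecoli.fastq out_sepidermidis.fastq; do
  if [ -f "$f" ]; then
    printf '%s\t%s\n' "$f" "$(count_reads "$f")"
  fi
done
```

Run it from inside the Bbsplit folder to compare how many reads ended up in each file.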
Congratulations, now you can split reads into several files by corresponding genomes!