Ultrafast and accurate classification of metagenomic reads with CLARK

Author : InsideDNA Time : 08 June 2016 Read time : 9 min

The growing number of metagenomic studies in medicine and environmental sciences is creating an increasing demand on the computational infrastructure designed to analyze these very large datasets. Often, the construction of ultra-fast and precise taxonomic classifiers can compromise on their sensitivity (i.e., the number of reads correctly classified). CLARK is classification system able to taxonomically annotate short reads with high accuracy and high speed at the same time. In this tutorial, we explain how to run CLARK on metagenomic samples against large NCBI reference database. Importantly, we also introduce an user-friendly interface, accessible to all users with or without technical background.

To help users to classify metagenomic samples against bacterial, archaeal, viral and human genomes, the CLARK’s authors have created a wrapper called CLARK_classify_metagenome. This wrapper allows one to retrieve all necessary data from the NCBI database, automatically construct the database files and launch one of CLARK classifiers. Currently, there are three possible classifiers: CLARK, CLARK-l and CLARK-S (see relevant publications for CLARK/CLARK-l here and for CLARK-S here). However, running CLARK requires a powerful server that all users may not have access to. To further simplify and alleviate to the computational burden for running CLARK tools, InsideDNA team together with CLARK developers have made an easy-to-use interface for users with limited computational resources. With this interface, users only needs to specify a file to analyze (in fastq/fasta format), and to select from a drop-down menu a database and taxonomy rank for the classification (i.e., species by default).

Let’s walk through a particular example to better understand how things work. In this tutorial we will use a simulated dataset simHC.20.500.fa.gz of synthetic reads from the original publication. This dataset simulates high complexity communities lacking dominant populations called “SimHC”, containing 113 sets of reads from various microbial genomes. To generate this dataset, an arbitrarily chosen set of twenty distinct species were selected and then the first 500 reads for each genome were extracted to build a total of 10,000 reads. We would like to classify the reads against NCBI bacterial database and want to use species-level taxonomy assignment. 

1. Download source data and unpacking

Download file by clicking on Download data button or simHC.20.500.fa.gz.

Log into InsideDNA application, navigate to the  File Manager and create a folder called “clark”. We will store all the data in this folder.


Upload simHC.20.500.fa.gz into this folder.


Let’s ungzip the file. Open terminal:


Type into the console to switch to clark directory:

cd clark

Then run gunzip command to unpack the file as shown below:

gunzip simHC.20.500.fa.gz


After completion of the command you will see a file called simHC.20.500.fa.

2.1. Running all analysis in InsideDNA terminal

If you are familiar with a UNIX terminal and would like to use InsideDNA terminal task submission system, you certainly can do it. Please, make sure you read how to use it in this short tutorial. The set of commands for getting the final result and KRONA figure is below (don't forget to replace userXXX with your own user ID and make sure you run command one after another).

Submitting CLARK_classify_metagenome:

isub -t simHC_task -c 32 -r 208 -e "idna_CLARK_classify_metagenome  -k 31  -O /data/userXXX/clark/simHC.20.500.fa  -R /data/userXXX/clark/clark_simhc  -m 1  -n 32  -U bacteria  -F species"

Submitting CLARK_estimate_abundance:
isub -t simHC_est_abund -c 4 -r 3.6 -e "idna_CLARK_estimate_abundance  -c 0.5  -D /srv/dna_tools/clark_db_1.2.3  -F /data/userXXX/clark/clark_simhc.csv  -a 0  > /data/userXXX/clark/simhc_estabun --krona"

Submitting Krona ktImportTaxonomy:
isub -t simHC_krona -c 4 -r 3.6 -e "idna_ktImportTaxonomy  -o /data/userXXX/clark/krona_html  -m 3  /data/userXXX/clark/results"

We still recommend you to read throught the tutorial in order to get a deeper understanding of what's going on.

2.2. Running CLARK_classify_metagenome

Now navigate to Tools and in a search box enter “clark_classify”


You will get a list of tools relevant for clark-search. From the results of search, choose CLARK_classify_metagenome tool and click Run button


Specify task name. For example, clark_simhc. Then click on a browse button near “Input file containing objects to classify” and navigate to “clark” folder.


Select simHC.20.500.fa file and click Select button.

You have now specified a file which contains reads of unknown metagenomic sample. Now we choose from the dropdown menu Bacteria, because we want to learn which bacterial species are present in the sample. If you would want to classify against viral database, you would have to choose Viruses. In principle, you can also benchmark you sample against multiple databases at once (for example, viruses + bacteria or bacteria + human). Then you would need to specify an appropriate option from the dropdown menu.


Next, you must choose a taxonomy level. It means that each read in your sample is going to classified according to a certain phylogenetic depth, the most broad being phylum and the deepest - species. Note, the taxonomic depth doesn’t impact computational time. We will select the species level taxonomy.


Next we need to specify an output folder for the results. Click on browse button and choose a folder. Then write a prefix for the output file.


We also have to choose a k-mer size. By default it is set to 31. In principle, as k-mer size increases, precision and the average confidence score of classification are also increasing, while the sensitivity is decreasing. In the original paper, authors suggest that k-mer size of about 19-21 allows to achieve the best combination of sensitivity and precision. For this particular analysis, we will leave a k-mer of 31.


Next you should select a number of cores. We will run the analysis on 32 cores. You must always run CLARK-S and CLARK full on 32 cores and 208 RAM, because it is a very memory intensive program. You can choose less memory when using CLARK-l.


In this tutorial, we will use the default CLARK mode and therefore leave this parameter unchanged. However, if you want to run CLARK-S or CLARK-l you need to select an appropriate program here.


Finally, by dragging the scrollbars to the right we choose 32 cores and then 208 gb RAM.


Now, click on preview task to ensure that the syntax of the command is correct


And click Submit.


3. Monitoring task progress

Now navigate to the Task section in the top menu.


You can now see your task appearing in the Running section.


It may take sometime for CLARK to get the results, so be patient. When the task is finished it will appear in the Completed section


4. Checking the result of CLARK_classify_metagenome

As a result, Clark produces a very simple comma-delimited file with 3 columns. You can find this file in a “clark” folder back in Files section.


Click on preview file


First column lists all “objects” - e.g. reads from the raw sample in our case; second column contain the length of this read and third column - an assignment. Assignment means a specific reference sequence we specified in our reference group. In fact, what happens behind the screen is that all reference bacterial genomes downloaded from NCBI database get a unique number and this is exactly the number we see in the file.

To make sense and estimate abundance of each taxonomic class in our sample, we need to run a second tool, called CLARK_estimate_abundance. This script will do two things for us: 1 - it will decrypt numbers to specific taxonomic names and 2 - count how many reads fall under each taxonomic name.

5. Running CLARK_estimate_abundance

Go back to Tools and type in search box "clark_estimate" and hit search button. From the results, select tool called CLARK_estimate_abundance and click Run Tool button. This tool only needs as an input file our resulting CLARK csv.


In addition we also need to specify output folder for the result and the name


And turn on the path to the taxonomic database


Finally, we also want to produce a file that will be directly readable by Krona - a very useful tool for making a beautiful metagenomics graphs.To obtain this file, we turn on --krona option


Once all settings are specify, select 4 Cores 3.6 Gb memory


Preview the command and click submit


Monitor the task in the Tasks section

6. Visualizing the result of CLARK_estimate_abundance with Krona

Once the CLARK_estimate_abundance task is completed, navigate to Files and open folder “clark”. Here you will find two resulting file from CLARK_estimate_abundance. One is called simhc_estabun and second one - results.krn


We now need to run a final tool, called Krona, to obtain a pretty graph of the taxonomy distribution and abundance in our sample. In fact, we are going to use only one tool from the Krona toolset and this tool is called ktImportTaxonomy.

Navigate to Tools section in the top menu and type in search Krona. Choose ktImportTaxonomy and click Run button


In the settings specify the following parameters


Change parameter Column of input files to use as magnitude to 3


Specify 4 as number of cores and 3.6 as CPU (Gb) capacity and click submit. When the task is completed in the “clark” folder you will find an html report.


Download this html report to your local computer and open in any browser. You will see a following picture:


Congratulations! You have successfully completed a tutorial on metagenomic sample classification. Please send us any feedback/questions and stay tuned for future tutorials!

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials