Classifying metagenomics sequences at species level using CLARK for bacterial metagenome

Author : InsideDNA Time : 01 October 2015 Read time : 8 min

Metagenomics is a hot area of scientific research. It is now extensively used in ecology, biofuel production, agriculture, human and animal health. Nevertheless, with opportunities come new challenges: in particular, data processing becomes an increasingly time consuming and computationally expensive.  The most “expensive” (time and compute-wise) task is to assign taxonomic labels to metagenomics DNA sequences. Recent bioinformatics advances try to overcome this problem by enabling efficient algorithm. Here we present one of the most recent software for metagenomics data analysis – CLARK. Compared to existing solution, it is more than 5 times faster, yet for large datasets requires powerful machine and is not available on Windows. In this tutorial we show basic usage of the tool with InsideDNA. In one of the next tutorials will demonstrate a full pipeline for bacterial, viral and human metagenomics data analysis.

CLARK is a novel bioinformatics tool for fast and accurate sequence classification. It can be useful for analysis of metagenomics and genomic datasets.  In its own terminology, CLARK requires following datasets as input:

  1. objects - fasta or fastq files with metagenomics reads to be classified
  2. targets - fasta files with reference sequences, e.g. genome assemblies of known species or individuals
  3. a simple text file listing all targets and labeling these targets into discrete groups, e.g. species, genera, families, etc.

With this information at hand, objects (i.e. newly sequenced reads) can be classified according to the references into respective groups (such as species, genera, families, etc). 

What is labeling in CLARK?

Imagine you have ten reference genomes coming from five species. Each species have two sequenced genomes coming from different samples. Five species are further grouped into 2 genera. Given this task, your labeling schema for CLARK can take two forms:

Case 1: species level labeling - you will have five target classes for labeling (each species has its own label)

  • Genome1.fa Sp1
  • Genome2.fa Sp1
  • Genome3.fa Sp2
  • Genome4.fa Sp2
  • Genome5.fa Sp3
  • Genome6.fa Sp3
  • Genome7.fa Sp4
  • Genome8.fa Sp4
  • Genome9.fa Sp5
  • Genome10.fa Sp5

Case 2: genus level labeling - you will have two target classes for labeling (each genus has its own label).

  • Genome1.fa G1
  • Genome2.fa G1
  • Genome3.fa G1
  • Genome4.fa G1
  • Genome5.fa G1
  • Genome6.fa G1
  • Genome7.fa G2
  • Genome8.fa G2
  • Genome9.fa G2
  • Genome10.fa G2

Based on these two labeling schemas, you can either classify newly sequences metagenomic dataset (objects) at species level, or at genus level. In principle, any other desired level or classification approach (non-taxonomic one) can be applied. For example, you can imagine classifying BAC/transcript to chromosome-arms/centromeres or samples coming from two different geographical locations for viral strains evolution.

CLARK comes in two flavors – CLARK and CLARK-l. CLARK is a default version which requires a lot of RAM and is designed for large datasets. CLARK-l is a light version which doesn’t require much computing resources and will be sufficient for small datasets. You can read more about CLARK and its options here and here.

In this tutorial we will demonstrate a basic usage of CLARK-l with a simplified dataset (download dataset). The dataset consists of following entities:

  • Objects - we will use metagenomic dataset from KRAKEN publication (another tool for metagenomic analysis) called HiSeq_accuracy.fa. As explained in KRAKEN publication  “the HiSeq metagenomes were built using twenty sets of bacterial whole-genome shotgun reads. These reads were found either as part of the GAGE-B project or in the NCBI Sequence Read Archive. Each metagenome contains sequences from ten genomes. For these metagenomes, 10% of their sequences were selected from each of the ten component genome data sets (i.e., each genome had equal sequence abundance)”.
  • Targets – we will use 4 reference genomes (NC_003919.fna, NC_003921.fna, NC_003922.fna, NC_020209.fna) from NCBI belonging to 2 bacterial species, Xanthomonas axonopodis and Pseudomonas poae. Dataset are obtained from NCBI here.
  • targets.txt – text file describing our target. In our targets.txt we specify that our 4 reference genomes represent 2 species (labels XA and PP for Xanthomonas axonopodis and Pseudomonas poae, respectively).

Clearly, these 4 genomes are not enough to label all metagenomics sequences in HiSeq_accuracy.fa, but it is just enough to demonstrate the basic usage of CLARK-l. Download entire dataset for this tutorial here or by clicking on Download data button at the top of the page.

1. Upload genomes, metagenome and targets.txt to InsideDNA

Log in (or sign up if you have not yet) into InsideDNA application and read Introduction Tutorial to get familiar with different options available on the website. Once you learned the basics, navigate into Files tab. Create a new folder called “clark:

Upload unrared files from our dataset into this folder.

2. Add CLARK-l to your CLARK project

Navigate into My Tools tab. Create a new project by clicking on + Add new project. Name it CLARK

Now, search in the search field for “clark”. Two tools will be returned – CLARK and CLARK-l. Click on + button on CLARK-l and choose clark project in the dropdown list. CLARK-l should appear in your clark project.

3. Initialize a task with CLARK-l (click to run)

Click on Run tool button for CLARK-l. This tool will do two things. First it will create a special database based on the reference genomes with a specific k-mer size (for CLARK-l only 27 kmer is available). Second, it will run classification algorithm with the obtained database and objects file.

You will have a Tool Settings menu opened for CLARK-l. Here you need to specify the Task name, tool parameters and computing settings. Then you will need to preview the task and submit it. Specify the task name which is easy for you to recognize later on (for example, CLARK-l_bac).

First specify input file with targets definition. Remember, you only need to list names of the fasta files, not paths to the files. Target definition is set in targets.txt

Second, select all the reference genomic files that you listed in targets.txt – 4 fasta files in our case

Third, specify input file with objects to be classified – this is our HiSeq_accuracy.fa file. If you would have had paired-end data – then you would need to select another option (Input paired-end fastq files)

Fourth, specify prefix for the output summary file

Fifth, specify directory to store database. To select it – create a new folder called db_folder

This task doesn’t require much computing power – so, keep core number and RAM low (e.g. 2 cores and 7.5 RAM). However, should you use CLARK (full version) you must select at minimum 16 or 32 cores with 104 and 120 Gb RAM.

Preview task and submit it.

4. Monitoring task progress.

Monitor the progress of your task. It will be done in a couple of minutes, but right now it is in a Running group. Once done – it will be moved to a Completed group and we can verify that nothing went wrong by looking at the error log in the right panel.

5. Obtaining the files

Now, let’s move to the File Manager (FM). Click on Files in top menu and navigate into root/clark directory. Here you will have a result.csv file and database files (ky, lb, sz).

Each read in the result.csv file is now labeled with a particular class. We had only two classed in our targets.txt (XA and PP) – thus reads will be attributed to either one of these two classes or will remain unclassified (NA). You can download obtained results.txt and post-processes it to obtain graph showing abundance of each class, phylogenetic relationship between classes, and many more fun things.

For example, with our simplified dataset, class abundance looks as follow:

Label

Count

Proportion_All(%)

Proportion_Classified(%)

N/A

9835

98.35%

N/A

PP

11

0.11%

6.67%

XA

154

1.54%

93.33%

In one of our next tutorials, we will explain how to use wrapper scripts for CLARK for pre-defined bacterial, viral and human NCBI databases and work on Human Microbiota dataset. Stay tuned!

* Preview picture is from the article  Metagenomic species profiling using universal phylogenetic marker genes

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials