Using InsideDNA console-to-cloud access to assemble bacterial genomes with Velvet

Author : InsideDNA Time : 24 November 2015 Read time : 8 min

The console (shell, command line) is an essential tool for the majority of bioinformatics tasks. As soon as you need to parse and analyze genomic data in even a slightly non-trivial way, using the console is unavoidable. One of the most common issues with a console is that the user is either bound to their own machine (with limited RAM and CPU) or to a cluster that may not have all the tools installed, may be overloaded with tasks, or may simply lack the node capacity necessary to complete a task. Despite the rising popularity of containers, which remove the burden of tool installation, cluster capacity and load remain a pressing issue in bioinformatics. In this tutorial, we explain how InsideDNA resolves this issue by offering an HPC/PC-like console experience in a cloud environment. From now on, you don’t need to worry about the number of available nodes or their capacity: should you need to assemble 100 genomes, each on 200 GB of RAM and 32 cores, InsideDNA will instantly scale to the needed capacity as you submit tasks.

Bioinformatics data analysis is a computationally intense task and is often viewed as one of the top Big Data processing challenges of today. Yet the traditional HPC environment is often not well tailored for bioinformatics data analysis, because HPC clusters are designed to send relatively small chunks of work via the MPI protocol to many relatively weak nodes. In bioinformatics, on the other hand, only a few programs support multi-node parallelization, and many programs (e.g. genome assemblers) require a lot of RAM and CPU on each node. To get a good sense of why this is an issue, you can view an excellent presentation by Dr. Wurm here.

This tutorial will help InsideDNA users get familiar with the task submission system and console-to-cloud usage on the InsideDNA platform. We know that many of you won’t be excited to read this relatively lengthy tutorial, but at least make sure you read the TL;DR section or watch our video demo on YouTube.

TL;DR:


  • call the “tools” command to list available bioinformatics tools
  • all tools must be started with their full path (/srv/dna_tools/)
  • use “isub -th” followed by a tool name to get tool help; for instance, isub -th bowtie2 (note: the help may take several seconds to display)
  • submit tasks via the isub wrapper (check isub --help for options)
  • monitor submitted tasks in the Tasks section (top menu) or by typing the “tasks” command
  • either specify the full path to your data starting from /data/userXX/ or follow the same logic as on Mac/Linux (i.e. specify paths relative to your current directory)
  • execute grep, awk and sed on medium and large datasets via isub

Below you can watch a detailed video tutorial about using the InsideDNA terminal.

 


1. Activating the console

First, you need to activate your console. Log in to the InsideDNA application (or sign up if you have not done so yet) and read the Introduction Tutorial to get familiar with the different options available on the website. Once you have learned the basics, navigate to the Files tab. In the left menu, choose Console and click the Activate button.

After a few seconds, the console will open. Type the “pwd” command to see your current location:

Type the “ls” command to list the files in your working directory:

As you can see, the list is identical to the files you have in the File Manager (FM), and any operation you perform on files in the console will be instantly visible in the FM.

2. Viewing the list of available tools

Once you have activated the console, you need to check which tools are currently installed. We support many bioinformatics tools, their number will continue to grow, and if you have a particular software request, please send us an email. You have two ways of checking which tools are available: a nicely formatted tool list is available on the Tools page, or you can simply list all tool aliases in the console by calling the “alias” command:

Please note that all tools have the prefix idna_ in front of the tool name:

If you want to search for a particular tool, you can “grep” for a keyword:
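For example, to check whether an aligner such as Bowtie2 is available, you could filter the alias list with a keyword of your choice (the keyword below is just an illustration):

    alias | grep "bowtie"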

3. Getting tool help

Let’s imagine that you would like to assemble three bacterial genomes using the Velvet assembler. First, you need to check whether it is present in our set of tools. Type “alias | grep "velvet"”:

You can see that Velvet is built for different maximum k-mer sizes, but we are going to work with its default version - idna_velvetg and idna_velveth. To get help on a tool, type “isub -th idna_velvetg“:

To access tool help, always use the command above (isub -th) and don’t forget the idna_ prefix before the tool name.

4. Manipulating data

Let’s create an empty directory for our analysis. Type “mkdir velvet_assembly”:

Now upload the three bacterial genomes as a gzipped file via the File Manager.

Then “cd velvet_assembly” to the folder with the gzipped file and unpack it:
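Assuming, for illustration, that the uploaded archive is called genomes.tar.gz and unpacks into a source/ folder (both names are hypothetical), the commands would look roughly like this:

    cd velvet_assembly
    # unpack the gzipped archive; use gunzip instead if it is a plain .gz file
    tar -xzf genomes.tar.gz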

Now if you “cd” to the source folder, you will see three FASTA files. Let’s use grep to count the number of FASTA records in Ralstonia_solanacearum.faa by typing “grep -c "^>" Ralstonia_solanacearum.faa”:

Please note: if you want to execute complex grep, awk or sed queries on large files, make sure you submit them via the isub command (see details below). Otherwise, you risk waiting days for an execution that would otherwise take just a couple of minutes.
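For example, the same record count on a large file could be submitted as a lightweight task instead of being run interactively (the task name, core/RAM values and the userXX path below are placeholders):

    isub -t count_records -c 1 -r 4 -e "grep -c '^>' /data/userXX/velvet_assembly/source/Ralstonia_solanacearum.faa"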

5. Running Velvet to assemble a genome

You need to use a special wrapper called isub to run a tool (i.e. submit a task). This wrapper takes a command and scales our cloud-based cluster by adding a node with the requested RAM/CPU. isub is a very simple command: you only need to provide a task name, the number of cores and the amount of RAM, and specify the tool settings just as you would on your own machine. To get help on isub, type “isub --help”:
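Based on the options used throughout this tutorial, a typical isub call follows this pattern (a sketch; see isub --help for the authoritative list of options):

    isub -t <task_name> -c <number_of_cores> -r <RAM_in_GB> -e "<tool command with full data paths>"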

We are working with the Velvet assembler, so the first command to call is idna_velveth. We will execute it for Ralstonia_solanacearum.faa. Type the following command: “ isub -t rs_velveth -c 8 -r 30 -e "idna_velveth /data/user35/velvet_assembly/rs_assembly 32 -fasta -short /data/user35/velvet_assembly/source/Ralstonia_solanacearum.faa" ”.

You can monitor all submitted tasks in the Tasks section just as you would with tasks submitted via the web UI.

When the task is completed, verify its output by switching to the velvet_assembly/rs_assembly/ directory.

Now you can run the second command of the Velvet pipeline - velvetg: “ isub -t rs_velvetg -c 8 -r 30 -e "idna_velvetg /data/user35/velvet_assembly/rs_assembly" ”.

When the task is completed, you can verify the output both in the File Manager and in the console:

Important note: always specify the full path to your data, starting from /data/userXX/; otherwise the task will not be executed.

6. Submitting multiple tasks and advanced usage

The beauty of the InsideDNA platform shows when you need to submit many tasks to many powerful nodes. Here, for example, we dealt with a single genome assembly of a tiny dataset. In the case of assembling medium to large genomes, all you would need to do is submit as many tasks, with as many different settings (e.g. k-mer size), as you need. There is literally no limit on how many nodes you can launch, and it all works smoothly right from the command line. Just to demonstrate, here is an example of submitting three genome assemblies with different k-mer sizes to three nodes, each with 8 cores and 52 GB of RAM.

Submitting tasks:
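As a sketch, assuming we reuse the Ralstonia_solanacearum.faa reads and pick three illustrative k-mer sizes (the task names, k-mer values and the userXX path are placeholders), the three velveth submissions could look like this; each would later be followed by a matching idna_velvetg run on its output directory, as in section 5:

    isub -t rs_k21 -c 8 -r 52 -e "idna_velveth /data/userXX/velvet_assembly/rs_k21 21 -fasta -short /data/userXX/velvet_assembly/source/Ralstonia_solanacearum.faa"
    isub -t rs_k25 -c 8 -r 52 -e "idna_velveth /data/userXX/velvet_assembly/rs_k25 25 -fasta -short /data/userXX/velvet_assembly/source/Ralstonia_solanacearum.faa"
    isub -t rs_k31 -c 8 -r 52 -e "idna_velveth /data/userXX/velvet_assembly/rs_k31 31 -fasta -short /data/userXX/velvet_assembly/source/Ralstonia_solanacearum.faa"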

Monitoring task execution:
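As mentioned in the TL;DR, task status can be checked directly from the console with the “tasks” command (or in the Tasks section of the top menu):

    tasks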

Checking created data:
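For example, once the tasks have finished, the output directories from the hypothetical runs above can be listed from the console:

    ls /data/userXX/velvet_assembly/rs_k21 /data/userXX/velvet_assembly/rs_k25 /data/userXX/velvet_assembly/rs_k31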

Regarding advanced usage, you can always pipe multiple commands together within the -e option. Additionally, you can provide a file with a set of commands; however, in this case all the commands in the text file will be executed on a single node. Should you want to submit each command to a different node, wrap each of them with isub and call the file as a bash script, as in the example below.

Running Vi editor to prepare bash file:
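A minimal sketch of such a script, assuming we open a new file in vi (e.g. vi run_velvetg.sh, the name is arbitrary) and want to run the matching idna_velvetg step for the three hypothetical velveth runs above, each on its own node:

    # run_velvetg.sh - each command is wrapped in isub and therefore runs on its own node
    isub -t rs_k21_g -c 8 -r 52 -e "idna_velvetg /data/userXX/velvet_assembly/rs_k21"
    isub -t rs_k25_g -c 8 -r 52 -e "idna_velvetg /data/userXX/velvet_assembly/rs_k25"
    isub -t rs_k31_g -c 8 -r 52 -e "idna_velvetg /data/userXX/velvet_assembly/rs_k31"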

Executing bash file:
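Assuming the script above was saved as run_velvetg.sh, it can be executed directly, which submits all three tasks at once:

    bash run_velvetg.sh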

Checking task completion:

 

Follow us on Facebook and Twitter to be the first to read our new tutorials!
