The console (shell, command line) is an essential tool for the majority of bioinformatics tasks. As soon as you need to parse and analyze genomic data in even a slightly non-trivial way, using the console is unavoidable. One of the most common issues with a console is that the user is bound either to their own machine (with limited RAM and CPU) or to a cluster that may not have all the tools installed, may be overloaded with tasks, or may simply lack a node with the capacity needed to complete a task. Despite the rising popularity of containers, which remove the burden of tool installation, cluster capacity and load remain a pressing issue in bioinformatics. In this tutorial, we explain how InsideDNA resolves this issue by offering an HPC/PC-like console experience in a cloud environment. From now on, you need to worry neither about the number of available nodes nor about their capacity: should you need to assemble 100 genomes, each on 200 GB of RAM and 32 cores, InsideDNA will instantly scale to the needed capacity as you submit tasks.
Bioinformatics data analysis is computationally intense and is often viewed as one of today's top Big Data processing challenges. Yet the traditional HPC environment is often poorly tailored to bioinformatics data analysis, because HPC clusters are designed to distribute relatively small chunks of work via the MPI protocol to many relatively weak nodes. In bioinformatics, on the other hand, only a few programs support multi-node parallelization, and many programs (e.g. genome assemblers) require a lot of RAM and CPU on a single node. To get a good sense of why this is an issue, you can view an excellent presentation by Dr. Wurm here.
This tutorial will help InsideDNA users get familiar with the task submission system and console-to-cloud usage on the InsideDNA platform. We know that many of you won’t be excited to read this relatively lengthy tutorial, but at least make sure you read the TL;DR section or watch our video demo on YouTube.
Below you can watch a detailed video tutorial on using the InsideDNA terminal.
First, you need to activate your console. Log in to the InsideDNA application (or sign up if you have not yet) and read the Introduction Tutorial to get familiar with the different options available on the website. Once you have learned the basics, navigate to the Files tab. In the left menu, choose Console and click the Activate button.
After a few seconds, the console will open. Type the “pwd” command to see your current location:
Type the “ls” command to list the files in your working directory:
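Assuming the console behaves like a standard Linux shell, a first session might look like this (the actual path and listing will of course differ per user):

```shell
pwd    # print the current working directory -- your cloud home folder
ls -l  # list its contents; this mirrors what the File Manager shows
```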
As you can see, the list is identical to the files you have in the File Manager, and any operation you perform on files in the console is instantly visible in the FM.
Once you have activated the console, you need to check which tools are currently installed. We support many bioinformatics tools, their number will continue to grow, and if you have a particular software request, please send us an email. There are two ways to check which tools are available: a nicely formatted tool list is available on the Tools page, or you can simply list all tool aliases in the console by calling the “alias” command:
Please note that all tools have the prefix idna_ in front of the tool name:
If you want to search for a particular tool, you can “grep” for a keyword:
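For instance, to look for BLAST-related tools (the keyword here is just an illustration), you could run:

```shell
# Filter the alias list by a keyword; every InsideDNA tool alias
# carries the idna_ prefix, so matches will look like idna_<tool>.
alias | grep "blast" || echo "no matching tool found"
```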
Let’s imagine that you would like to assemble three bacterial genomes using the Velvet assembler. First, you need to check whether it is present in our set of tools. Type “alias | grep "velvet"”:
You will see that Velvet is built for different maximum k-mer sizes, but we are going to work with its default version: idna_velveth and idna_velvetg. To get help on a tool, type “isub -th idna_velvetg”:
To access a tool’s help, always use the command above (isub -th), and don’t forget the idna_ prefix before the tool name.
Let’s create an empty directory for our analysis. Type “mkdir velvet_assembly”:
Now upload the three bacterial genomes as a gzipped file in the FM.
Then “cd velvet_assembly” into the folder with the gzipped file and ungzip it:
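Assuming the upload is a standard gzipped tar archive (the name genomes.tar.gz is a placeholder for whatever you uploaded), unpacking looks like this. The first three lines merely fabricate a stand-in archive so the step can be previewed off-platform; on InsideDNA, skip them and run only the tar and ls lines against your real archive:

```shell
# Fabricate a stand-in archive for off-platform preview; on InsideDNA
# your uploaded archive is already in place -- skip these three lines.
mkdir -p velvet_assembly && cd velvet_assembly
printf '>rec1\nACGT\n' > Ralstonia_solanacearum.faa
tar -czf genomes.tar.gz Ralstonia_solanacearum.faa && rm Ralstonia_solanacearum.faa

tar -xzf genomes.tar.gz   # unpack the gzipped bundle
ls *.faa                  # the fasta files are now in place
```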
If you now “cd” to the source folder, you will see three fasta files. Let’s use grep to count the number of fasta records in Ralstonia_solanacearum.faa by typing “grep -c "^>" Ralstonia_solanacearum.faa”:
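The same count can be extended to all fasta files at once. The first line below only creates a tiny stand-in file so the loop can be previewed off-platform; skip it on InsideDNA, where your real .faa files are already in place:

```shell
# Stand-in data for off-platform preview; skip on the platform.
printf '>rec1\nACGT\n>rec2\nGGCC\n' > demo.faa

# Count FASTA records (header lines start with ">") in every .faa file
for f in *.faa; do
  printf '%s\t' "$f"
  grep -c "^>" "$f"
done
```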
Please note: if you want to execute complex grep, awk, or sed queries on large files, make sure you submit them via the isub command (see details below). Otherwise, you risk waiting days for an execution that would otherwise take just a couple of minutes.
You need to use a special wrapper called isub to run a tool (submit a task). This wrapper takes a command and scales our cloud-based cluster by adding a node with the requested RAM/CPU. isub is a very simple command: you only need to provide a task name, a number of cores, an amount of RAM, and the tool settings, just as you would on your own machine. To get help on isub, type “isub --help”:
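The general shape of a call is sketched below; the task name and paths are illustrative, and -r is taken to be RAM in GB (matching the 52 GB nodes used later in this tutorial). The stand-in function at the top exists only so the sketch can be previewed off-platform; on InsideDNA the real isub wrapper is already on your PATH:

```shell
# Off-platform stand-in so this sketch prints a preview anywhere;
# on InsideDNA the real isub wrapper is already on PATH -- skip this.
if ! command -v isub >/dev/null 2>&1; then
  isub() { echo "[preview] isub $*"; }
fi

# -t task name | -c cores | -r RAM (GB, as used in this tutorial)
# -e the quoted tool command, exactly as you would type it locally
isub -t demo_task -c 8 -r 30 -e "idna_velvetg /data/user35/velvet_assembly/rs_assembly"
```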
We are working with the Velvet assembler, so the first command to call is idna_velveth. We will execute it for Ralstonia_solanacearum.faa. Type the following command: “isub -t rs_velveth -c 8 -r 30 -e "idna_velveth /data/user35/velvet_assembly/rs_assembly 32 -fasta -short /data/user35/velvet_assembly/source/Ralstonia_solanacearum.faa"”.
You can monitor all submitted tasks in the Tasks tab, just as you would with tasks submitted via the web UI.
When the task is completed, verify its output by switching to the velvet_assembly/rs_assembly/ directory.
Now you can run the second command from the Velvet pipeline, velvetg: “isub -t rs_velvetg -c 8 -r 30 -e "idna_velvetg /data/user35/velvet_assembly/rs_assembly"”
When the task is completed, you can verify the output in both the File Manager and the console:
Important note: always specify the full path to your data, starting from /data/userXX/, otherwise the task will not be executed.
The beauty of the InsideDNA platform shows when you need to submit many tasks to many powerful nodes. Here, for example, we dealt with a single genome assembly for a tiny dataset; for assemblies of medium to large genomes, all you need to do is submit as many tasks, with as many different settings (e.g. k-mer size), as you need. There is literally no limit on how many nodes you can launch, and it all works smoothly right from the command line. Just to demonstrate, here is an example of submitting three genome assemblies with different k-mer sizes to three nodes, each with 8 cores and 52 GB of RAM.
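A submission like this can be sketched as a loop. The k-mer values, task names, and user35 paths below are illustrative, and the stand-in function at the top exists only so the loop can be previewed off-platform:

```shell
# Off-platform stand-in; on InsideDNA isub is already on PATH.
if ! command -v isub >/dev/null 2>&1; then
  isub() { echo "[preview] isub $*"; }
fi

# One task per k-mer size; each lands on its own 8-core / 52 GB node.
# k-mer values, task names, and paths are illustrative -- adapt them.
for k in 21 31 41; do
  isub -t "rs_velveth_k${k}" -c 8 -r 52 \
    -e "idna_velveth /data/user35/velvet_assembly/rs_k${k} ${k} -fasta -short /data/user35/velvet_assembly/source/Ralstonia_solanacearum.faa"
done
```

Each iteration is an independent task, so the three assemblies run in parallel on three separate nodes rather than queuing on one.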
Monitoring task execution:
Checking created data:
As for advanced usage, you can always pipe multiple commands together within the -e option. Additionally, you can provide a file with a set of commands; in this case, however, all the commands in the text file will be executed on a single node. Should you want to submit each command to a different node, wrap each one with isub and call the file as a bash script, as in the example below.
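A minimal sketch of such a script: every line is its own isub submission, so every command gets its own node. Task names, k-mer size, and file names are hypothetical, and the stand-in at the top of the script exists only so it can be previewed off-platform:

```shell
# Write a script in which each line is an independent isub submission.
cat > submit_all.sh <<'EOF'
#!/bin/bash
# Off-platform stand-in; on InsideDNA isub is already on PATH.
if ! command -v isub >/dev/null 2>&1; then
  isub() { echo "[preview] isub $*"; }
fi
isub -t asm_g1 -c 8 -r 52 -e "idna_velveth /data/user35/velvet_assembly/g1 31 -fasta -short /data/user35/velvet_assembly/source/genome1.faa"
isub -t asm_g2 -c 8 -r 52 -e "idna_velveth /data/user35/velvet_assembly/g2 31 -fasta -short /data/user35/velvet_assembly/source/genome2.faa"
EOF

bash submit_all.sh
```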
Running Vi editor to prepare bash file:
Executing bash file:
Checking task completion:
Follow us on Facebook and Twitter to be the first to read our new tutorials!