Kallisto for RNA-seq Quantification on InsideDNA Platform

Author: InsideDNA | Date: 12 July 2017 | Read time: 6 min

In this tutorial we describe Kallisto and run its tools on the InsideDNA platform. Kallisto quantifies RNA-Seq data about two orders of magnitude faster than previous approaches. For the technical details, see the paper that describes the tool; for more on running it, see the manual. Here we focus on how to run it on the InsideDNA platform.

Kallisto achieves its speed by avoiding read alignment: instead, it pseudoaligns reads to a reference transcriptome. Kallisto builds a de Bruijn graph from the k-mers present in the transcriptome, in which each transcript corresponds to a path. During pseudoalignment, k-mers are extracted from each read and looked up in a hash table of transcriptome k-mers to find the paths of the transcriptome de Bruijn graph (the transcripts) that are compatible with them. In this way Kallisto identifies the transcripts from which a read could have originated without trying to find an exact alignment, which removes the time-consuming alignment step.
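The idea of k-mer lookup can be illustrated with a toy example. The sketch below is not Kallisto's implementation: it uses a plain Python dictionary instead of a de Bruijn graph, ignores reverse complements, and treats a k-mer that is missing from the index as "no match". The transcript names and sequences are made up for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcript ids that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts that contain every k-mer of the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            break  # no transcript is compatible with all k-mers seen so far
    return compatible or set()


# Made-up toy transcriptome: two transcripts that share a common prefix.
transcripts = {
    "tx1": "ACGTACGGTCA",
    "tx2": "ACGTACGTTGA",
}
index = build_index(transcripts, k=5)

print(pseudoalign("ACGTACG", index, k=5))  # read from the shared region
print(pseudoalign("TACGGTC", index, k=5))  # read that only fits tx1
```

A read from the shared prefix is compatible with both transcripts, while a read that spans the point where they diverge is compatible with only one. No base-by-base alignment is computed at any point.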

It has been shown that the accuracy of Kallisto is similar to that of existing RNA-seq quantification tools, while its speed is about two orders of magnitude higher. Kallisto is also simple to use, because the main parameter one needs to choose is the k-mer size. This parameter involves a trade-off: the k-mer size must be large enough that random sequences do not match the transcriptome, yet small enough to remain robust to sequencing errors.
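The robustness side of this trade-off can be made concrete with a little arithmetic. A read of length L contains L - k + 1 k-mers, and a single erroneous base damages every k-mer that covers it (up to k of them). The sketch below counts the k-mers that stay error-free; the read length of 75 and the error position are assumptions chosen only for illustration.

```python
def kmer_counts(read_len, k, error_pos):
    """Return (total, intact) k-mer counts for a read with one erroneous base.

    error_pos is the 0-based position of the erroneous base in the read.
    """
    total = read_len - k + 1
    # A k-mer starting at s covers positions s .. s + k - 1, so the damaged
    # k-mers are those starting between error_pos - k + 1 and error_pos.
    first = max(0, error_pos - k + 1)
    last = min(total - 1, error_pos)
    damaged = max(0, last - first + 1)
    return total, total - damaged


# Assumed example: a 75-base read with a single error in the middle.
for k in (11, 21, 31):
    total, intact = kmer_counts(read_len=75, k=k, error_pos=37)
    print(f"k={k}: {intact} of {total} k-mers are error-free")
```

With larger k, fewer k-mers survive a single error, which is why k cannot be made arbitrarily large.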

In this section, we will run Kallisto on the test dataset. It includes two fastq files, with forward and reverse reads respectively, and a fasta file with the reference transcriptome. In this tutorial we will use only the InsideDNA interface to run the tools; however, all the steps can also be done in the terminal.

Before running Kallisto, we should copy this dataset from the ~/tutorials_data/kallisto/ directory. To do this:

  • click the "Files" button;
  • go to the ~/tutorials_data/kallisto/ directory and select all the files;
  • create a kallisto working directory ("Create folder" button);
  • copy the files and paste them into the working directory ("Copy" and "Paste" buttons).

You may also do all these actions in the terminal (mkdir, cd, cp commands).
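The copy step can also be scripted. The sketch below is a minimal Python equivalent of the mkdir and cp commands; the source path is the one used in this tutorial, while the working directory name "~/kallisto" in the example call is an assumption.

```python
import shutil
from pathlib import Path


def copy_dataset(src_dir, work_dir):
    """Copy every regular file from src_dir into work_dir (created if missing)."""
    src = Path(src_dir).expanduser()
    dst = Path(work_dir).expanduser()
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file():
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied


# Example call ("~/kallisto" is an assumed working directory name):
# copy_dataset("~/tutorials_data/kallisto", "~/kallisto")
```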

[Screenshot 1: Kallisto for RNA-seq quantification on InsideDNA Platform]

[Screenshot 2: Kallisto for RNA-seq quantification on InsideDNA Platform]

Before pseudoaligning the reads, we need to index the transcriptome file. At this step the transcriptome de Bruijn graph (T-DBG) is constructed. To do this, we need to:

  • create an index output directory "ind";
  • click the "Tools" button and find the "kallisto index" tool in the list;
  • specify the task name (1), the transcriptome file to index (2), the output directory (3) and the index file name (4);
  • specify the number of cores (5) and the RAM size (6), and submit the task (7). To check that you used the correct arguments, use "Preview Task" to see the command line.
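For readers working in the terminal, the indexing step corresponds to a single "kallisto index" call. The sketch below only builds and prints that command line; it does not run Kallisto. The -i (index file) and -k (k-mer size, odd, at most 31, default 31) options are taken from the Kallisto manual, while the file names transcripts.fasta and ind/transcripts.idx are hypothetical placeholders.

```python
import shlex


def kallisto_index_cmd(transcriptome, index_path, kmer_size=31):
    """Build the argument list for `kallisto index` (the command is not executed)."""
    if kmer_size % 2 == 0 or not 1 <= kmer_size <= 31:
        raise ValueError("kallisto expects an odd k-mer size of at most 31")
    return [
        "kallisto", "index",
        "-i", str(index_path),  # index file to create
        "-k", str(kmer_size),   # k-mer size
        str(transcriptome),     # fasta file with the reference transcriptome
    ]


# Hypothetical file names, for illustration only.
index_cmd = kallisto_index_cmd("transcripts.fasta", "ind/transcripts.idx")
print(shlex.join(index_cmd))
```

The printed line can be compared with the command shown by "Preview Task", or passed to subprocess.run if Kallisto is installed locally.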

[Screenshot 3: Kallisto for RNA-seq quantification on InsideDNA Platform]

Then, to quantify our RNA-Seq data, we need to run the pseudoalignment step. To do this:

  • create an output directory;
  • run the tool "kallisto quant";
  • specify the task name (1), the directory with the index file and the index name (2, 3), the input fastq files with reads (4) and the output directory (5);
  • specify the RAM size and the number of cores (6), and submit the task.

kallisto quant has several additional arguments that can be useful. To see them, click the "Show Secondary Parameters" button. Use --single for single-end reads (Kallisto then also requires the estimated mean fragment length and its standard deviation, -l and -s); --pseudobam to output pseudoalignments in SAM format; and --bootstrap-samples to set the number of bootstrap samples.
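The quantification step and its secondary parameters map onto a "kallisto quant" command line. As a rough sketch, the helper below assembles that command without running it. The options -i, -o, -t, -b, --single, -l and -s come from the Kallisto manual; the file and directory names are hypothetical, and the fragment length values in the single-end example are arbitrary placeholders rather than recommendations.

```python
import shlex


def kallisto_quant_cmd(index_path, out_dir, fastq_files, bootstraps=0,
                       threads=1, single=False, frag_len=None, frag_sd=None):
    """Build the argument list for `kallisto quant` (the command is not executed)."""
    cmd = ["kallisto", "quant",
           "-i", str(index_path),  # index produced by `kallisto index`
           "-o", str(out_dir),     # output directory
           "-t", str(threads)]     # number of threads
    if bootstraps:
        cmd += ["-b", str(bootstraps)]  # number of bootstrap samples
    if single:
        # Single-end mode needs the fragment length mean and standard deviation.
        if frag_len is None or frag_sd is None:
            raise ValueError("--single requires frag_len (-l) and frag_sd (-s)")
        cmd += ["--single", "-l", str(frag_len), "-s", str(frag_sd)]
    return cmd + [str(f) for f in fastq_files]


# Hypothetical file names, for illustration only.
paired = kallisto_quant_cmd("ind/transcripts.idx", "out",
                            ["reads_1.fastq", "reads_2.fastq"])
single_end = kallisto_quant_cmd("ind/transcripts.idx", "out_single",
                                ["reads_1.fastq"], single=True,
                                frag_len=200, frag_sd=20)
print(shlex.join(paired))
print(shlex.join(single_end))
```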

Now we can look at the output files. There are three of them: abundance.h5 (an HDF5 binary file containing run information, abundance estimates, etc.), abundance.tsv (a plain-text file with transcript abundance estimates) and run_info.json (a JSON file containing information about the run). To read the abundance.tsv file, click the "Preview" button. You will see the transcript ids (1st column, target_id), their lengths (2nd column, length), their effective lengths (3rd column, eff_length) and the abundance estimates for each transcript: estimated counts (4th column, est_counts) and transcripts per million (5th column, tpm).
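Because abundance.tsv is a tab-separated text file with a header row, it is easy to load in a script. The sketch below parses it with Python's csv module; the column names follow the Kallisto output format (target_id, length, eff_length, est_counts, tpm), but the two sample rows are made-up values used only so that the example runs without a real output file.

```python
import csv
import io

# Made-up sample content in the abundance.tsv layout.
HEADER = ["target_id", "length", "eff_length", "est_counts", "tpm"]
ROWS = [
    ["tx1", "1500", "1321.0", "120.0", "45.2"],
    ["tx2", "900", "721.0", "30.5", "21.1"],
]
SAMPLE = "\n".join("\t".join(row) for row in [HEADER] + ROWS) + "\n"


def read_abundance(handle):
    """Parse an abundance.tsv-style file into {target_id: {column: value}}."""
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        table[row["target_id"]] = {
            "length": int(row["length"]),
            "eff_length": float(row["eff_length"]),
            "est_counts": float(row["est_counts"]),
            "tpm": float(row["tpm"]),
        }
    return table


table = read_abundance(io.StringIO(SAMPLE))
# List transcripts from the most to the least abundant by TPM.
for name in sorted(table, key=lambda t: table[t]["tpm"], reverse=True):
    print(name, table[name]["est_counts"], table[name]["tpm"])

# With a real file (the path is an assumption):
# with open("out/abundance.tsv", newline="") as handle:
#     table = read_abundance(handle)
```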

[Screenshot 4: Kallisto for RNA-seq quantification on InsideDNA Platform]

Because the pseudoalignment step is very fast, Kallisto makes it practical to quantify the uncertainty of abundance estimates by bootstrapping the data. The reads are sampled with replacement from the dataset and the abundances are re-estimated for each resampled dataset. After several rounds of bootstrapping, the variance of the abundance estimates can be calculated. Inaccurate pseudoalignment or high redundancy among reads or transcripts leads to less reliable estimates and thus to higher variance. To run several rounds of bootstrapping, run kallisto quant in the same manner as before, but add one secondary parameter:

  • create another output directory;
  • run kallisto quant with the same parameters, but with the "Number of bootstrap samples" parameter activated (1) and the number specified (2).
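To make the bootstrap idea concrete, here is a toy illustration of sampling with replacement. It is a deliberate simplification and not Kallisto's procedure: every read is assumed to be assigned to exactly one transcript, so "re-estimating abundance" reduces to counting. The read assignments, the number of rounds and the random seed are all made up.

```python
import random
from collections import Counter
from statistics import mean, variance


def bootstrap_counts(assignments, rounds, seed=0):
    """Resample read assignments with replacement and count reads per transcript."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    n = len(assignments)
    results = []
    for _ in range(rounds):
        resample = rng.choices(assignments, k=n)  # sampling with replacement
        results.append(Counter(resample))
    return results


# Made-up example: 100 reads uniquely assigned to three transcripts.
assignments = ["tx1"] * 70 + ["tx2"] * 25 + ["tx3"] * 5
bootstrap_rounds = bootstrap_counts(assignments, rounds=30)

for tx in ("tx1", "tx2", "tx3"):
    counts = [r[tx] for r in bootstrap_rounds]
    print(tx, "mean:", round(mean(counts), 1), "variance:", round(variance(counts), 1))
```

Each round gives slightly different counts, and the spread of those counts across rounds is what the bootstrap variance measures.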

[Screenshot 5: Kallisto for RNA-seq quantification on InsideDNA Platform]

Finally, you will see the same output files (abundance.tsv, abundance.h5 and run_info.json). As before, the abundance.tsv file contains the transcript abundance estimates without bootstrapping, while all the data on the bootstrapping results is stored in the abundance.h5 binary file. To see the results of each bootstrapping round, we need to run another tool from the Kallisto package, which converts the h5 format into plain text:

  • create the output directory;
  • find the tool kallisto h5dump in the list of tools;
  • specify the task name (1), the input h5 file (2) and the output directory (3), and run it.
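In the terminal, this conversion is a single "kallisto h5dump" call with an output directory (-o) and the h5 file as input. The sketch below only assembles the command; the directory names out_boot and dump are hypothetical, and whether h5dump is available may depend on how your Kallisto version was built.

```python
import shlex


def kallisto_h5dump_cmd(h5_file, out_dir):
    """Build the argument list for `kallisto h5dump` (the command is not executed)."""
    return ["kallisto", "h5dump", "-o", str(out_dir), str(h5_file)]


# Hypothetical directory names, for illustration only.
h5dump_cmd = kallisto_h5dump_cmd("out_boot/abundance.h5", "dump")
print(shlex.join(h5dump_cmd))
```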

[Screenshot 6: Kallisto for RNA-seq quantification on InsideDNA Platform]

As a result, you will see several files in tsv format (bs_abundance_[n].tsv), each of which contains the abundance estimates from one bootstrapping round. These files have the same structure as the abundance.tsv file described above. From these results, one can calculate the mean and variance of the abundance estimate for each transcript across rounds and thus draw conclusions about the accuracy of the estimates.
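One possible way to summarize the bootstrap files is to collect, for every transcript, its estimate from each bs_abundance_[n].tsv file and compute the mean and the sample variance. The sketch below assumes the files sit in one directory and share the abundance.tsv column names; the function name and the "dump" directory in the example call are hypothetical.

```python
import csv
from pathlib import Path
from statistics import mean, variance


def summarize_bootstraps(dump_dir, column="est_counts"):
    """Return {target_id: (mean, variance)} across all bs_abundance_*.tsv files."""
    per_transcript = {}
    for path in sorted(Path(dump_dir).glob("bs_abundance_*.tsv")):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                per_transcript.setdefault(row["target_id"], []).append(float(row[column]))
    return {
        tx: (mean(values), variance(values) if len(values) > 1 else 0.0)
        for tx, values in per_transcript.items()
    }


# Example call ("dump" is an assumed h5dump output directory):
# for tx, (avg, var) in summarize_bootstraps("dump").items():
#     print(tx, avg, var)
```

Transcripts whose estimates vary strongly between rounds are the ones whose abundance is least certain.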
