In this tutorial we describe and run the tools of the Kallisto program on the InsideDNA platform. Kallisto quantifies RNA-Seq data about two orders of magnitude faster than other algorithms. You may read the paper that describes the technical aspects of the tool, or the manual for more details on running it. Here we show how to run it on the InsideDNA platform.
Kallisto achieves its high speed by avoiding the alignment of reads. Instead, it pseudoaligns reads to a reference transcriptome. Kallisto builds a de Bruijn graph from the k-mers present in the transcriptome, in which each transcript corresponds to a path. At the pseudoalignment step, k-mers are extracted from the reads and looked up in a hash table of transcriptome k-mers to find the paths of the transcriptome de Bruijn graph (that is, the transcripts) that contain these k-mers. This hash-based approach speeds up RNA-Seq processing by avoiding the time-consuming alignment step. Kallisto identifies the transcripts from which a read could have originated and does not try to find an exact alignment.
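The idea of pseudoalignment can be illustrated with a toy sketch. It is not Kallisto's implementation: a plain dictionary stands in for the transcriptome de Bruijn graph, every k-mer of the read is looked up, and a read with a k-mer missing from the index is simply reported as having no compatible transcripts. The function names and the example sequences are made up for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every substring of length k of seq, from left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcript ids that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the set of transcripts that contain every k-mer of the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer)
        if hits is None:
            # Simplification: a k-mer absent from the transcriptome
            # makes the whole read incompatible with every transcript.
            return set()
        compatible = set(hits) if compatible is None else compatible & hits
    # A read shorter than k has no k-mers and therefore no compatible transcripts.
    return compatible or set()


# Made-up transcripts that share a common prefix.
toy_transcripts = {"t1": "ACGTACGGA", "t2": "ACGTACCTT"}
toy_index = build_index(toy_transcripts, k=5)
```

With this index, `pseudoalign("ACGTAC", toy_index, 5)` returns both transcripts because the read lies in their shared prefix, while `pseudoalign("CGTACG", toy_index, 5)` returns only `t1`. No base-by-base alignment is computed in either case, only set intersections.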
It was shown that the accuracy of Kallisto is similar to that of existing RNA-Seq quantification tools, while its speed is about two orders of magnitude higher. Kallisto is also very simple to use, because only the k-mer size parameter needs to be specified. There is a trade-off associated with this parameter: the k-mer size must be large enough that random sequences do not match the transcriptome, and small enough to ensure robustness to sequencing errors.
In this section, we will run Kallisto on the test dataset. It includes two FASTQ files, with forward and reverse reads respectively, and a FASTA file with the reference transcriptome. In this tutorial we use only the InsideDNA interface to run the tools; however, all the steps can also be done in the terminal.
Before running Kallisto we should copy this dataset from the ~/tutorials_data/kallisto/ directory. To do this:
You may also do all these actions in the terminal (mkdir, cd, cp commands).
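For readers who prefer scripting to typing mkdir and cp by hand, the copy can be sketched in Python as follows. The helper name `copy_dataset` and the destination folder in the usage note are hypothetical; only the source directory comes from this tutorial.

```python
import shutil
from pathlib import Path


def copy_dataset(source_dir, target_dir):
    """Copy every regular file from source_dir into target_dir.

    Returns the sorted list of copied file names.
    """
    source = Path(source_dir).expanduser()
    target = Path(target_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)       # same effect as mkdir -p
    copied = []
    for item in sorted(source.iterdir()):
        if item.is_file():
            shutil.copy2(item, target / item.name)  # same effect as cp
            copied.append(item.name)
    return copied
```

A call such as `copy_dataset("~/tutorials_data/kallisto/", "~/kallisto_tutorial")` would place the test files in a working folder of your choice.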
Before pseudoaligning the reads we need to index the transcriptome file. At this step the transcriptome de Bruijn graph (T-DBG) is constructed. To do this we need to:
Then, to quantify our RNA-Seq data, we need to run the pseudoalignment step. To do this:
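In the terminal, indexing corresponds to a call to kallisto index. The sketch below only assembles the command line; the file names in the usage note are placeholders, and the check on k (odd, at most 31) reflects what kallisto's help text states for the builds we know of, so treat it as an assumption about your installed version.

```python
import subprocess


def kallisto_index_command(transcriptome_fasta, index_path, kmer_size=31):
    """Build the argument list for `kallisto index`."""
    # Assumption: k must be odd and no larger than 31.
    if kmer_size % 2 == 0 or not 1 <= kmer_size <= 31:
        raise ValueError("k-mer size must be odd and between 1 and 31")
    return [
        "kallisto", "index",
        "-i", str(index_path),        # where the index file is written
        "-k", str(kmer_size),         # k-mer size
        str(transcriptome_fasta),     # reference transcriptome in FASTA format
    ]


def run_command(command):
    """Run a command list and raise CalledProcessError if it fails."""
    subprocess.run(command, check=True)
```

For example, `run_command(kallisto_index_command("transcripts.fasta", "transcripts.idx"))` would build an index with the default k-mer size, provided kallisto is on the PATH.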
There are several additional arguments for kallisto quant that can be useful.
To see them, click the "Show Secondary Parameters" button. Use --single for single-end reads, --pseudobam to output pseudoalignments in SAM format, and --bootstrap-samples to set the number of bootstrap samples.
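The same parameters can be passed on the command line. The following sketch assembles a kallisto quant call from them; the function and its argument names are our own, the file names in the usage note are placeholders, and the requirement that --single be accompanied by a fragment length (-l) and standard deviation (-s) is taken from kallisto's help text rather than from the InsideDNA form.

```python
def kallisto_quant_command(index_path, output_dir, reads, bootstrap_samples=0,
                           single=False, fragment_length=None, fragment_sd=None,
                           pseudobam=False):
    """Build the argument list for `kallisto quant`."""
    reads = [str(r) for r in reads]
    command = ["kallisto", "quant", "-i", str(index_path), "-o", str(output_dir)]
    if bootstrap_samples > 0:
        command.append(f"--bootstrap-samples={bootstrap_samples}")
    if pseudobam:
        command.append("--pseudobam")
    if single:
        # Single-end mode cannot infer the fragment length from read pairs.
        if fragment_length is None or fragment_sd is None:
            raise ValueError("single-end mode needs fragment_length and fragment_sd")
        command += ["--single", "-l", str(fragment_length), "-s", str(fragment_sd)]
    elif len(reads) == 0 or len(reads) % 2 != 0:
        raise ValueError("paired-end mode needs an even, non-zero number of read files")
    return command + reads
```

For the paired-end test data, `kallisto_quant_command("transcripts.idx", "output", ["reads_1.fastq", "reads_2.fastq"])` yields a plain quantification run, and adding `bootstrap_samples=100` requests 100 bootstrap samples.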
Now we can look at the output files. There are three of them: abundance.h5 (an HDF5 binary file containing run info, abundance estimates, etc.), abundance.tsv (a plain-text file with transcript abundance estimates) and run_info.json (a JSON file containing information about the run). To read the abundance.tsv file, click the "Preview" button. You will see the list of transcript ids (1st column), their lengths (2nd column) and the abundance estimate for each transcript (4th column).
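Outside the "Preview" window, abundance.tsv can be read with a few lines of Python. This sketch assumes the header row uses the column names target_id, length, eff_length, est_counts and tpm, as kallisto writes them in the versions we know of; the helper names are our own.

```python
import csv


def read_abundance(handle):
    """Parse an open abundance.tsv stream into a list of per-transcript dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "target_id": row["target_id"],            # 1st column: transcript id
            "length": int(row["length"]),             # 2nd column: transcript length
            "est_counts": float(row["est_counts"]),   # 4th column: abundance estimate
        })
    return rows


def top_transcripts(rows, n=5):
    """Return the n transcripts with the highest estimated counts."""
    return sorted(rows, key=lambda r: r["est_counts"], reverse=True)[:n]
```

A typical use is `with open("abundance.tsv", newline="") as handle: rows = read_abundance(handle)`, followed by `top_transcripts(rows)` to list the most abundant transcripts.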
Since the pseudoalignment step is very fast, Kallisto makes it possible to quantify the uncertainty of the abundance estimates by bootstrapping the data. The reads are sampled with replacement from the dataset and the abundances are re-estimated for each resampled set. After several rounds of bootstrapping, the variance of the abundance estimates can be calculated. Low accuracy of pseudoalignment, or high redundancy among reads or transcripts, lowers the accuracy of the estimates and thus raises their variance. To run several rounds of bootstrapping during the pseudoalignment step, run kallisto quant in the same manner, but add one secondary parameter:
Finally, you will see the same output files (abundance.tsv, abundance.h5 and run_info.json). As before, abundance.tsv contains the transcript abundance estimates without bootstrapping, and all the bootstrap results are stored in the binary abundance.h5 file. To see the results of each bootstrap round we need to run another tool from the Kallisto package, which converts the h5 format into plain text:
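On the command line this conversion is done by kallisto h5dump, which takes an output directory (-o) and the abundance.h5 file. The sketch below only builds that command; the function name and the directory in the usage note are placeholders, and whether h5dump is available depends on your kallisto build having HDF5 support.

```python
def kallisto_h5dump_command(h5_file, output_dir):
    """Build the argument list for `kallisto h5dump`."""
    return [
        "kallisto", "h5dump",
        "-o", str(output_dir),   # directory that receives the plain-text files
        str(h5_file),            # abundance.h5 produced by kallisto quant
    ]
```

For example, `kallisto_h5dump_command("output/abundance.h5", "output_dump")` produces a list that can be passed to `subprocess.run(..., check=True)`.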
As a result, you will see several files in tsv format (bs_abundance_[n].tsv), each corresponding to the abundance estimates of one bootstrap round. These files have the same structure as abundance.tsv, described above. From these results one can estimate the mean and variance of the abundance estimates and thus draw conclusions about their accuracy.
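Computing the mean and variance over the bootstrap rounds can be sketched as follows. The code assumes that every bs_abundance_[n].tsv file has the same header as abundance.tsv (with target_id and est_counts columns) and lists the same transcripts; the function names are our own, and the sample variance from Python's statistics module is one reasonable choice among several.

```python
import csv
import statistics
from pathlib import Path


def load_counts(path):
    """Read one bootstrap file into a {target_id: est_counts} dict."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["target_id"]: float(row["est_counts"]) for row in reader}


def load_bootstrap_rounds(directory):
    """Load every bs_abundance_*.tsv file found in directory."""
    paths = sorted(Path(directory).glob("bs_abundance_*.tsv"))
    return [load_counts(path) for path in paths]


def bootstrap_summary(rounds):
    """Return {target_id: (mean, sample variance)} across bootstrap rounds."""
    if len(rounds) < 2:
        raise ValueError("at least two bootstrap rounds are needed for a variance")
    summary = {}
    for target in rounds[0]:
        values = [counts[target] for counts in rounds]
        summary[target] = (statistics.mean(values), statistics.variance(values))
    return summary
```

Calling `bootstrap_summary(load_bootstrap_rounds("output_dump"))` on the converted files gives one (mean, variance) pair per transcript. Transcripts with a large variance relative to their mean are the ones whose estimates deserve the most caution.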