HISAT2 is a fast alignment program for mapping next-generation sequencing reads (both DNA and RNA). In one of our tutorials we described how to use TopHat mapper. In this tutorial we will show how to use HISAT2 for RNA-Seq reads mapping. We decided to describe alternative alignment tool because HISAT2 is faster, more computationally efficient and has some interesting features, such as draft SNVs detection in the reads. Let’s try it!
First of all, we need to open terminal window on the InsideDNA platform (TERMINAL button at the top of InsideDNA homepage).
Type several commands in the command line:
to create working directory
to move into working directory
Now we need to download several files for further analysis. It is easy to download them with wget command in the command line. However this will not give us any information about databases structure, so we will discuss how to make it step by step.
Human reference genome (hg38 assembly) can be downloaded from UCSC genome browser (https://genome.ucsc.edu/). On the home page you need to click on Downloads ->Genome Data button.
You will see a list of different species genomes and associated data. On this page you need to find Human reference genome (hg38, GRCh38) and click on Data set by chromosome button.
There you will see list of reference FASTA files. We will download only one chromosome of reference genome (chr10.fa.gz file). This will make our further computational steps faster.
Also we need to download file which contains common SNPs observed in human population. Go one step back and then click on Annotation dataset.
From the list of annotation files choose snp.147.txt.gz (contains dbSNP build 147 data).
Third file that we need is Human RNA-Seq data file which contains raw reads without coordinates. We will use the data from ENCODE Consortium project page of UCSC (https://genome.ucsc.edu/ENCODE/). The aim of ENCODE project was to identify functional elements in the human genome, so one may find the data on gene expression, regulation, DNA modification, chromatin structure etc. on this page.
You need to click on Downloads button and then choose one of RNA-Seq data links (each link corresponds to the place where this experiment was made) in the list.
Each file from the following list (see below) corresponds to experiment on specific cell line. From this list we need to choose one file in FASTQ format (for example, wgEncodeCaltechRnaSeqGm12878R1x75dFastqRep1.fastq.gz file).
All of these files must be uploaded in hisat2 directory which we created earlier (FILES ->Upload files)
We need to extract text files from archive GZ files so run “gunzip[filename]” commands for all files that we uploaded except for snp147.txt.gz. We will treat snp147.txt.gz in a different manner because this file is the heaviest one and it takes quite a long time to extract snp147.txt file from the archive. Moreover, we don’t need most of the known polymorphisms for further analysis because we use only one reference chromosome. We will use zrep command to search for string pattern in compressed snp147.txt file. With the following command we will extract all SNPs which belong to 10th chromosome:
zgrep$'chr10\t' snp147.txt.gz > /data/userXXX/hisat2/chr10.snps
where$'chr10\t' defines our search pattern (‘chr10’ + TAB); chr10.snps - output file and userXXX- your user name.
On the next step we need to change variant file format to make it acceptable for hisat2-build indexer:
isub -t snps -c 16 -r 60 -e '/srv/dna_tools/hisat2-2.0.4/hisat2_extract_snps_haplotypes_UCSC.py
/data/userXXX/hisat2/chr10.fa /data/userXXX/hisat2/chr10.snps /data/userXXX/hisat2/snps_list'
where snps_listargument specifies the name of output file (snps_list.snp) where common variants are written in the new format; chr10.snps – input file with common variants; chr10.fa – reference genome file. To see the format of output variant file type “less snps_list.snp”.
First column represents SNP identifier (NCBI ID), second column – type of variant (single nucleotide polymorphism, insertion or deletion), third column – chromosome, fourth column– positon, last column – alternative form. If one wants to obtain such variant file from VCF file – use another script hisat2_extract_snps_haplotypes_VCF.py with the same parameters.
Actually, the file with known SNPs (snps_list.snp) is optional for HISAT2; however we will use it in our pipeline to show one interesting feature of HISAT2 aligner. Prior to read alignment we need to create an index files for our reference genome. hisat2-build command is used for that purpose. It takes not only reference FASTA file as an input, but also utilizes information about known SNPs to build HISAT2 index.
isub -t his_build -c 16 -r 60 -e '/srv/dna_tools/hisat2-2.0.4/hisat2-build --
where--snp parameter specifies input variant file (snps_list.snp); chr10.fa – reference FASTA file; refer - basename of the index files to write. After long step of HISAT2 index building you will see 8 files refer.n.ht2 where n is a number from 1 to 8 in the working directory (“ls” command). Now we can run hisat2 aligner itself:
isub -t hisat -c 16 -r 60 -e '/srv/dna_tools/hisat2-2.0.4/hisat2 –x /data/user540/hisat2/refer-U
where–x refer parameter specifies basename of HISAT2 index; -S alignment.sam specifies output alignment file (by default output alignment is written in SAM format) -U specifies input FASTAQ file with unpaired reads. For those who aligns paired-end reads -1 and -2 arguments are used for FASTAQ files with first and second mate reads respectively.
Eventually we have an alignment file (alignment.sam) with reads coordinates on the reference genome to read this file type “less alignment.sam”. The structure of SAM files was described in one of our tutorials (https://insidedna.me/tutorials/view/samtools-commands-tutorial-working-sam-bam-files) and in the SAM format specification section of SAM tools site (https://samtools.github.io/hts-specs/SAMv1.pdf) so we will skip this format description.
Finally, I would like to mention that for some reads last column of our alignment file contains string of the following structure: “Zs:Z:
HISAT2 uses information about known variants and outputs all possible known SNPs which were observed in read sequences. It doesn’t necessarily mean that this SNV is present in our sample (variant calling algorithms make more accurate predictions), however, this feature of HISAT2 gives us a draft picture of observed variants. For example, the string "Zs:Z:1|S|rs3747203,97|S|rs16990981”indicates that second base of the read corresponds to a known SNP (ID: rs3747203). 97 bases after the third base (the base after the second one), the read at 100th base involves another known SNP (ID: rs16990981). 'S' indicates a single nucleotide polymorphism. 'D' and 'I' indicate a deletion and an insertion, respectively.
Follow us on Facebook and Twitter to be the first to read our new tutorials!Run this tool More tutorials